
PCA remains the swiss-army-knife to explore population structure


Yesterday I put up a poll, without context, to gauge which methods people prefer for analyzing population genetic structure.* PCA came out on top with a plurality. More explicitly model-based methods, such as Structure/Admixture, came in right behind it. Curiously, the oldest method, pairwise Fst comparisons (greater Fst means more of the variance is partitioned between the groups), and the newest, Treemix, drew fewer adherents.
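To make the parenthetical about Fst concrete, here is a minimal sketch of the textbook heterozygosity form, Fst = (H_T - H_S) / H_T, for a single biallelic locus in two equally weighted populations. The function name `fst_two_pops` is made up for illustration, and the allele frequencies are assumed to be known exactly; estimators used on real data correct for sample size, which this does not.

```python
def fst_two_pops(p1, p2):
    """Fst at one biallelic locus for two equally weighted populations.

    p1, p2: frequency of the same allele in each population (0..1).
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within each population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        # Locus is fixed for the same allele everywhere: no variance to partition.
        return 0.0
    return (h_total - h_within) / h_total


same = fst_two_pops(0.5, 0.5)      # identical frequencies
modest = fst_two_pops(0.4, 0.6)    # modest divergence
fixed = fst_two_pops(0.0, 1.0)     # fixed for different alleles
```

Identical frequencies give 0, populations fixed for different alleles give 1, and everything else falls in between.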

Why is PCA so popular? Unlike Treemix or pairwise Fst, you don’t have to label populations ahead of time. You just put the variation in, and the individuals shake out by themselves. Pairwise Fst and Treemix both require you to stipulate a priori which population each individual belongs to, which means you often end up running PCA or some other method as a pre-analysis stage. Model-based methods such as Structure/Admixture make you select the number of distinct populations you want to explore, and often assume an underlying model of pulse admixture between populations (Treemix does this too when you add an admixture edge).

PCA is also better at smoking out structure than Structure/Admixture for the same number of markers, and it’s pretty fast as well. This is why the first thing I do when I get population genetic data whose structure I want to explore is run a PCA and look for clusters and outliers. After this pre-analysis stage, I can move on to other methods.
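As a rough illustration of that first pass, here is a minimal sketch of PCA on a genotype matrix using NumPy. Everything in it is an assumption made for the example: the function name `genotype_pca`, the simulated two-population data, and the choice to scale each SNP by its binomial standard deviation. It is not the pipeline used for any real analysis; on real data you would more likely reach for a dedicated tool.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Return PC scores for an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    # Drop SNPs with no variation; they carry no information about structure.
    g = g[:, g.std(axis=0) > 0]
    # Center each SNP on its mean and scale by the binomial standard deviation.
    p = g.mean(axis=0) / 2.0
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # SVD of the standardized matrix; u * s gives each individual's PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulated data (hypothetical): two populations whose allele frequencies have drifted apart.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 20, 2000
p_a = rng.uniform(0.1, 0.9, n_snps)
p_b = np.clip(p_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, p_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, p_b, size=(n_per_pop, n_snps)),
])

# No population labels go into the PCA; they are used only to check the result.
pcs = genotype_pca(geno)
pc1_a, pc1_b = pcs[:n_per_pop, 0], pcs[n_per_pop:, 0]
separated = bool(pc1_a.max() < pc1_b.min() or pc1_b.max() < pc1_a.min())
```

Plotting `pcs[:, 0]` against `pcs[:, 1]` is the usual way to eyeball clusters and outliers before moving on to methods that need population labels.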


* I stipulated “genotype-based” methods to set aside some of the new-fangled techniques that often assume phasing and analysis of haplotypes, such as Chromopainter or explicit local ancestry deconvolution (some local ancestry deconvolution methods do not require phased haplotypes, but the most popular ones do).
