How to look at population structure

A friend asked me about population structure, and methods to ferret it out and classify it. So here is a quick survey on the major methods I’m familiar with/utilize now and then. I’ll go roughly in chronological order.

First, you have trees. These are pretty popular for macroevolutionary relationships, but on the population genetic scale (intraspecific, microevolutionary) you’re mostly talking about representing distances between groups in a tree format. You saw this in History and Geography of Genes, where genetic distances in the form of Fst values (the proportion of genetic variation attributable to differences between two groups) were used as distance inputs.
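To make the distance-to-tree idea concrete, here is a minimal sketch. It assumes you already have allele frequencies per population, uses a Hudson-style Fst without the finite-sample correction, and joins populations with plain average-linkage (UPGMA) clustering. The frequencies and the function names `hudson_fst` and `upgma` are invented for illustration, and this is not the procedure from any particular book or package.

```python
from itertools import combinations


def hudson_fst(p1, p2):
    """Hudson-style Fst (ratio of sums) from two lists of allele frequencies.

    Simplification: the finite-sample-size correction is left out.
    """
    num = sum((a - b) ** 2 for a, b in zip(p1, p2))
    den = sum(a * (1 - b) + b * (1 - a) for a, b in zip(p1, p2))
    return num / den


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Distance to the new cluster is the leaf-weighted average.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Hypothetical allele frequencies at four SNPs in three populations.
freqs = {
    "A": [0.10, 0.20, 0.80, 0.50],
    "B": [0.15, 0.25, 0.75, 0.50],
    "C": [0.70, 0.60, 0.20, 0.90],
}
dist = {
    (x, y): hudson_fst(freqs[x], freqs[y]) for x, y in combinations(freqs, 2)
}
tree = upgma(list(freqs), dist)
print(tree)
```

A and B have nearly identical frequencies, so they are joined first and C attaches to that pair. Note that nothing in the procedure can express C having exchanged genes with B after the split.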

A problem with trees is that they don’t model gene flow, a major dynamic on a microevolutionary scale. Also, complex relationships can get elided in tree frameworks, and as you add more and more populations you often end up with an incomprehensible fan-like topology.

Then you have principal component analysis (PCA) and related methods (e.g., multidimensional scaling, which is very different in the sausage-making but generates a similar output). Like trees, this is a visualization of the variation, in this case on a two dimensional plot (please don’t bring up three dimensional PCA, there’s no such thing until holograms show up).

The problem with PCA is that different types of dynamics can lead to the same result. For example, someone who is an F1 of two distinct groups occupies the same position as a member of a population which happens to occupy a genetic position between the two groups. Additionally, by constraining the variation into two dimensions, the plot can mislead about relationships. There are many dimensions, but operationally you focus on two at a time.

A paper of interest: Population Structure and Eigenanalysis.

Next you have model-based clustering introduced in Jonathan Pritchard’s Inference of Population Structure Using Multilocus Genotype Data. There are many flavors of this, but they operate under the same framework. You have a model of population dynamics, and see how the genotype data can be explained by parameters of the model. Of particular interest is assignment to one of K populations, which can be combined to explain the variation in the data.
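As a rough illustration of what "explaining genotypes with model parameters" means, here is a sketch for K = 2. It assumes the allele frequencies of the two ancestral populations are already known, which real programs usually have to estimate jointly with everything else, and it finds a single individual's ancestry fraction `q` by brute-force grid search over the binomial likelihood. The function names and numbers are hypothetical. Actual software uses far more capable inference (MCMC or numerical optimization).

```python
import math


def log_likelihood(q, genotypes, freq_1, freq_2):
    """Log-likelihood of genotypes (0/1/2) given ancestry fraction q from pop 1.

    Each allele copy is drawn from pop 1 with probability q, so the
    individual's effective allele frequency at a SNP is q*p1 + (1-q)*p2.
    The constant binomial coefficient is dropped.
    """
    total = 0.0
    for g, p1, p2 in zip(genotypes, freq_1, freq_2):
        p = q * p1 + (1 - q) * p2
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total


def estimate_q(genotypes, freq_1, freq_2, steps=100):
    """Grid-search maximum-likelihood estimate of q on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freq_1, freq_2))


# Hypothetical ancestral allele frequencies at four SNPs.
freq_1 = [0.9, 0.8, 0.9, 0.8]
freq_2 = [0.1, 0.2, 0.1, 0.2]

unmixed = [2, 2, 2, 2]  # carries the pop-1 allele everywhere
mixed = [1, 1, 1, 1]    # heterozygous everywhere

q_unmixed = estimate_q(unmixed, freq_1, freq_2)
q_mixed = estimate_q(mixed, freq_1, freq_2)
print(q_unmixed, q_mixed)
```

The bar plots these methods are known for are just the estimated `q` values for each individual, stacked side by side.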

Unlike PCA, these model-based methods are rather good at identifying people who are first generation mixes, as opposed to those from stabilized groups along a cline. But they also produce artifacts, because they are quite sensitive to the input data, and they lend themselves to cherry-picking.

Earlier I said that one problem with the tree methods is that they don’t model gene flow. Joe Pickrell’s TreeMix does so. Like the original tree methods, and unlike PCA or unsupervised model-based clustering, you specify a set of populations. Then you compare the populations in terms of their genetic distance, and fit them to a tree, but add migration parameters to that tree where the fit between the tree and the data is the most tenuous.
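The "add migration where the fit is worst" step can be sketched with residuals. The numbers below are invented, not output from a real run, and the function names are hypothetical. The sketch assumes you have an observed between-population covariance for each pair and the covariance implied by the best-fitting tree. A large positive residual means two populations look more closely related in the data than the tree allows, which makes that pair a candidate for a migration edge. The real program does a likelihood-based search over where to place such edges, and none of that is reproduced here.

```python
# Made-up observed covariances in allele frequencies between population pairs.
observed = {
    ("A", "B"): 0.30,
    ("A", "C"): 0.03,
    ("A", "D"): 0.02,
    ("B", "C"): 0.10,
    ("B", "D"): 0.01,
    ("C", "D"): 0.05,
}

# Made-up covariances implied by a fitted tree of the form ((A, B), (C, D)).
fitted = {
    ("A", "B"): 0.30,
    ("A", "C"): 0.02,
    ("A", "D"): 0.02,
    ("B", "C"): 0.02,
    ("B", "D"): 0.02,
    ("C", "D"): 0.05,
}


def residuals(observed, fitted):
    """Observed minus tree-implied covariance for every population pair."""
    return {pair: observed[pair] - fitted[pair] for pair in observed}


def worst_fit_pair(observed, fitted):
    """Pair whose relatedness the tree underestimates the most."""
    res = residuals(observed, fitted)
    return max(res, key=res.get)


candidate = worst_fit_pair(observed, fitted)
print(candidate, residuals(observed, fitted)[candidate])
```

Here B and C share more than a clean ((A, B), (C, D)) tree predicts, so that is where you would first try a migration edge.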

All visualizations are deformations of reality. TreeMix attempts to mitigate this somewhat by introducing another representation, that of migration.

Next we have local ancestry methods. By local ancestry, basically we mean methods which can assign ancestry to particular regions of the genome. While tree methods measure differences across pooled populations, PCA and model-based methods compare genotypes between individuals (this is a simplification, but bear with me). Local ancestry methods, like RFMix, compare regions of the genome with each other.
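To show what "assigning ancestry to regions" looks like at its crudest, here is a toy window classifier. It assumes a phased haplotype coded as 0/1 alleles and known reference allele frequencies for two source populations, and it labels each fixed-size window with whichever source gives the higher likelihood. This is not how RFMix works internally. It is only a sketch of the kind of output these methods produce, and all names and numbers are hypothetical.

```python
import math


def window_log_likelihood(alleles, freqs):
    """Log-probability of a run of 0/1 alleles given reference frequencies."""
    return sum(math.log(p if a == 1 else 1 - p) for a, p in zip(alleles, freqs))


def paint(haplotype, freq_a, freq_b, window=4):
    """Label each non-overlapping window of the haplotype as 'A' or 'B'."""
    labels = []
    for start in range(0, len(haplotype), window):
        end = start + window
        ll_a = window_log_likelihood(haplotype[start:end], freq_a[start:end])
        ll_b = window_log_likelihood(haplotype[start:end], freq_b[start:end])
        labels.append("A" if ll_a >= ll_b else "B")
    return labels


# Hypothetical reference allele frequencies at eight consecutive SNPs.
freq_a = [0.9, 0.8, 0.9, 0.7, 0.9, 0.8, 0.9, 0.7]
freq_b = [0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.2, 0.1]

# One phased haplotype from an admixed individual.
haplotype = [1, 1, 0, 1, 0, 0, 1, 0]

segments = paint(haplotype, freq_a, freq_b)
print(segments)
```

The first window looks like population A and the second like population B, even though each contains one allele that fits the other source better. Real methods smooth across windows and account for recombination rather than treating each window independently.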

Related to, but not exactly the same as, local ancestry methods are haplotype based methods. In particular, I’m thinking of FineStructure and its related methods. These leverage variation across the genome in terms of haplotypes, rather than just looking at genotypes. They also tend to benefit from phasing, for obvious reasons. FineStructure and its relatives tend to need more marker density than model-based methods, which require more marker density than PCA, which requires more marker density than tree based methods. These haplotype based methods allow for correction of and accounting for forces such as genetic drift, which tend to skew results in other methods.

Finally, there is the AdmixTools framework, which is good for testing very explicit hypotheses. While many of the above methods, such as TreeMix and unsupervised model-based clustering, explore an almost open-ended space of structure possibilities, the methods in AdmixTools exist in large part to test narrowly delimited models. This goes to the fact that many of these methods are complementary, and you should use them together to arrive at a robust result. For example, if you are assigning populations for TreeMix, you should use PCA and model-based clustering to make sure that the populations are clear and distinct, and outliers are removed.
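One example of a narrow, explicit test of this kind is the three-population f3 statistic: is a target population a mixture of two sources? Below is a bare-bones sketch from population allele frequencies. It leaves out the finite-sample bias correction and the standard error estimation that a real analysis needs, and the function name and frequencies are hypothetical. This is not the AdmixTools implementation or its interface.

```python
def f3(target, source_1, source_2):
    """Plain f3(target; source_1, source_2) from population allele frequencies.

    Averages (c - a) * (c - b) over SNPs. A clearly negative value is
    evidence that the target is admixed between groups related to the sources.
    """
    terms = [(c - a) * (c - b) for c, a, b in zip(target, source_1, source_2)]
    return sum(terms) / len(terms)


# Hypothetical allele frequencies at five SNPs.
freq_a = [0.9, 0.8, 0.1, 0.7, 0.2]
freq_b = [0.1, 0.3, 0.9, 0.2, 0.8]

# A population formed as a 50/50 mix of A and B, ignoring later drift.
freq_mix = [0.5 * a + 0.5 * b for a, b in zip(freq_a, freq_b)]

f3_mix = f3(freq_mix, freq_a, freq_b)    # is the mix a blend of A and B?
f3_a = f3(freq_a, freq_b, freq_mix)      # is A a blend of B and the mix?
print(f3_mix, f3_a)
```

The statistic comes out negative only when the admixed population is placed in the target slot, which is what makes it a test of one specific hypothesis rather than an exploration.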

There’s a lot I left out, but many of the other methods are just twists on the ones above.
