A problem of aggregation of information

In a post below I regenerated the HGDP PCA plot you’ve probably seen around, except that I added my parents (and a few HapMap populations) into the plot. The PCA below was basically a visualization of the two largest independent dimensions of genetic variance in the data set. It wasn’t to scale, as the vertical African vs. non-African dimension is somewhat greater than the horizontal west vs. east dimension in magnitude. But, I argued that the positioning of my parents was deceptive as to their heritage. In the comments John Emerson offers a hypothesis to salvage the possibility that the PCA is telling us something informative about my parents’ relationship to the Uyghurs:

Was there ever an endogenous Mughal group in South Asia? If both your parents distantly came from that group, even though assimilated to the local populations for a few generations, the Uighur connection would be unremarkable.

Combining the plot I generated with the historical information, this is an eminently plausible model. But we need to consider what the PCA is showing us. The position of my parents is a reflection of their average genetic variance in relation to two reference points. It doesn’t necessarily tell us about the constituents of that variation. By analogy, consider that the average of 2 and 4 is 3. But the average of 1 and 5 is also 3. This is basically what’s going on with my parents’ position on the plot. They, and the Uyghurs, converged on the same position through different routes.
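To make the averaging analogy concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than real data: the three source labels, their positions on a single west-east axis, and the ancestry fractions of the two hypothetical individuals. The only point is that two different mixtures can land on the same coordinate.

```python
# Hypothetical positions of three source populations on one west-east axis.
# Both East Asian sources sit at the same eastern end, so this axis alone
# cannot tell them apart.
AXIS_POSITION = {
    "west_eurasian": 0.0,
    "north_east_asian": 1.0,
    "south_east_asian": 1.0,
}

def project(ancestry):
    """Position on the axis as the ancestry-weighted average of source positions."""
    assert abs(sum(ancestry.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(AXIS_POSITION[source] * fraction for source, fraction in ancestry.items())

# Two made-up individuals with the same total eastern ancestry but different sources.
person_a = {"west_eurasian": 0.5, "north_east_asian": 0.25, "south_east_asian": 0.25}
person_b = {"west_eurasian": 0.5, "north_east_asian": 0.0, "south_east_asian": 0.5}

pos_a = project(person_a)
pos_b = project(person_b)
print(pos_a, pos_b)
```

Both individuals project to the same point even though their eastern ancestry comes from different places, which is the sense in which a single plotted position hides its constituents.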

But let’s dig deeper into the data. First, I generated a PCA plot with just Eurasian populations. That means that the largest dimension, which separates Africans from non-Africans, no longer exists with this data set. So you have a free dimension to work with. Here’s what we get:

Please note that the magnitude of the horizontal dimension is 4.5 times that of the vertical dimension. The plot is not to scale. With that out of the way, the clustering of my parents with the Uyghurs now disappears when we add what is obviously a north-south Eurasian component to the variation. This dimension always existed, but it would not have been PC 1 or PC 2 with Africans in the sample, so we didn’t see it. My parents are clearly still outside of the main axis of South Asian variation, though to some extent I think this is just sparse sampling (ergo, HAP).
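The way a hidden dimension surfaces once a dominant group is dropped can be illustrated with a toy example. The sketch below is not the analysis I ran: the group centers, sample sizes, and the three coordinate axes are invented, and the PCA is a bare-bones power iteration on the covariance matrix rather than a real genotype pipeline. It simply reports which input axis dominates each of the top two components, with and without the outgroup.

```python
import random

rng = random.Random(0)  # fixed seed so the toy data is reproducible

def make_group(center, n, noise=0.1):
    """n samples scattered around a hypothetical population center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]

# Toy axes: 0 = outgroup vs. rest, 1 = west-east, 2 = north-south.
outgroup = make_group([10.0, 0.0, 0.0], 20)
west = make_group([0.0, -3.0, 0.0], 40)
north_east = make_group([0.0, 3.0, 1.0], 20)
south_east = make_group([0.0, 3.0, -1.0], 20)

def covariance(samples):
    """Covariance matrix of the samples (rows are individuals)."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / n
             for j in range(d)] for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iterations=200):
    """Power iteration: leading eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def dominant_axes(samples, k=2):
    """For each of the top k components, the input axis with the largest loading."""
    m = covariance(samples)
    d = len(m)
    axes = []
    for _ in range(k):
        eigenvalue, v = top_eigenpair(m)
        axes.append(max(range(d), key=lambda i: abs(v[i])))
        # Deflate: subtract the component just found so the next one can surface.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return axes

with_outgroup = dominant_axes(outgroup + west + north_east + south_east)
without_outgroup = dominant_axes(west + north_east + south_east)
print(with_outgroup, without_outgroup)
```

In this toy setup the north-south axis carries too little variance to make the top two components while the outgroup is present, and it moves up to the second component once the outgroup is removed.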

PCA is probably not the best way to illustrate the issue at the end of the day. So I ran the HGDP data set, with Africans excluded, at K = 10 (10 putative ancestral populations in our model). I added my parents, Gujaratis, and Tuscans, along with a few friends who sent me their 23andMe data. I pruned the SNPs down to 55,000. I think this was probably a little on the low side, and the color coding on the plot is atrocious. But I want to sleep now, so let’s work with this. Focus on the Uyghurs, Hazaras, and my parents. Compare the color proportions:

There are two distinctive East Asian components: a garish green one overrepresented among southern Chinese groups, and an orange one modal among the Siberians. The balance between these two components seems to follow a north-south gradient. Notice that the Uyghurs and Hazaras both exhibit parity between these two components. This makes sense considering the partial Mongolian origins of both these groups (the Turks originate in western Mongolia and southern Siberia). On the other hand, my parents are strongly biased toward the garish green “southern” East Asian component. This suggests their eastern element derives from a group related to those of southwest China or Burma.
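One rough way to put a number on "parity" versus "bias" is the northern component's share of total East Asian ancestry. The sketch below assumes that framing; the component names and the proportions are placeholders I made up for illustration, not values read off the plot.

```python
def northern_share(components):
    """Fraction of total East Asian ancestry carried by the northern (orange) component.

    `components` maps the hypothetical labels "siberian" and "southern_chinese"
    to ancestry proportions taken from a K-cluster run.
    """
    north = components["siberian"]
    south = components["southern_chinese"]
    total = north + south
    if total == 0:
        raise ValueError("no East Asian ancestry to apportion")
    return north / total

# Placeholder proportions, NOT real results.
parity_example = {"siberian": 0.25, "southern_chinese": 0.25}
southern_biased_example = {"siberian": 0.02, "southern_chinese": 0.14}

parity_share = northern_share(parity_example)
biased_share = northern_share(southern_biased_example)
print(parity_share, biased_share)
```

A share near one half corresponds to the balanced pattern described for the Uyghurs and Hazaras, while a share well below one half corresponds to a southern-biased pattern like my parents’.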

As they say, I think “case closed.” At least when it comes to the genetics. Culturally the matter may be different. My trivial trace of Turanian ancestry shaped the self-conception of my ancestors for generations. In contrast, the substantial Southeast Asian component has literally been forgotten.
