Visualizing variation, input → output

I have noted a few times that one thing you have to be careful about in two-dimensional plots of genetic variance is that the dimensions onto which the data are projected are often generated from the data itself. So adding more data can change the spatial relationships of previous data points. In 23andMe’s global similarity advanced plot, by contrast, you are projected onto dimensions generated from the HGDP data set. There are some practical reasons for this. First, it’s computationally intensive to recalculate components of variance every time someone is added to the data set. Second, it isn’t as if the ethnic identity of any given individual is validated. What would you do if an alien sent in a kit and spuriously put “French” as their ancestry?
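To make the distinction concrete, here is a minimal sketch in Python with NumPy. It uses synthetic genotypes, not real HGDP or 23andMe data, and the function names (`fit_axes`, `project`), the two made-up populations, and the marker count are all assumptions for illustration. It is plain PCA via SVD, not a description of how 23andMe actually builds its plot.

```python
import numpy as np

def fit_axes(panel, k=2):
    """Return the mean and top-k principal axes of a genotype panel."""
    mean = panel.mean(axis=0)
    # Rows of vt are the principal axes, ordered by variance explained.
    _, _, vt = np.linalg.svd(panel - mean, full_matrices=False)
    return mean, vt[:k]

def project(samples, mean, axes):
    """Place samples on axes that were computed elsewhere."""
    return (samples - mean) @ axes.T

rng = np.random.default_rng(0)
n_markers = 200  # hypothetical marker count

# Two synthetic reference populations with somewhat different allele frequencies.
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_markers), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, n_markers)).astype(float)
pop_b = rng.binomial(2, freq_b, size=(30, n_markers)).astype(float)
reference = np.vstack([pop_a, pop_b])

# A new individual, drawn here as an even mix of the two populations.
customer = rng.binomial(2, (freq_a + freq_b) / 2, size=(1, n_markers)).astype(float)

# Option 1: fixed axes. The reference panel defines the plot and the
# newcomer is simply dropped onto it, so nobody else moves.
mean, axes = fit_axes(reference)
coords_fixed = project(customer, mean, axes)

# Option 2: recompute the axes with the newcomer included. The reference
# individuals' coordinates now shift (and an axis may even flip sign).
mean2, axes2 = fit_axes(np.vstack([reference, customer]))
coords_before = project(reference, mean, axes)
coords_after = project(reference, mean2, axes2)

print("newcomer on fixed axes:", coords_fixed.round(2))
print("largest shift in a reference coordinate after refitting:",
      np.abs(coords_before - coords_after).max().round(3))
```

With fixed axes the cost of adding a person is a single matrix product, and a mislabeled kit cannot distort anyone else’s position; the price is that the newcomer is only ever described in terms of variation that already exists in the reference panel.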

So, in reply to this comment: “Let me rephrase: is there any difference when you switch to the world-wide plot? I imagine not, or you would’ve mentioned it.” Actually, there is a slight difference. Below on the right you have a “world view,” with my position marked in green, and on the left a “zoom in” for Central/South Asia in the HGDP data set.

Because of the “busyness” of the plot it is hard to see the difference. But when I wasn’t “sharing” genes with people this is what you saw:

1) There is a definite gap between a Central Asian Hazara/Uyghur cluster and a South Asian one which consists of the Pakistani groups.

2) In the Central/South Asia zoom I’m in the gap between the two clusters, about 1/3 of the way toward the Central Asian cluster away from the South Asian cluster (the next closest individual shifted in that direction who isn’t a family member is Bangladeshi).

3) In contrast, in the world view I’m on the edge of the Central Asian cluster, toward the South Asian one, but definitely separated by a clean gap from it.

You can see some generalized differences between the two plots. The Central/South Asia view has a major linear cluster, with the Kalash a distinctive outgroup. In the world view this is not so; rather, you have a group of Pakistanis with non-trivial African admixture shifted in that direction (mostly Makrani, but one of the Sindhis in the HGDP data set seems to be a brownlatto!). Since there isn’t much African variance in the South Asian zoom aside from what the admixed individuals bring to the table, naturally it doesn’t shake out as one of the two top dimensions.

So what’s going on with me? I don’t have a good hypothesis, but I suspect that my likely Southeast Asian ancestry shifted me further toward the Asian cluster in the world view. There are some groups very closely related to the Burmese in the HGDP (e.g., Naxi) which are in the world view and, naturally, not in the Central/South Asia zoom. When you break ancestry into “European” and “Asian” components, the Hazara/Uyghur cluster is an OK substitute (both are hybrids, with “European” and “Asian” ancestry in about equal proportions), but this is only a first approximation. These two groups have more “northern” Asian ancestry, while mine is more “southern.” Because of their inclusion in the Central/South Asia cluster, the west-east dimension in Eurasia is constructed from more northern East Asian populations, which might underestimate my East Asian element.
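The general point, that the same individual can land in a different relative position depending on which populations define the axes, can be sketched with a toy simulation in Python with NumPy. Everything here is hypothetical: the three source frequency vectors (`west`, `north`, `south`), the 80/20 admixture of the test individual, the panel sizes, and the helper names are assumptions chosen for illustration, not estimates from HGDP or from my own results. The sketch only shows how one would compare placements across two panels; it makes no claim about which way the shift goes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_markers = 500  # hypothetical marker count

def genotypes(freq, n):
    """Draw n diploid genotypes (0/1/2 allele counts) from allele frequencies."""
    return rng.binomial(2, freq, size=(n, len(freq))).astype(float)

def fit_axes(panel, k=2):
    """Return the mean and top-k principal axes of a panel (PCA via SVD)."""
    mean = panel.mean(axis=0)
    _, _, vt = np.linalg.svd(panel - mean, full_matrices=False)
    return mean, vt[:k]

def project(samples, mean, axes):
    return (samples - mean) @ axes.T

# Made-up source frequencies: a western source plus two related eastern
# sources ("north" and "south") that share a common eastern base.
west = rng.uniform(0.1, 0.9, n_markers)
east_base = np.clip(west + rng.normal(0, 0.25, n_markers), 0.05, 0.95)
north = np.clip(east_base + rng.normal(0, 0.08, n_markers), 0.05, 0.95)
south = np.clip(east_base + rng.normal(0, 0.08, n_markers), 0.05, 0.95)

# "Zoom" panel: a western-end cluster and a west/north hybrid cluster.
west_cluster = genotypes(west, 25)
hybrid_cluster = genotypes(0.5 * west + 0.5 * north, 25)
zoom_panel = np.vstack([west_cluster, hybrid_cluster])

# "World" panel: the zoom panel plus unadmixed northern and southern groups.
world_panel = np.vstack([zoom_panel, genotypes(north, 25), genotypes(south, 25)])

# Test individual: mostly western, with some *southern* eastern ancestry.
me = genotypes(0.8 * west + 0.2 * south, 1)

def fraction_toward_hybrid(panel):
    """Position of `me` along the line between the two cluster centroids,
    measured in the top-2 PC plane of the given panel (0 = western-end
    centroid, 1 = hybrid centroid)."""
    mean, axes = fit_axes(panel)
    a = project(west_cluster, mean, axes).mean(axis=0)
    b = project(hybrid_cluster, mean, axes).mean(axis=0)
    p = project(me, mean, axes)[0]
    return float((p - a) @ (b - a) / ((b - a) @ (b - a)))

zoom_fraction = fraction_toward_hybrid(zoom_panel)
world_fraction = fraction_toward_hybrid(world_panel)
print(f"zoom view:  {zoom_fraction:.2f} of the way toward the hybrid cluster")
print(f"world view: {world_fraction:.2f} of the way toward the hybrid cluster")
```

The two fractions need not agree, because the zoom axes can only describe the individual’s eastern ancestry through the northern-flavored hybrids, whereas the world axes are also shaped by groups closer to the individual’s actual eastern source.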

There’s actually a much better example than me, though, among the people I’m sharing genes with. This individual is an ethnic Persian. Note that in the world view they seem to be on the margins of the European cluster, verging toward the Central/South Asia group. But when you do the Central/South Asia zoom view, they’re in that cluster! Note the very different positions. Their “neighbor” in the zoom view is totally different from their neighbor in the world view:

My argument for why I’m more “Asian” in the world view is that the world view has Asian groups to which I am closer, which are excluded in my zoom view. A much more extreme case seems to be happening with this Persian individual, whose family is from northern Iran and has an oral history of Russian ancestry on one of their lineages.

This is the sort of reason why I assume any reader who points to a paper and a plot and asserts that “this proves X” is somewhat cognitively challenged. The patterns in PCA aren’t necessarily arbitrary. But they do need to be interpreted with care. One set of results isn’t dispositive of any given position in a debate, at least until you get to the ridiculous boundary conditions (in some ways, I think of a lot of genetic data visualization like I think of regression: it’s how people use and interpret it that is problematic, not the method itself).

Finally, doesn’t it seem ridiculous to you that South Asians are being projected onto a plot where the dimensions are generated from liminal populations? Imagine, if you will, that Europeans were projected onto a plot generated from the variance of Finnic and Slavic groups only. That’s a good analogy. The Pakistani groups in the HGDP data set are not good representatives of South Asian genetic variation, because they’re shifted to the margins of the distribution. That’s one reason that the Harappa Ancestry Project is so needful (and why if you just got your v3 results and are Iranian, Tibetan, Burmese, or South Asian, you should send them in. And v2 folks as well!).
