
D.I.Y. population structure inference, part 1 of many

If you’ve been reading this weblog for a while you’ve seen many images like the one above. It comes from the 2008 paper Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. The data set is from the Human Genome Diversity Project. It consists of 52 groups from around the world, curated for representativeness, but also for ethnic distinctiveness. The authors used the FRAPPE program, which like STRUCTURE and ADMIXTURE estimates the ancestry of individuals (and, in the aggregate, populations) as a combination of components, the number of which you specify with the parameter K. In other words, this is model-based. It works out really well when you have an intuition of the model you’re looking for. Imagine African Americans, who you can presume are a two-way admixture between two distinct ancestral populations. It works less well in other cases. For example, South Asians are modeled by 23andMe as a two-way admixture between Europeans and East Asians. Why this occurs is totally comprehensible; they have three reference populations (Chinese + Japanese = one), all of which are very different from South Asians. So the computer, being dumb but fast, simply slaps together the best inference possible from the weird constraints placed upon it. Garbage in, garbage out.
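To make the "model-based" point concrete, here is a minimal Python sketch of the two-way case. It is not what FRAPPE, STRUCTURE, or ADMIXTURE actually run: those programs estimate the ancestral allele frequencies and the ancestry fractions jointly, whereas this toy assumes the allele frequencies in the two source populations are already known and fits only one individual's ancestry fraction q by maximum likelihood. The function names and the simulated data are illustrative assumptions.

```python
import math
import random


def log_likelihood(q, genotypes, freq_a, freq_b):
    """Log-likelihood of one individual's genotypes given ancestry fraction q.

    Each genotype is a count (0, 1, or 2) of a reference allele. Under the
    mixture model the allele frequency at a marker is q * pa + (1 - q) * pb,
    and the two allele copies are treated as independent draws.
    """
    eps = 1e-9  # keeps log() away from zero
    total = 0.0
    for g, pa, pb in zip(genotypes, freq_a, freq_b):
        f = q * pa + (1.0 - q) * pb
        f = min(max(f, eps), 1.0 - eps)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def estimate_fraction(genotypes, freq_a, freq_b, iterations=60):
    """Ternary search for the q in [0, 1] that maximizes the log-likelihood.

    The log-likelihood is concave in q, so a ternary search is sufficient.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        ll1 = log_likelihood(m1, genotypes, freq_a, freq_b)
        ll2 = log_likelihood(m2, genotypes, freq_a, freq_b)
        if ll1 < ll2:
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def simulate_individual(q, freq_a, freq_b, rng):
    """Simulate genotypes for an individual with fraction q from population A."""
    genotypes = []
    for pa, pb in zip(freq_a, freq_b):
        f = q * pa + (1.0 - q) * pb
        genotypes.append(sum(1 for _ in range(2) if rng.random() < f))
    return genotypes


# Simulated (not real) allele frequencies for two source populations.
rng = random.Random(1)
n_markers = 5000
freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]

# An individual simulated with 80% ancestry from A; the estimate should land nearby.
person = simulate_individual(0.8, freq_a, freq_b, rng)
print(round(estimate_fraction(person, freq_a, freq_b), 2))
```

Note that the estimator always returns some q between 0 and 1, even for a person who descends from neither A nor B. That is the South Asian situation described above: the fit is only as sensible as the reference populations it is handed.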

But along with PCA, these sorts of algorithms, which allow one to visualize variance across hundreds of thousands of markers and hundreds of individuals, are very useful (though perhaps there are mentats amongst us who have no need of such techniques). You just have to use them with caution. Information may be free, but it can be misinterpreted!

Over the years of my blogging people have regularly asked questions of the form “are East Africans more closely related to West or South Africans?” These are easily answered; I would just look in the literature. But it did take time: I’d have to pick the right figure, look for Fst, and so forth. That is changing.
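For readers who have not met Fst, here is a rough Python sketch of how a "which pair is closer?" question reduces to comparing two numbers. It uses the simple heterozygosity-based form, (H_T - H_S) / H_T summed over markers, with equal population weights and no sample-size correction; published estimates typically use more careful estimators. The three populations and their allele frequencies are made up purely for illustration.

```python
def fst(freqs_1, freqs_2):
    """Fst between two populations from per-marker allele frequencies.

    H_S is the mean within-population expected heterozygosity, H_T is the
    expected heterozygosity of the pooled population (equal weights), and
    the result is the ratio of sums (H_T - H_S) / H_T across markers.
    """
    within = 0.0
    total = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        p_bar = (p1 + p2) / 2.0
        within += (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
        total += 2.0 * p_bar * (1.0 - p_bar)
    if total == 0.0:
        return 0.0  # no variation at any marker
    return (total - within) / total


# Made-up frequencies at five markers for three hypothetical populations.
pop_x = [0.10, 0.45, 0.80, 0.30, 0.62]
pop_y = [0.15, 0.40, 0.75, 0.35, 0.60]
pop_z = [0.50, 0.10, 0.40, 0.70, 0.20]

# The pair with the smaller Fst is the more closely related pair.
closer = "Y" if fst(pop_x, pop_y) < fst(pop_x, pop_z) else "Z"
print("X is closer to", closer)
```

A smaller Fst means less of the total variation lies between the two populations, which is the sense of "more closely related" in questions like the one above.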

The Nature piece “The rise of the genome bloggers” covered the change. Since last fall BGA and Dodecad have been dumping lots of bar plots and PCAs on the web. Instead of looking for a paper, I have now begun to use those sites as my resource of first choice (since they’re well indexed by Google). Now with HAP you have another source of information. It’s gotten to the point that technically capable commenters are now submitting their own results!

We’ve come a long way. Academics are not miserly with information, and some of my best friends have been the gatekeepers of the data and results. But now you can find the data on the web easily. You can reprocess the data by yourself. And, you can do the analysis yourself.

I’ve been sitting back for a while, letting Dienekes, Zack, etc., do their thing. There are so many technically fluent people out there, I’ve enjoyed just consuming the raw information yield. But that ends today. Over the past week I’ve been slapping together some R functions to make it easier for me to generate bar plots at various K’s, as well as PCAs. My goal is this: a reader asks a question, and I quickly constrain my data set appropriately, do the analyses, take the screenshots, upload them to the servers here, and point the reader to the images in the comments. The main constraint should be the computational resources (ADMIXTURE can take hours). Yes, that’s where we’re at.
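The "run it at various K’s" step can be sketched roughly as follows. This is Python, not the R functions mentioned above, and it only builds the command lines instead of running them. It assumes the usual ADMIXTURE invocation of an input file followed by K, with the optional `--cv` flag, and the convention that the ancestry fractions for each run land in a `.Q` file named after the input and K. The file name `hgdp_subset.bed` is hypothetical.

```python
from pathlib import Path


def admixture_commands(bed_path, k_values, cross_validate=False):
    """Build one ADMIXTURE command line (as an argv list) per value of K."""
    commands = []
    for k in k_values:
        argv = ["admixture"]
        if cross_validate:
            argv.append("--cv")  # ask for cross-validation error at this K
        argv += [str(bed_path), str(k)]
        commands.append(argv)
    return commands


def q_file_name(bed_path, k):
    """Expected name of the ancestry-fraction (.Q) output for a given K."""
    return f"{Path(bed_path).stem}.{k}.Q"


# "hgdp_subset.bed" is a hypothetical file name, not the actual data set.
for argv in admixture_commands("hgdp_subset.bed", range(2, 17)):
    print(" ".join(argv), "->", q_file_name("hgdp_subset.bed", argv[-1]))
```

Each argv list could be handed to `subprocess.run(argv, check=True)`. Since every K is an independent run, the hours add up quickly, which is why compute is the bottleneck.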

Every now and then I’m going to put up a post of ADMIXTURE bar plots or MDS/PCAs. One reason is that it will be useful for my later reference. Second, I think the slide show view is probably pretty useful to get a gestalt sense of what’s going on. That’s what we’re going for: human comprehension. Below is my first slide show, from K = 2 to K = 16. That is, the models assumed two to sixteen ancestral populations. I also excluded Sub-Saharan Africans from the data set since they’re so varied. Here are the details:


– ~55,000 markers
– All non-African HGDP populations
– HapMap Tuscans + Gujaratis (as well as some white Americans from 23andMe)
– Bengali = my parents, N = 2

I removed some bar plots because they seemed redundant:

– Makrani
– Melanesian
– White American (these are half a dozen friends whose data I received from 23andMe)
– North Italian
– Colombians
– Karitiana

Note, these populations are simply not displayed. Their variance still was used to generate the results!
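The distinction between "fit on everyone" and "show only some" can be sketched in a few lines of Python. The idea is that the ancestry fractions are estimated once on the full data set, and hiding a population is just a filter applied at plotting time. The parsing assumes ADMIXTURE-style `.Q` text (one row per individual, K whitespace-separated fractions); the numbers below are invented and are not results from this analysis.

```python
def parse_q(text):
    """Parse .Q-style text: one row per individual, K fractions per row."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def visible_rows(q_rows, labels, hidden):
    """Drop hidden populations for display; the fractions are not refit."""
    return [(label, row) for label, row in zip(labels, q_rows)
            if label not in hidden]


def ascii_bar(fractions, width=20, symbols="#=+*o"):
    """Render one individual's ancestry fractions as a text bar."""
    bar = []
    cumulative = 0.0
    start = 0
    for k, frac in enumerate(fractions):
        cumulative += frac
        end = round(cumulative * width)  # cumulative rounding keeps total width fixed
        bar.append(symbols[k % len(symbols)] * (end - start))
        start = end
    return "".join(bar)


# Invented K = 2 fractions for four individuals (illustration only).
q_text = """
0.75 0.25
0.50 0.50
0.10 0.90
0.00 1.00
"""
labels = ["Tuscan", "Makrani", "Gujarati", "Karitiana"]
hidden = {"Makrani", "Karitiana"}

for label, row in visible_rows(parse_q(q_text), labels, hidden):
    print(f"{label:10s} {ascii_bar(row)}")
```

Filtering after the fit leaves the displayed individuals' fractions exactly as estimated. Dropping the hidden populations before the fit would be a different analysis and could shift the components.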

With regard to the bar plots, I did not output the legend. There’s plenty of labeling of the ancestral fractions elsewhere, and it’s useful, but I think it is also important for people to take the colors in without any bias as to what they mean. I have added text to some of the slides though, which you can see at the bottom if you are so inclined. I apologize for the garishness of some of the colors…I have some element of colorblindness in the purple-violet range, for what it’s worth.

[zenphotopress album=263 sort=sort_order number=15]
