The Human Genome Diversity Project at high-coverage!

After a few years of presentations and preprints, the new high-quality whole-genome analysis of the HGDP dataset is finally published in Science: “Insights into human genetic variation and population history from 929 diverse genomes.” The HGDP dates back 30 years, so this is the culmination of a long line of research. The authors of this paper sequenced nearly 1,000 HGDP individuals at high coverage, meaning that each position was read enough times to give them extremely good confidence in their calls of the state of a base across all 3 billion base pairs.
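To make “coverage” concrete, here is a minimal sketch of the standard back-of-the-envelope calculation: mean depth is total sequenced bases divided by genome size, and per-site depth is often approximated as Poisson-distributed around that mean. The read counts, the 150-base read length, and the 10-read threshold are illustrative assumptions of mine, not numbers from the paper.

```python
import math

def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base (total bases / genome size)."""
    return n_reads * read_length / genome_size

def prob_at_least(k: int, mean_depth: float) -> float:
    """P(depth >= k) at a site, assuming depth ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below

GENOME_SIZE = 3_000_000_000  # ~3 billion base pairs, as in the post
READ_LENGTH = 150            # hypothetical short-read length

# Hypothetical read counts chosen to give a low- and a high-coverage genome.
for label, n_reads in [("low", 100_000_000), ("high", 600_000_000)]:
    depth = mean_coverage(n_reads, READ_LENGTH, GENOME_SIZE)
    p = prob_at_least(10, depth)
    print(f"{label}: mean depth {depth:.0f}x, P(site has >= 10 reads) = {p:.3f}")
```

Under these assumptions, nearly every site in the high-coverage genome is seen at least ten times, while most sites in the low-coverage genome are not. That is roughly why a rare variant seen in a high-coverage individual can be trusted rather than written off as a sequencing error.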

This is in contrast to the ~600,000 markers in the original HGDP analyses from the 2000s, which came from a “SNP-array.” A SNP-array of this form captures variation by looking only at sites already known to be polymorphic (sites which vary in the population). How did they originally determine what was polymorphic? Unfortunately, they had to rely on European populations, so the original analyses were using a quite skewed measuring stick. Whole-genome analyses bypass these problems because you get the totality of sequence information, and the high coverage means you can confidently call very rare variants in some of these individuals (they’re not false positives).
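The skewed measuring stick is easy to see in a toy example. The sketch below uses entirely made-up allele frequencies for five hypothetical sites in two populations: a “panel” population used to design the array, and a second population. An array only types sites that vary in the panel, while whole-genome sequencing sees every site that varies in anyone.

```python
# Made-up allele frequencies: site -> (panel population, other population).
# All names and numbers here are hypothetical, for illustration only.
sites = {
    "site_a": (0.30, 0.25),  # common in both
    "site_b": (0.10, 0.00),  # varies only in the panel population
    "site_c": (0.00, 0.20),  # varies only in the other population
    "site_d": (0.00, 0.02),  # rare, and only in the other population
    "site_e": (0.45, 0.50),  # common in both
}

def is_polymorphic(freq: float) -> bool:
    """A site varies in a population if the allele is neither absent nor fixed."""
    return 0.0 < freq < 1.0

def array_sites(sites: dict) -> list:
    """Sites an array would include: those polymorphic in the panel population."""
    return sorted(name for name, (panel, _) in sites.items() if is_polymorphic(panel))

def wgs_sites(sites: dict) -> list:
    """Sites whole-genome sequencing would find: polymorphic in either population."""
    return sorted(
        name for name, (panel, other) in sites.items()
        if is_polymorphic(panel) or is_polymorphic(other)
    )

missed = sorted(set(wgs_sites(sites)) - set(array_sites(sites)))
print("array sees:", array_sites(sites))
print("WGS sees:  ", wgs_sites(sites))
print("array misses:", missed)
```

In this toy, the array misses exactly the variation that is private to the non-panel population, including the rare site, which is the kind of variation that matters most for recent population structure.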

The HGDP was assembled by L. L. Cavalli-Sforza and curated from ethnographically interesting populations. Therefore, it is useful to compare it to the 1000 Genomes, which tends to focus on more conventional populations. The 1000 Genomes has 2,500 individuals, sequenced at somewhat lower coverage on average. While this project yielded 70 million polymorphisms, the 1000 Genomes Project had 85 million. Most of these are rare. The power to detect rare polymorphisms is useful in elucidating population structure because rare polymorphisms tend to be evolutionarily new, and so reflect more recent differentiation.

For example, they compared Yoruba, Mbuti, and non-Africans. Looking at common polymorphisms, the Yoruba are closer to non-Africans, while looking at rare ones, they are closer to Mbuti. Why? The rarer polymorphisms reflect recent differentiation, and there has been recent gene flow between Mbuti and Yoruba.

On the whole, they recapitulated earlier findings, but by using more sophisticated methods that leveraged their whole-genome data, they added some wrinkles. For example, some populations diverged in a very sharp and distinct fashion, such as Han and Yakuts, or Druze and Sardinians. But for the populations that diverged between 150,000 and 50,000 years ago, mostly within Africa, the separations were more gradual and probably characterized by repeated gene flow between the descendant groups (e.g., Non-Africans, Yoruba, Mbuti, San, etc.).

This reiterates that there isn’t a one-size-fits-all narrative we can use to talk about the emergence of modern populations and the way those populations are patterned. There are debates about whether we are a “clinal” species or not. I don’t think that’s a good question, because, as implied in this paper, a great deal of the past diversity has been collapsed through recent admixture events. The authors also detect deep and complex structure and differentiation. They’re clearly just scratching the surface.

Finally, there is more reiteration of the nature of Neanderthal and Denisovan admixture. The Neanderthals who mixed into early humans were either quite homogeneous or not very numerous. The haplotypes are not too numerous, and they don’t exhibit the patterns you’d expect from different admixtures and source populations. The diversity is too great to have come from a single individual, but it could have come from a small number. The main caution I would suggest here is that Neanderthals seem to often be quite homogeneous on the local scale.

The Denisovans are a different story. They detect the difference between Oceanian and non-Oceanian Denisovan ancestry (the Oceanian source Denisovans were quite distinct from the Altai Denisovans). But they also detect a different Denisovan contribution to the genomes of the Cambodians. The indigenous people of the Philippines also harbor different Denisovan ancestry (not in this paper). The “Denisovans” seem to have been a cluster of different lineages that persisted in parallel for a long time.

Where is there to go next with the HGDP? At some point, better technologies will allow for a more thorough exploration of structural variation. I’ve emphasized this is an analysis of the sequence because that’s what it is. There is more information in non-sequence variation that they’ll get to one day (there was some structural analysis in this paper, but I believe that we are currently technology-limited).

5 thoughts on “The Human Genome Diversity Project at high-coverage!”

  1. Razib

    Can you explain what you mean by “structural variation”? As something that is additional to WGS analysis…

    Thank you

  2. the current technology can’t span long repeats and other gaps in the human genome. it’s incomplete. once sequencing machines can generate huge strings of letters rather than 100-1000 base pair lengths they’ll ‘span the gaps’ and finish the whole map. until then various types of copy number will not be as well covered as a simple sequence. additionally, other types of tertiary structure might matter.

    (Fwiw, i think sequence is most of the info, but we’re not totally sure)

  3. I’m not an AAAS member so I can only read the abstract, but does the full paper take any stabs at calculating rough population divergence time estimates between Africans vs Non-Africans, or does it just leave it at the 50,000-150,000 year range you mentioned in your post? In other papers I’ve read that used the MSMC method for calculating population split times, I usually see roughly 100,000 year separation times between Mbuti and Non-African (and older splits with San), but the divergence between Yoruba vs. Non-African can be more variable, some as low as 70,000 years but I’ve also seen some calculated near 100,000 years as well.

    Also Razib, are you aware of any new papers or just general research on the human mutation rate coming out in the near future? I know you’re friends with Moorjani and she did a lot of stuff on it a couple years ago, but I was wondering if she had anything more on it coming up.
