The HGDP in the post-ascertainment era

In the 1990s there was a huge debate around the “Human Genome Diversity Project” (HGDP). By the HGDP I don’t mean what you probably know as the HGDP panel, but a more ambitious attempt to genotype tens of thousands of individuals across the world. In the end activists “won”, and the grand plans came to naught. If you want to read about it, The Human Genome Diversity Project: An Ethnography of Scientific Practice has a scholarly viewpoint, though you can also just ask someone who was involved with the human population genetics community in the 1990s (this is not a large set of scholars).

Ultimately the HGDP became the samples from L. L. Cavalli-Sforza’s dataset which you read about in The History and Geography of Human Genes. This is what drives the HGDP Browser. It’s also the data set at the heart of papers like Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Here is the abstract:

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

These SNPs, though, were ascertained on European populations. That is, the variants on the array tended to be ones that are polymorphic in Europe. This is a problem, and one reason that the Human Origins Array was developed. The ascertainment problem was really obvious when researchers were looking at Khoisan genomes, and noticed how much variation they had that wasn’t being captured on SNP arrays.
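To make the ascertainment problem concrete, here is a minimal sketch in Python. The allele frequencies are invented for illustration (they are not HGDP data), and the function names are my own. The "array" keeps only SNPs that are polymorphic in the discovery population, and we then count how much of the other population's variation survives.

```python
# Toy allele frequencies: hypothetical numbers, for illustration only.
# Each value is (frequency in discovery population, frequency in other population).
toy_snps = {
    "snp1": (0.30, 0.25),
    "snp2": (0.10, 0.40),
    "snp3": (0.00, 0.20),  # variable only in the other population
    "snp4": (0.00, 0.35),  # variable only in the other population
    "snp5": (0.45, 0.00),  # variable only in the discovery population
}


def is_polymorphic(freq):
    """A site is polymorphic if both alleles are present (0 < freq < 1)."""
    return 0.0 < freq < 1.0


def ascertain(snps):
    """Keep only SNPs that are polymorphic in the discovery population (index 0)."""
    return {name: f for name, f in snps.items() if is_polymorphic(f[0])}


def count_polymorphic(snps, pop_index):
    """Count SNPs that are polymorphic in the population at pop_index."""
    return sum(is_polymorphic(f[pop_index]) for f in snps.values())


array = ascertain(toy_snps)
all_other = count_polymorphic(toy_snps, 1)    # variable sites in the other population
on_array_other = count_polymorphic(array, 1)  # how many of those the array captures

print(f"Other population: {on_array_other} of {all_other} variable sites are on the array")
```

In this toy case the array captures only half of the sites that vary in the second population, while every array SNP is informative for the discovery population. Real arrays are more subtle than a hard polymorphic/monomorphic cut, but the direction of the bias is the same.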

Today, we’re finally moving beyond the era where ascertainment is so much of an issue. At the SMBE meeting earlier this month Anders Bergstrom presented results from the HGDP using whole-genome analysis. When you look at the whole genome, you obviate the problem of selecting a biased subset of the variation. You can look at all the variation, or choose which subset of it you want to look at.

Bergstrom & company will have a paper on the whole-genome analysis of the HGDP in the near future. I assume it will be somewhat like the 1000 Genomes paper, but I bet you the SNP count will be higher, because they have Khoisan in their samples (along with Mbuti, etc.). Anders shared with me some of the preliminary data that the Sanger Institute has generated.

Below the fold I plotted a PCA of the HGDP data. First, the classic SNP-chip data. Second, SNPs pulled out of the WGS which are very high quality calls (though they may still include some wrong calls) and have a minor allele frequency of at least 1% (~1.5 million). You immediately notice the Eurasian compression along PC 1. Finally, using ~15 million SNPs that had no missingness in the data, you see PC 2 being defined by San Bushmen vs. non-San-Bushmen, while Mbuti Pygmies along with Biaka clearly are the furthest along PC 1 excepting the San. There are 6 San Bushmen in the data. If there are SNPs which are very distinct to this group, and not polymorphic in other populations, then my 1% cut-off would actually remove that variation.
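The point about the 1% cut-off can be checked with a little arithmetic. This is a rough sketch, not the actual filtering pipeline: I am assuming a panel of 938 diploid individuals (the count quoted in the abstract above; the whole-genome sample count may differ), and the function name is my own. A SNP that varies only within one group can contribute at most two allele copies per individual in that group.

```python
def max_private_maf(n_group, n_total):
    """Largest panel-wide minor allele frequency possible for a SNP that is
    polymorphic only within a group of n_group diploid individuals, out of
    n_total diploid individuals in the whole panel."""
    if not 0 < n_group <= n_total:
        raise ValueError("need 0 < n_group <= n_total")
    # At most 2 * n_group copies out of 2 * n_total chromosomes carry the allele;
    # a minor allele frequency can never exceed 0.5.
    return min((2 * n_group) / (2 * n_total), 0.5)


MAF_CUTOFF = 0.01   # the 1% threshold used for the second PCA
N_SAN = 6           # San individuals in the data
N_TOTAL = 938       # assumption: panel size taken from the abstract above

bound = max_private_maf(N_SAN, N_TOTAL)
survives_filter = bound >= MAF_CUTOFF

print(f"max MAF for a San-private SNP: {bound:.4f}; passes 1% filter: {survives_filter}")
```

Under these assumptions the bound is 12 copies out of 1,876 chromosomes, roughly 0.64%, so even a variant carried by every San chromosome and by no one else would fall below the 1% threshold. With this panel size a single group would need at least 10 individuals before any of its private variants could pass.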

It’s an interesting world we live in, thanks to research groups like the Sanger Institute, Estonian Biocentre, and the 1000 Genomes Project, as well as tools such as PLINK. Analysis that took decades in the 20th century can now be whipped out in a matter of hours. Better analyses in fact.
