The HGDP in the post-ascertainment era

In the 1990s there was a huge debate around the “Human Genome Diversity Project” (HGDP). By the HGDP I don’t mean what you probably know as the HGDP panel, but a more ambitious attempt to genotype tens of thousands of individuals across the world. In the end activists “won”, and the grand plans came to naught. If you want to read about it, The Human Genome Diversity Project: An Ethnography of Scientific Practice has a scholarly viewpoint, though you can also just ask someone who was involved with the human population genetics community in the 1990s (this is not a large set of scholars).

Ultimately the HGDP became the samples from L. L. Cavalli-Sforza’s dataset which you read about in The History and Geography of Human Genes. This is what drives the HGDP Browser. It’s also the data set at the heart of papers like Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Here is the abstract:

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

These SNPs, though, were ascertained in European populations. That is, the variation on the chip tended to be variation that is polymorphic in Europeans. This is a problem, and one reason that the Human Origins Array was developed. The ascertainment problem was really obvious when researchers were looking at Khoisan genomes, and noticed how much variation they had that wasn’t being captured on SNP arrays.
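To make the ascertainment problem concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions: the three SNP classes, the uniform frequency range, and the ten-person discovery panel are all made up for illustration, and none of it reflects how any real array was designed. The point is simply that a SNP discovered in a panel from population A can never be one that is private to population B, and that the SNPs which do make it onto the "array" skew toward the common ones in A.

```python
import numpy as np

def simulate_ascertainment(n_snps=10_000, panel_size=10, seed=0):
    """Toy model of SNP discovery in a small panel drawn from population A.

    Hypothetical setup: each SNP is shared (0), private to A (1),
    or private to B (2), with frequencies drawn uniformly in [0.01, 0.5].
    """
    rng = np.random.default_rng(seed)
    kind = rng.integers(0, 3, size=n_snps)
    freq_a = np.where(kind != 2, rng.uniform(0.01, 0.5, size=n_snps), 0.0)
    freq_b = np.where(kind != 1, rng.uniform(0.01, 0.5, size=n_snps), 0.0)
    # Sample 2 * panel_size chromosomes from A; a SNP goes on the "array"
    # only if it is polymorphic in that discovery panel.
    copies = rng.binomial(2 * panel_size, freq_a)
    on_array = (copies > 0) & (copies < 2 * panel_size)
    return kind, freq_a, freq_b, on_array

kind, freq_a, freq_b, on_array = simulate_ascertainment()

# Fraction of B-private SNPs that reach the array (always 0 in this model,
# because their frequency in A is 0).
private_b_kept = on_array[kind == 2].mean()

# Mean frequency in A: all SNPs polymorphic in A vs. those on the array.
mean_freq_all = freq_a[freq_a > 0].mean()
mean_freq_array = freq_a[on_array].mean()

print("B-private SNPs on array:", private_b_kept)
print("mean freq in A, all polymorphic SNPs:", round(mean_freq_all, 3))
print("mean freq in A, SNPs on array:", round(mean_freq_array, 3))
```

Swap "A" for Europeans and "B" for Khoisan and you have the gist of why chip data undercounts variation in the populations that were not in the discovery panel.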

Today, we’re finally moving beyond the era where ascertainment is so much of an issue. At the SMBE meeting earlier this month Anders Bergstrom presented results from the HGDP using whole-genome analysis. When you look at the whole genome, you obviate the problem of selecting a biased subset of the variation. You can look at all the variation, or choose which variation you want to look at.

Bergstrom & company will have a paper on the whole-genome analysis of the HGDP in the near future. I assume it will be somewhat like the 1000 Genomes paper, but I bet you the SNP count will be higher, because they have Khoisan in their samples (along with Mbuti, etc.). Anders shared with me some of the preliminary data that the Sanger Institute has generated.

Below the fold I plotted a PCA of the HGDP data. First, the classic SNP-chip data. Second, SNPs pulled out of the WGS which are very high quality calls (though they may still include wrong calls) and have a minor allele frequency of at least 1% (~1.5 million). You immediately notice the Eurasian compression along PC 1. Finally, using ~15 million SNPs that had no missingness in the data, you see PC 2 being defined by San Bushmen vs. non-San-Bushmen, while Mbuti Pygmies along with Biaka clearly are the furthest along PC 1 excepting the San. There are 6 San Bushmen in the data. If there are SNPs which are very distinct to this group, and not polymorphic in other populations, then my 1% cut-off would actually remove that variation.
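The arithmetic behind that last point is worth spelling out. A rough sketch in Python, assuming a total of 938 individuals (the figure from the abstract quoted above; the whole-genome sample count may differ) and a hypothetical helper name:

```python
import math

def max_global_freq_of_private_allele(n_group, n_total):
    """Global frequency of an allele that is fixed in one group of diploid
    individuals and absent everywhere else (the best case for that allele)."""
    return (2 * n_group) / (2 * n_total)

N_SAN = 6          # from the post
N_TOTAL = 938      # assumption: sample count from the quoted abstract
MAF_CUTOFF = 0.01  # the 1% filter used for the second plot

ceiling = max_global_freq_of_private_allele(N_SAN, N_TOTAL)
survives = ceiling >= MAF_CUTOFF

# Smallest group whose fixed, private alleles would pass the cut-off.
min_group = math.ceil(MAF_CUTOFF * N_TOTAL)

print(f"ceiling on global frequency: {ceiling:.4f}")  # 0.0064
print("passes 1% filter:", survives)                  # False
print("group size needed:", min_group)                # 10
```

So under that assumption, even an allele fixed in all 6 San and absent everywhere else tops out around 0.64% globally, and a 1% minor allele frequency filter drops it.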

It’s an interesting world we live in, thanks to research groups like the Sanger Institute, Estonian Biocentre, and the 1000 Genomes Project, as well as tools such as PLINK. Analysis that took decades in the 20th century can now be whipped out in a matter of hours. Better analyses in fact.
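For a sense of what the filter-then-PCA step amounts to, here is a minimal NumPy sketch on simulated genotypes. It is not PLINK's implementation and not the pipeline behind the plots in this post; the function name `filter_and_pca`, the two toy populations, and every parameter value are hypothetical. It drops SNPs with any no-calls, applies a minor allele frequency cut-off, standardizes each SNP by its allele frequency (a common convention, assumed here), and takes the top components from an SVD.

```python
import numpy as np

def filter_and_pca(genotypes, min_maf=0.01, n_components=2):
    """genotypes: (individuals x SNPs) array of 0/1/2 allele counts,
    with np.nan marking a no-call."""
    g = np.asarray(genotypes, dtype=float)
    # Keep only SNPs with zero missingness.
    g = g[:, ~np.isnan(g).any(axis=0)]
    # Minor allele frequency filter (also drops monomorphic SNPs).
    p = g.mean(axis=0) / 2.0
    maf = np.minimum(p, 1.0 - p)
    keep = (maf >= min_maf) & (maf > 0)
    g, p = g[:, keep], p[keep]
    # Center by 2p and scale by sqrt(2p(1-p)), then take the SVD.
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: two made-up populations of 20 individuals, 500 SNPs,
# with independently drawn allele frequencies.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = rng.uniform(0.1, 0.9, size=500)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(20, 500)),
    rng.binomial(2, freq_b, size=(20, 500)),
]).astype(float)
genotypes[0, 0] = np.nan  # one no-call, so SNP 0 is filtered out

pcs = filter_and_pca(genotypes)
print(pcs[:3, 0], pcs[-3:, 0])  # the two groups fall on opposite sides of PC 1
```

On real data you would let PLINK or a similar tool do this at scale; the sketch is only meant to show that the logic itself is short.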

650,000 SNPs (European ascertained)


1.5 million SNPs, 1% or more minor allele frequency

15 million SNPs, 0 “no calls”


3 thoughts on “The HGDP in the post-ascertainment era”

  1. I have a friend who works at Sanger. Until last year she had been at Cambridge working in medical genomics. She now has to supervise population geneticists. I told her to read GNXP.

  2. Is there some user’s guide for reading these charts? So many similar shades of color, and each one is used for 3 different ethnic groups.

  3. Couple questions:

    1) On European ascertainment, of course it looks most potent as a confound when dealing with the deepest splits in our species, in Africa (though really Eurasian ascertainment I guess?). But how much is this a confound for generally determining the relative size of migratory barriers within regions other than Europe, outside of Africa?

    To take the example of China, one of the recent posts on Charleston Chiang’s use of China wide data found there was a bit less differentiation North-South in China than comparable European panels, and particularly weak East-West differentiation.

    Any chance that ascertainment in commonly used panels (e.g. even Human Origins) is still suppressing magnitude there? Or is ascertainment too insignificant an issue there? How prevalent is this issue for comparing Asian and West Eurasian variation generally?

    2) How do these compare with the samples of world diversity taken by the Simons Genome Diversity Project – and Estonian Biocentre Human Genome Diversity Panel (EGDP) – in 2016? HGDP totally lacks South Indian and Oceanian samples, by comparison.

