Whole genome sequencing comes to Cavalli-Sforza’s samples

More than twenty years ago L. L. Cavalli-Sforza published The History and Geography of Human Genes. Based on decades of analysis of ‘classical’ markers, this work lays out results of statistical genetic analyses based on a few hundred genes, as well as displaying Cavalli-Sforza’s encyclopedic ethnographic knowledge. A close look at this book will yield some population groups familiar to readers of this weblog. The reason for this is simple: the cell lines went on to contribute to the HGDP data set.

In 2002 Rosenberg et al. used these populations in Genetic Structure of Human Populations by looking at “377 autosomal microsatellite loci.” Microsatellites are highly variable genetic regions. They pack a lot of diversity per locus. With more input variation Rosenberg et al. advanced beyond Cavalli-Sforza’s earlier work (instead of pairwise comparisons between populations, one could infer individual relatedness as displayed in a bar plot).
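One way to make “a lot of diversity per locus” concrete is expected heterozygosity, the chance that two randomly drawn alleles differ (one minus the sum of squared allele frequencies). The sketch below is only an illustration: the allele frequencies are invented, not taken from Rosenberg et al., and the ten-allele microsatellite is a hypothetical example.

```python
def expected_heterozygosity(freqs):
    """Probability that two randomly drawn alleles differ: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Illustrative frequencies only (assumptions, not data from the paper).
snp = [0.5, 0.5]             # a biallelic SNP at its most informative
microsatellite = [0.1] * 10  # a hypothetical locus with ten equally common alleles

h_snp = expected_heterozygosity(snp)
h_msat = expected_heterozygosity(microsatellite)

print(f"SNP heterozygosity:            {h_snp:.2f}")
print(f"Microsatellite heterozygosity: {h_msat:.2f}")
```

A biallelic marker can never exceed 0.5 on this measure, while a multi-allelic locus can approach 1, which is why a few hundred microsatellites could carry so much information.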

But times change, and in 2008 the same data set was used in Worldwide human relationships inferred from genome-wide patterns of variation, which utilized a 650,000 marker SNP-array. Though Rosenberg et al.’s work advanced the ball considerably, the move to genome-wide analysis was even bigger. For many years this data set has been a widely used benchmark and reference (these markers and populations were part of the early basis of 23andMe’s analyses in terms of population genetic inference). As the 1000 Genomes Project moved us beyond the SNP-array period by looking at the whole genome, as opposed to a specific set of SNPs, the HGDP populations were still an important complement.

The reason was simple: Cavalli-Sforza was an ethnographic genius in comparison to most geneticists and had selected very interesting and informative populations. In some ways, the original motivation given for selecting these groups, that they may have preserved phylogenetic patterns obscured in cosmopolitan populations, has only been partially justified.

Ancient DNA has shed light on the reality that almost all populations, indigenous and cosmopolitan, come out of periods of admixture between lineages which had heretofore been distinct and separate. But some of Cavalli-Sforza’s populations have been inordinately important in informing us about branches of the human family tree less well represented in the cosmopolitan samples accessible in the 1000 Genomes Project (or earlier, the HapMap). I’m thinking here of the Kalash (a relatively good proxy for “Ancestral North Indians”), Sardinians (the best representatives in the modern world of “Early European Farmers”), and African hunter-gatherers (who carry the deepest diverging lineage within the modern human clade).

With all that, finally, the HGDP whole genome preprint is out. Anders Bergstrom superstar!

Insights into human genetic variation and population history from 929 diverse genomes:

Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.

The authors were able to draw on many more subtle analytic methods with their phasing, which seems to have been considerably superior to population phasing (the HGDP does have some closely related individuals due to endogamy, but no traditional trios). Because their population set included some undersampled groups with a lot of diversity (e.g., San Bushmen), they detected about as many variants with ~1,000 individuals at good coverage as the 1000 Genomes Project with ~2,500 individuals at variable coverage.

And there are major lacunae within the 1000 Genomes Project data set even after taking into account ethnographically and historically important groups such as the San Bushmen. There are no Middle Eastern populations in the 1000 Genomes Project. The HGDP has the Druze, Bedouin, Palestinians, and Mozabites.

The preprint requires a lot of deep reading. There is much in there that one can mull over (frankly, I’m excited about the supplementary text, but that’s just me). One thing that came to mind is that ancient DNA and other more narrow studies laid the groundwork for the interpretations that naturally fall out of this extremely rich potential set of analyses. For example, by looking at shared variants across western and central Africa the authors confirm the likely result that there is a basal human population of some sort mixed into peoples of far western Africa. They also confirm that the Yoruba are about 5-10% Eurasian.

These sequences generate so much data that there are lots of potential models that might conform to them. Earlier work eliminates some possibilities and highlights others.

Ancient DNA has confirmed for many that non-Africans have Neanderthal ancestry. But there have been several debates about whether there are issues with the assumption that Africans have no Neanderthal ancestry, and how it skews statistics (e.g., if Africans have some Neanderthal one will underestimate the Neanderthal fraction). Though there are still details to be hashed out, looking at coalescence patterns of haplotypes the authors seem to be able to infer the presence of deeply diverged lineages in various populations without positing a prior model of which populations did not have the introgression as a baseline. Basically, Neanderthal and Denisovan ancestry is going to result in some “long branches” in the phylogenies of the genes within non-African populations which are lacking in Africans, and that is what they see.
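The underestimation worry can be illustrated with a toy linear model. This is a sketch under simplifying assumptions, not the method used in the preprint: suppose an estimator scales a non-African population’s excess archaic signal by the signal a fully archaic genome would show, both measured against an African baseline that is assumed to carry no archaic ancestry. If the baseline actually carries a fraction b and the target a fraction a, the estimate comes out as (a - b) / (1 - b). The function name and all numbers are hypothetical.

```python
def apparent_fraction(true_fraction, baseline_fraction):
    """Toy model: archaic fraction inferred for a target population when the
    baseline population is wrongly assumed to have zero archaic ancestry."""
    if not 0.0 <= true_fraction <= 1.0:
        raise ValueError("true_fraction must be between 0 and 1")
    if not 0.0 <= baseline_fraction < 1.0:
        raise ValueError("baseline_fraction must be in [0, 1)")
    return (true_fraction - baseline_fraction) / (1.0 - baseline_fraction)


# Hypothetical values chosen only to show the direction of the bias.
true_non_african = 0.02
for baseline in (0.0, 0.003, 0.005):
    estimate = apparent_fraction(true_non_african, baseline)
    print(f"baseline {baseline:.3f} -> apparent fraction {estimate:.4f}")
```

Any archaic ancestry hiding in the baseline pulls the estimate below the true value, which is why an approach that needs no “clean” baseline population is attractive.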

These researchers also confirm the model presented by others that Neanderthal contribution seems to have been from a single admixture event (I do wonder if perhaps Neanderthals were not simply extremely homogeneous, so multiple close admixture events may not be differentiable). They also find that the “Denisovan” population structure was more complex, and there were several admixture events into eastern Eurasian and Oceanian populations.

Finally, there are attempts to adduce the nature of population differentiation, and times of separation. As noted in the text, all of these sorts of analyses are sensitive to assumptions within models. They used a variety of methods which came to different results, but one thing seems clear: Africa had a lot of deep structure for a long time, and gene flow between regional populations meant that genetic differentiation emerged gradually, rather than rapidly due to geographic separation. Over five years ago Iain Mathieson casually told me that he viewed much of the past 200,000 years as the collapse of deep population structure, and that does not seem to have been a crazy prediction if you read through this preprint (though the collapse may reflect increased rates of gene flow, rather than massive pulse admixtures).

But the separation and differentiation outside of Africa, and between the archaic lineages and Africans, seems to have exhibited more punctuation. For the past twenty years John Hawks has been emphasizing that we need to remember that during the Pleistocene Africa likely had a much larger hominin population than the rest of the world (with perhaps a caveat for lower latitude Asia). The relatively “clean” separation between the proto-modern African lineage and the Eurasian hominins, and then the quick separation between Neanderthals and the eastern group which became Denisovans, emphasizes perhaps the importance of particular geographic barriers (deserts in the Near East), as well as the lower carrying capacity in much of Eurasia. With lower population densities and patchy occupation patterns, gene flow would be sharply reduced. This would result in drift and sharply different lineages.

There are arguments out there about whether humans are a clinal species or not. These verbal descriptions really don’t tell us much. The combination of ancient DNA and whole genome data will allow us to specify, at particular times and places, the nature of population dynamics. If human population relationships can be thought of as a graph, a set of nodes connected by edges, in some areas the connections will be thicker (ergo, lots of continuous gene flow), and in other areas, the graphs will be easier to represent as diverging trees.

I think the last 10,000 years of the Holocene has brought to Eurasia a more African pattern, as deep structure comes crashing down due to rapid population expansion and mixing….

The HGDP in the post-ascertainment era

In the 1990s there was a huge debate around the “Human Genome Diversity Project” (HGDP). By the HGDP I don’t mean what you probably know as the HGDP panel, but a more ambitious attempt to genotype tens of thousands of individuals across the world. In the end activists “won”, and the grand plans came to naught. If you want to read about it, The Human Genome Diversity Project: An Ethnography of Scientific Practice has a scholarly viewpoint, though you can also just ask someone who was involved with the human population genetics community in the 1990s (this is not a large set of scholars).

Ultimately the HGDP became the samples from L. L. Cavalli-Sforza’s dataset which you read about in The History and Geography of Human Genes. This is what drives the HGDP Browser. It’s also the data set at the heart of papers like Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Here is the abstract:

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

These SNPs though were ascertained on European populations. That is, the genetic variation tended to be genetic variation found in Europe. This is a problem, and one reason that the Human Origins Array was developed. The ascertainment problem was really obvious when researchers were looking at Khoisan genomes, and noticed how much variation they had that wasn’t being captured on SNP-arrays.
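A minimal sketch of how ascertainment plays out, using invented numbers: the eight sites, the population labels, and the 5% discovery threshold below are all assumptions made for illustration, not properties of any real array. Sites are kept for the “array” only if they are common in the discovery panel, and then diversity is counted in each population with and without that filter.

```python
# Hypothetical allele frequencies at eight variant sites (invented values).
SITES = {
    "s1": {"EUR": 0.30, "KHS": 0.20},
    "s2": {"EUR": 0.10, "KHS": 0.00},
    "s3": {"EUR": 0.45, "KHS": 0.40},
    "s4": {"EUR": 0.00, "KHS": 0.25},
    "s5": {"EUR": 0.00, "KHS": 0.10},
    "s6": {"EUR": 0.02, "KHS": 0.35},
    "s7": {"EUR": 0.00, "KHS": 0.15},
    "s8": {"EUR": 0.20, "KHS": 0.00},
}


def ascertain(sites, panel, min_freq):
    """Keep sites whose minor allele frequency in the discovery panel is >= min_freq."""
    return {
        name: freqs
        for name, freqs in sites.items()
        if min(freqs[panel], 1.0 - freqs[panel]) >= min_freq
    }


def count_polymorphic(sites, pop):
    """Number of sites at which the population carries both alleles."""
    return sum(0.0 < freqs[pop] < 1.0 for freqs in sites.values())


array_sites = ascertain(SITES, panel="EUR", min_freq=0.05)

for pop in ("EUR", "KHS"):
    full = count_polymorphic(SITES, pop)
    on_array = count_polymorphic(array_sites, pop)
    print(f"{pop}: {on_array} of {full} polymorphic sites captured on the array")
```

In this toy data the discovery population keeps most of its variable sites, while the other population loses most of its own, even though it is the more diverse of the two when every site is examined.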

Today, we’re finally moving beyond the era where ascertainment is so much of an issue. At the SMBE meeting earlier this month Anders Bergstrom presented results from the HGDP using whole-genome analysis. When you look at the whole genome, you obviate the problem of selecting a biased subset of the variation. You can look at all the variation, or vary the variation you want to look at.

Bergstrom & company will have a paper on the whole-genome analysis of the HGDP in the near future. I assume it will be somewhat like the 1000 Genomes paper, but I bet you the SNP count will be higher, because they have Khoisan in their samples (along with Mbuti, etc.). Anders shared with me some of the preliminary data that the Sanger Institute has generated.

Below the fold I plotted a PCA of the HGDP data. First, the classic SNP-chip data. Second, SNPs pulled out of the WGS which are very high quality calls (though they may still include wrong calls) and have a minor allele frequency of at least 1% (~1.5 million). You immediately notice the Eurasian compression along PC 1. Finally, using ~15 million SNPs that had no missingness in the data, you see PC 2 being defined by San Bushmen vs. non-San-Bushmen, while Mbuti Pygmies along with Biaka clearly are the furthest along PC 1 excepting the San. There are 6 San Bushmen in the data. If there are SNPs which are very distinct to this group, and not polymorphic in other populations, then my 1% cut-off would actually remove that variation.
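The arithmetic behind that last point is easy to check. The sketch below assumes a total of 929 individuals, a figure borrowed from the later preprint abstract; the preliminary data set plotted here may have had a somewhat different sample count, so treat the total as an assumption. The helper name is made up for this example.

```python
def max_private_frequency(group_size, total_size, ploidy=2):
    """Pooled frequency of a variant that is fixed in one group and absent
    everywhere else, i.e. the ceiling for any variant private to that group."""
    group_copies = group_size * ploidy
    total_chromosomes = total_size * ploidy
    return group_copies / total_chromosomes


n_san = 6          # number of San individuals, from the text
n_total = 929      # assumption: sample count taken from the preprint abstract
maf_cutoff = 0.01  # the 1% minor allele frequency filter

ceiling = max_private_frequency(n_san, n_total)
survives = ceiling >= maf_cutoff

print(f"Highest possible pooled frequency: {ceiling:.4f}")
print(f"Passes the {maf_cutoff:.0%} filter: {survives}")
```

Under these assumptions even a variant carried by every San chromosome sits at roughly 0.65% in the pooled sample, so a 1% filter drops San-private variation entirely.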

It’s an interesting world we live in, thanks to research groups like the Sanger Institute, Estonian Biocentre, and the 1000 Genomes Project, as well as tools such as PLINK. Analysis that took decades in the 20th century can now be whipped out in a matter of hours. Better analyses in fact.
