
Whole genome sequencing comes to Cavalli-Sforza’s samples

More than twenty years ago L. L. Cavalli-Sforza published The History and Geography of Human Genes. Based on decades of analysis of ‘classical’ markers, this work lays out results of statistical genetic analyses based on a few hundred genes, as well as displaying Cavalli-Sforza’s encyclopedic ethnographic knowledge. A close look at this book will yield some familiar population groups to readers of this weblog. The reason for this is simple: the cell lines continued onward to contribute to the HGDP data set.

In 2002 Rosenberg et al. used these populations in Genetic Structure of Human Populations by looking at “377 autosomal microsatellite loci.” Microsatellites are highly variable genetic regions. They pack a lot of diversity per locus. With more input variation Rosenberg et al. advanced beyond Cavalli-Sforza’s earlier work (instead of pairwise comparisons between populations, one could infer individual relatedness as displayed in a bar plot).

But times change, and in 2008 the same data set was used in Worldwide human relationships inferred from genome-wide patterns of variation, which utilized a 650,000 marker SNP-array. Though Rosenberg et al.’s work advanced the ball considerably, the move to genome-wide analysis was even bigger. For many years this data set has been a widely used benchmark and reference (these markers and populations were part of the early basis of 23andMe’s analyses in terms of population genetic inference). As the 1000 Genomes Project moved us beyond the SNP-array period, looking at the whole genome, as opposed to a specific set of SNPs, the HGDP populations were still an important complement.

The reason was simple: Cavalli-Sforza was an ethnographic genius in comparison to most geneticists and had selected very interesting and informative populations. In some ways, the original motivation given for selecting these groups, that they may have preserved phylogenetic patterns obscured in cosmopolitan populations, has only been partially justified.

Ancient DNA has shed light on the reality that almost all populations, indigenous and cosmopolitan, come out of periods of admixture between lineages which had heretofore been distinct and separate. But some of Cavalli-Sforza’s populations have been inordinately important in informing us about branches of the human family tree less well represented in the cosmopolitan samples accessible in the 1000 Genomes Project (or earlier, the HapMap). I’m thinking here of the Kalash (a relatively good proxy for “Ancestral North Indians”), Sardinians (the best representatives in the modern world of “Early European Farmers”), and African hunter-gatherers (who carry the deepest diverging lineage within the modern human clade).

With all that, finally, the HGDP whole genome preprint is out. Anders Bergstrom superstar!

Insights into human genetic variation and population history from 929 diverse genomes:

Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.

The authors were able to draw on many more subtle analytic methods thanks to their physical phasing, which seems to have been considerably superior to population phasing (the HGDP does have some closely related individuals due to endogamy, but no traditional trios). Because their population set included some undersampled groups with a lot of diversity (e.g., San Bushmen), they detected about as many variants with ~1,000 individuals at good coverage as the 1000 Genomes Project did with ~2,500 individuals at variable coverage.

And there are major lacunae within the 1000 Genomes Project data set even after taking into account ethnographically and historically important groups such as the San Bushmen. There are no Middle Eastern populations in the 1000 Genomes Project. The HGDP has the Druze, Bedouin, Palestinians, and Mozabites.

The preprint requires a lot of deep reading. There is much in there that one can mull over (frankly, I’m excited about the supplementary text, but that’s just me). One thing that came to mind is that ancient DNA and other more narrow studies laid the groundwork for the interpretations that naturally fall out of this extremely rich potential set of analyses. For example, by looking at shared variants across western and central Africa the authors confirm the likely result that there is a basal human population of some sort mixed into peoples of far western Africa. And they also confirm that the Yoruba are about 5-10% Eurasian.

These sequences generate so much data that there are lots of potential models that might conform to them. Earlier work eliminates some possibilities and highlights others.

Ancient DNA has confirmed for many that non-Africans have Neanderthal ancestry. But there have been several debates about whether there are issues with the assumption that Africans have no Neanderthal ancestry, and how it skews statistics (e.g., if Africans have some Neanderthal one will underestimate the Neanderthal fraction). Though there are still details to be hashed out, looking at coalescence patterns of haplotypes the authors seem to be able to infer the presence of deeply diverged lineages in various populations without positing a prior model of which populations did not have the introgression as a baseline. Basically, Neanderthal and Denisovan ancestry is going to result in some “long branches” in the phylogenies of the genes within non-African populations which are lacking in Africans, and that is what they see.
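To make the underestimation point concrete, here is a minimal sketch under a toy linear-mixing assumption. It is not the preprint's method: the function name and every number are hypothetical, chosen only to illustrate what happens when a ratio-style estimator treats the African baseline as having zero archaic ancestry.

```python
def naive_archaic_fraction(alpha_true, beta_baseline):
    """Toy model (an assumption, not the preprint's estimator).

    alpha_true    -- true archaic fraction in the test population
    beta_baseline -- true archaic fraction in the baseline population
                     that the estimator *assumes* to be zero

    Under simple linear mixing, the signal shared with the archaic
    genome scales with (alpha - beta) in the test population and with
    (1 - beta) for the archaic genome itself, so the ratio is:
    """
    return (alpha_true - beta_baseline) / (1.0 - beta_baseline)


# Hypothetical values, for illustration only.
unbiased = naive_archaic_fraction(alpha_true=0.02, beta_baseline=0.0)
biased = naive_archaic_fraction(alpha_true=0.02, beta_baseline=0.003)

print(f"baseline truly archaic-free:   {unbiased:.4f}")
print(f"baseline carries 0.3% archaic: {biased:.4f}")
```

Even a small unmodeled fraction in the baseline pulls the estimate down, which is why an approach that does not require an introgression-free reference population is attractive.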

These researchers also confirm the model presented by others that Neanderthal contribution seems to have been from a single admixture event (I do wonder if perhaps Neanderthals were not simply extremely homogeneous, so multiple close admixture events may not be differentiable). They also find that the “Denisovan” population structure was more complex, and there were several admixture events into eastern Eurasian and Oceanian populations.

Finally, there are attempts to adduce the nature of population differentiation, and times of separation. As noted in the text all of these sorts of analyses are sensitive to assumptions within models. They used a variety of methods which came to different results, but, one thing that seems clear is that Africa had a lot of deep structure for a long time, but gene flow between regional populations meant that genetic differentiation emerged gradually, rather than in a rapid fashion due to geographic separation. Over five years ago Iain Mathieson casually told me that he viewed much of the past 200,000 years as the collapse of deep population structure, and that does not seem to have been a crazy prediction if you read through this preprint (though the collapse may be increased rates of gene flow, rather than massive pulse admixtures).

But the separation and differentiation outside of Africa, and between the archaic lineages and Africans, seems to have exhibited more punctuation. For the past twenty years John Hawks has been emphasizing that we need to remember that during the Pleistocene, Africa likely had a much larger hominin population than the rest of the world (with perhaps a caveat for lower latitude Asia). The relatively “clean” separation between the proto-modern African lineage and the Eurasian hominins, and then the quick separation between Neanderthals and the eastern group which became Denisovans, emphasizes perhaps the importance of particular geographic barriers (deserts in the Near East), as well as the lower carrying capacity in much of Eurasia. With lower population densities and patchy occupation patterns, gene flow would be sharply reduced. This would result in drift and sharply different lineages.
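The intuition that sparse, patchy populations drift apart can be sketched with Wright's classic island-model approximation, where equilibrium differentiation is roughly FST ≈ 1 / (1 + 4·Ne·m). This is a textbook idealization, not an analysis from the preprint, and the population sizes and migration rates below are hypothetical values picked only to contrast a dense, well-connected region with a sparse, poorly connected one.

```python
def equilibrium_fst(n_e, m):
    """Wright's island-model approximation: FST ~ 1 / (1 + 4 * Ne * m).

    n_e -- effective population size of each deme
    m   -- per-generation migration rate into each deme
    """
    return 1.0 / (1.0 + 4.0 * n_e * m)


# Hypothetical scenarios, for illustration only.
dense_connected = equilibrium_fst(n_e=10_000, m=1e-3)  # 4*Ne*m = 40
sparse_patchy = equilibrium_fst(n_e=1_000, m=1e-4)     # 4*Ne*m = 0.4

print(f"dense, connected demes: FST ~ {dense_connected:.3f}")
print(f"sparse, patchy demes:   FST ~ {sparse_patchy:.3f}")
```

In this toy setting, shrinking both deme size and migration by a factor of ten moves the populations from barely differentiated to strongly differentiated, which is the qualitative contrast drawn above between Africa and much of Pleistocene Eurasia.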

There are arguments out there about whether humans are a clinal species or not. These verbal descriptions really don’t tell us much. The combination of ancient DNA and whole genome data will allow us to specify, at particular times and places, the nature of population dynamics. If human population relationships can be thought of as a graph, a set of nodes connected by edges, in some areas the connections will be thicker (ergo, lots of continuous gene flow), and in other areas, the graphs will be easier to represent as diverging trees.

I think the last 10,000 years of the Holocene has brought to Eurasia a more African pattern, as deep structure comes crashing down due to rapid population expansion and mixing….

4 thoughts on “Whole genome sequencing comes to Cavalli-Sforza’s samples”

  1. Regarding the single pulse Neanderthal introgression, I had the same thought as you – if all Neanderthals starting from around 100,000 years ago (or maybe more, I don’t remember) or so were super inbred and extremely similar, then how could they tell if there was truly only a single pulse admixture event? Figuring out whether there was one or more introgression events is super important for refining the timing of OoA and the subsequent divergence between Eurasians themselves.

    Another question I had was regarding their split time estimates – they date the non-African/Yoruba split at 76k years, when I’ve previously seen it reported around 100k. I know I’ve seen in previous papers that the divergence between West and East Eurasians started around 40,000 years ago, which is certainly too recent – Tianyuan at 40,000 years was already a clear East Eurasian. I’m wondering if these youngish split estimates we see in some of these papers is skewed by more recent bi-directional gene flow between populations, well after their core ancestries truly started diverging from one another.

  2. I’m wondering if these youngish split estimates we see in some of these papers is skewed by more recent bi-directional gene flow between populations, well after their core ancestries truly started diverging from one another.

    i think so. though for eur vs. afr has to be early as they suggest all eurasians same distance from afr. they argue what you are saying for africans. that some of the coalescences is just too old for the population scale split times.

  3. “Cavalli-Sforza was an ethnographic genius in comparison to most geneticists and had selected very interesting and informative populations.”

    A nice tribute to the late, great scientist.

  4. Cavalli-Sforza made some clever choices of samples even
    within his populations. For instance most modern Sardinians
    show evident admixture almost surely from the Italian mainland.
    But the HGDP samples have much less of this admixture, and therefore
    are closer genetically to the first farmers of Europe. I do not think
    this is just luck.

    Nick Patterson
