The Greeks in the mountains

The New Yorker has a long feature that explores the strange results from the paper last year, Ancient DNA from the skeletons of Roopkund Lake reveals Mediterranean migrants in India. Basically, they found a bunch of Indians who died 1,000 years ago, and a bunch of Greeks who died a few centuries ago. They were buried naturally in a very isolated lake high in the Himalayas. There are all sorts of hypotheses regarding the Greeks, whose bones indicate a Mediterranean diet, and whose closest genetic match is to individuals in Crete. My personal experience is that “mainland Greeks” tend to be a bit Northern European shifted, so these individuals may have been Anatolian or Aegean Greeks.

Stuart Fidel, who sometimes comments on this weblog, suggests these were Armenian traders. But David Reich correctly points out Armenians are very distinct genetically from Greeks (though the two are not entirely different obviously!). Another hypothesis is a bone mix-up, but the issue here is there are a lot of individuals who are of the same population and seem to have lived in the same region. How could bone mix-ups produce so many systematic errors?

Ultimately there’s no final answer in the piece, though hopefully, someone will present a reasonable conjecture.

Because the piece has Reich and his lab spotlighted, they allude to the controversy around him. This is ultimately going to be the legacy of the hit-piece from a few years back. He’s now a “controversial figure,” which is, to be frank, not a bad thing in the eyes of some of the Reich lab’s scientific rivals. Most media treatments that aren’t purely about his research (e.g., Carl Zimmer’s column in The New York Times covering the Reich lab publications) will mention this now.

Here’s why he’s a mensch:

Still, some anthropologists, social scientists, and even geneticists are deeply uncomfortable with any research that explores the hereditary differences among populations. Reich is insistent that race is an artificial category rather than a biological one, but maintains that “substantial differences across populations” exist. He thinks that it’s not unreasonable to investigate those differences scientifically, although he doesn’t undertake such research himself. “Whether we like it or not, people are measuring average differences among groups,” he said. “We need to be able to talk about these differences clearly, whatever they may be. Denying the possibility of substantial differences is not for us to do, given the scientific reality we live in.”

This, in 2020, is an old-fashioned view. There are now young American researchers who frankly express disquiet and discomfort at the idea of studying human population genetic variation, period, including people who themselves have studied topics such as polygenic adaptation in humans. This would be a very strange view for older researchers, but it’s not totally out of the norm today, so expect someone like Reich to be viewed as quite the dinosaur in a decade. It seems ridiculous to say, but I do wonder if we’re seeing the end of the “humans as a model organism” era. Lots of people are not happy with the new atmosphere, but lots of people just keep quiet and go along.


Correlated response is a big story of selection

Adaptation is clearly one of the most important processes in understanding how evolution occurs. In a classical sense, it’s easy to understand. Parallel adaptations in body plans make dolphins and swordfish shaped the same. It’s physics.

But with the emergence of DNA, a lot of the focus on adaptation has been displaced to the signatures of natural selection on the molecular level. Phenotypes are controlled by variation in genotypes, and instead of description and hypothesizing, researchers can actually infer from the genetic patterns the history and arc of adaptation. 

At least that’s the theory.

The initial tests for signatures of natural selection focused on adaptation between species. For example, Tajima’s D. Usually this took the form of comparing variation across two lineages of Drosophila. In the 2000s, with genome-wide data, new methods predicated on looking at ‘haplotype structure’ (variation across sequences of genes) emerged. Instead of between species, these methods focused on selection within species (e.g., why are some humans adapted to malaria?). These methods were good at picking up strong signals at a few genes where the selective sweeps were recent.

But as datasets and genomics got bigger and better, researchers focused on more fundamental patterns and analyses, such as looking at ‘site frequency spectra.’ Ultimately the goal was to go beyond selection at a single locus (e.g., lactase persistence), and understand polygenic characteristics (e.g., height). Obviously, this is much harder because polygenic characters are distributed across many genetic loci, and issues of statistical power are always going to loom large (and there is the soft vs hard sweep issue too!).
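For readers who haven’t met a site frequency spectrum before, here is a minimal sketch of what one is: a tally of how many variable sites carry the derived allele in exactly 1, 2, 3, ... of the sampled chromosomes. The function name and the toy haplotype matrix are made up for illustration and are not taken from any of the papers discussed here.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k-1 counts the sites where the derived allele
    (coded 1) appears in exactly k of the n sampled chromosomes, 0 < k < n."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # derived-allele count at each site, summed down the columns
    counts = Counter(sum(h[s] for h in haplotypes) for s in range(n_sites))
    # sites that are absent (k = 0) or fixed (k = n) in the sample are dropped
    return [counts.get(k, 0) for k in range(1, n)]

# hypothetical toy data: 4 chromosomes x 5 sites
toy = [
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
print(site_frequency_spectrum(toy))  # [2, 1, 0]
```

Selection and demography both distort the shape of this tally relative to the neutral expectation, which is why so many tests are built on top of it.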

A new preprint is an excellent introduction to this wild world, Disentangling selection on genetically correlated polygenic traits using whole-genome genealogies:

We present a full-likelihood method to estimate and quantify polygenic adaptation from contemporary DNA sequence data. The method combines population genetic DNA sequence data and GWAS summary statistics from up to thousands of nucleotide sites in a joint likelihood function to estimate the strength of transient directional selection acting on a polygenic trait. Through population genetic simulations of polygenic trait architectures and GWAS, we show that the method substantially improves power over current methods. We examine the robustness of the method under uncorrected GWAS stratification, uncertainty and ascertainment bias in the GWAS estimates of SNP effects, uncertainty in the identification of causal SNPs, allelic heterogeneity, negative selection, and low GWAS sample size. The method can quantify selection acting on correlated traits, fully controlling for pleiotropy even among traits with strong genetic correlation (|rg| = 80%; c.f. schizophrenia and bipolar disorder) while retaining high power to attribute selection to the causal trait. We apply the method to study 56 human polygenic traits for signs of recent adaptation. We find signals of directional selection on pigmentation (tanning, sunburn, hair, P=5.5e-15, 1.1e-11, 2.2e-6, respectively), life history traits (age at first birth, EduYears, P=2.5e-4, 2.6e-4, respectively), glycated hemoglobin (HbA1c, P=1.2e-3), bone mineral density (P=1.1e-3), and neuroticism (P=5.5e-3). We also conduct joint testing of 137 pairs of genetically correlated traits. We find evidence of widespread correlated response acting on these traits (2.6-fold enrichment over the null expectation, P=1.5e-7). We find that for several traits previously reported as adaptive, such as educational attainment and hair color, a significant proportion of the signal of selection on these traits can be attributed to correlated response, vs direct selection (P=2.9e-6, 1.7e-4, respectively). 
Lastly, our joint test uncovers antagonistic selection that has acted to increase type 2 diabetes (T2D) risk and decrease HbA1c (P=1.5e-5).

There’s a lot going on here. This is my favorite passage:

To address these issues, we recently developed a full-likelihood method, CLUES, to test for selection and estimate allele frequency trajectories. The method works by stochastically integrating over both the latent ARG using Markov Chain Monte Carlo, and the latent allele frequency trajectory using a dynamic programming algorithm, and then using importance sampling to estimate the likelihood function of a focal SNP’s selection coefficient, correcting for biases in the ARG due to sampling under a neutral model.

Alrighty then! Someone’s a major-league nerd.

The preprint is fine, but ultimately this is something you get a “feel” for by working with models, data, and general analyses in the field. And I don’t have a strong feel since I don’t work with these sorts of data and questions myself. So what do I know? That being said, I like the preprint because it satisfies an intuition I’ve long had: correlated response is a big part of the story of polygenic selection.

Basically, you have to remember that complex traits are subject to variation at a host of genetic positions. And genetic variants rarely have singular effects. That is, one locus usually exhibits pleiotropy. The genetic effect shapes a lot of characteristics. Therefore, if there is strong selection on a gene, more traits than simply the target of selection will be impacted. In animal breeding, making huge, meaty, fast-growing lineages can render them infertile if selection is taken too far. That’s a bad correlated response.
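The textbook way to write down correlated response is the multivariate breeder’s equation from quantitative genetics: the change in trait means is the genetic variance-covariance matrix G multiplied by the vector of selection gradients. A minimal sketch follows, with entirely made-up numbers; this is not the method used in the preprint, just the basic intuition in a few lines.

```python
def response_to_selection(G, beta):
    """Multivariate breeder's equation: delta_z = G @ beta.
    G is the additive genetic (co)variance matrix, beta the selection gradients."""
    return [sum(g * b for g, b in zip(row, beta)) for row in G]

# hypothetical two-trait example: unit genetic variances, genetic covariance 0.6
G = [[1.0, 0.6],
     [0.6, 1.0]]
# selection acts directly on trait 1 only
beta = [0.5, 0.0]

print(response_to_selection(G, beta))
```

Trait 2 has a selection gradient of zero here, yet its mean still shifts because of the genetic covariance. That shift is the correlated response.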

After correcting for the genetic correlation, the authors note that some traits, such as EDU and hair color, are not really selected directly at all. This is like the fact that we know EDAR is associated with hair thickness and is a strong target of selection. We have no idea what the trait of interest is. But it’s a pretty big deal. All these quantitative traits controlled by variation across the genome are being reshaped by adaptation on other traits. What are those traits? This preprint doesn’t answer that really.

Hopefully, we’ll make some headway in the 2020s because we’re definitely looking through the mirror darkly.


Humans are basically invasive weeds

One of the somewhat surprising things we have learned over the last decade is that massive admixture and homogenization has occurred between distinct human lineages over the last 10,000 years. By this, I mean that we’re not talking simply about continuous gene-flow between neighboring populations, but massive expansions of small groups and their assimilation of groups very different from themselves. As a stylized fact, it looks like “Early European Farmers” were as distinct from Mesolithic hunter-gatherers as modern Northern Europeans are from Han Chinese (pairwise Fst ~0.10). The population produced by the fusion of these two groups later merged in much of Europe with migrants from the east, the western edge of the forest-steppe.
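To make the Fst number concrete, here is a rough sketch of Wright’s Fst for two equally weighted populations, computed as (H_T - H_S) / H_T from allele frequencies and combined across loci as a ratio of sums. Published estimates use estimators that correct for sample size, so treat this as an illustration only; the function name and frequencies are hypothetical.

```python
def fst(p1, p2):
    """Wright's Fst for two equally weighted populations.
    p1, p2: per-locus frequencies of the same allele in each population."""
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # mean expected heterozygosity within the two populations
        h_s = (2 * a * (1 - a) + 2 * b * (1 - b)) / 2
        # expected heterozygosity of the pooled population
        p_bar = (a + b) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        num += h_t - h_s
        den += h_t
    return num / den if den > 0 else 0.0

# hypothetical single locus: frequencies 0.3 and 0.7 give Fst of about 0.16
print(fst([0.3], [0.7]))
```

An Fst of ~0.10 therefore means roughly a tenth of the total heterozygosity is accounted for by the frequency differences between the two groups.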

The empirical pattern seems to be that cultural innovations (e.g., agriculture) trigger demographic revolutions, which homogenize and admix vast regions. This is a story of demographic history. Phylogeography.

But there is another aspect, natural selection. Humans are not exempt from this. Selection operates upon genetic variation, which is preexistent (“standing variation”), or, comes from new mutations (de novo).

It seems plausible that cultural innovation has resulted in a great deal of selection over the last 10,000 years. So where did the raw material come from? One argument that has been playing out is between those who argue that it comes from ancestral variation shared within human populations, and those who argue that it comes from new variation. This is where admixture comes into play.

A new preprint on bioRxiv uses the 1000 Genomes data in the New World to suggest that admixture resulted in the introduction of a lot of adaptive alleles into populations of mostly European and Native background from African ancestry. Basically, it seems likely that the American tropics were colonized by African tropical diseases, which entailed adaptations which were already existent within African populations. Admixture-enabled selection for rapid adaptive evolution in the Americas:

Background: Admixture occurs when previously isolated populations come together and exchange genetic material. We hypothesized that admixture can enable rapid adaptive evolution in human populations by introducing novel genetic variants (haplotypes) at intermediate frequencies, and we tested this hypothesis via the analysis of whole genome sequences sampled from admixed Latin American populations in Colombia, Mexico, Peru, and Puerto Rico. Results: Our screen for admixture-enabled selection relies on the identification of loci that contain more or less ancestry from a given source population than would be expected given the genome-wide ancestry frequencies. We employed a combined evidence approach to evaluate levels of ancestry enrichment at (1) single loci across multiple populations and (2) multiple loci that function together to encode polygenic traits. We found cross-population signals of African ancestry enrichment at the major histocompatibility locus on chromosome 6, consistent with admixture-enabled selection for enhanced adaptive immune response. Several of the human leukocyte antigen genes at this locus (HLA-A, HLA-DRB51 and HLA-DRB5) showed independent evidence of positive selection prior to admixture, based on extended haplotype homozygosity in African populations. A number of traits related to inflammation, blood metabolites, and both the innate and adaptive immune system showed evidence of admixture-enabled polygenic selection in Latin American populations. Conclusions: The results reported here, considered together with the ubiquity of admixture in human evolution, suggest that admixture serves as a fundamental mechanism that drives rapid adaptive evolution in human populations.
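The core idea of the screen, looking for loci with more or less ancestry from one source than the genome-wide average predicts, can be sketched as a simple z-score. This is an assumption-laden toy and not the authors’ actual statistic: the function name, the choice of a z-score, and the ancestry fractions below are all invented for illustration.

```python
from statistics import mean, pstdev

def ancestry_enrichment_z(local_fractions):
    """Z-score of one source's ancestry fraction at each locus,
    relative to the mean and spread across all loci."""
    mu = mean(local_fractions)
    sd = pstdev(local_fractions)
    if sd == 0:
        return [0.0] * len(local_fractions)
    return [(x - mu) / sd for x in local_fractions]

# hypothetical African ancestry fractions at five loci in an admixed population
local_african = [0.10, 0.11, 0.09, 0.10, 0.30]
z = ancestry_enrichment_z(local_african)
# the last locus stands out as a candidate for ancestry enrichment
print([round(v, 2) for v in z])
```

A real analysis also has to worry about drift since admixture and errors in local ancestry calls, both of which can produce outliers without any selection.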

The period after 1492 is easy for us to think about. But what ancient DNA has shown us is that it’s not as uncommon a phase as we might have thought.


Whole genome sequencing comes to Cavalli-Sforza’s samples

More than twenty years ago L. L. Cavalli-Sforza published The History and Geography of Human Genes. Based on decades of analysis of ‘classical’ markers, this work lays out results of statistical genetic analyses based on a few hundred genes, as well as displaying Cavalli-Sforza’s encyclopedic ethnographic knowledge. A close look at this book will yield some familiar population groups to readers of this weblog. The reason for this is simple: the cell lines continued onward to contribute to the HGDP data set.

In 2002 Rosenberg et al. used these populations in Genetic Structure of Human Populations by looking at “377 autosomal microsatellite loci.” Microsatellites are highly variable genetic regions. They pack a lot of diversity per locus. With more input variation Rosenberg et al. advanced beyond Cavalli-Sforza’s earlier work (instead of pairwise comparisons between populations, one could infer individual relatedness as displayed in a bar plot).

But times change, and in 2008 the same data set was used in Worldwide human relationships inferred from genome-wide patterns of variation, which utilized a 650,000 marker SNP-array. Though Rosenberg et al.’s work advanced the ball considerably, the move to genome-wide analysis was even bigger. For many years this data set has been a widely used benchmark and reference (these markers and populations were part of the early basis of 23andMe’s analyses in terms of population genetic inference). As the 1000 Genomes Project moved us beyond the SNP-array period, looking at the whole genome, as opposed to a specific set of SNPs, the HGDP populations were still an important complement.

The reason was simple: Cavalli-Sforza was an ethnographic genius in comparison to most geneticists and had selected very interesting and informative populations. In some ways, the original motivation given for selecting these groups, that they may have preserved phylogenetic patterns obscured in cosmopolitan populations, has only been partially justified.

Ancient DNA has shed light on the reality that almost all populations, indigenous and cosmopolitan, come out of periods of admixture between lineages which had heretofore been distinct and separate. But some of Cavalli-Sforza’s populations have been inordinately important in informing us about branches of the human family tree less well represented in the cosmopolitan samples accessible in the 1000 Genomes Project (or earlier, the HapMap). I’m thinking here of the Kalash (a relatively good proxy for “Ancestral North Indians”), Sardinians (the best representatives in the modern world of “Early European Farmers”), and African hunter-gatherers (who carry the deepest diverging lineage within the modern human clade).

With all that, finally, the HGDP whole genome preprint is out. Anders Bergstrom superstar!

Insights into human genetic variation and population history from 929 diverse genomes:

Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented private genetic variation in southern and central Africa and in Oceania and the Americas, but an absence of fixed, private variants between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the last 10,000 years, a potentially major population growth episode after the peopling of the Americas, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations. We also demonstrate benefits to the study of population relationships of genome sequences over ascertained array genotypes. These genome sequences are freely available as a resource with no access or analysis restrictions.

The authors were able to make recourse to many more subtle analytic methods with their phasing, which seems to have been considerably superior to population phasing (the HGDP does have some closely related individuals due to endogamy, but no traditional trios). Because their population set included some undersampled groups with a lot of diversity (e.g., San Bushmen), they detected about as many variants with ~1,000 individuals at good coverage as the 1000 Genomes Project with ~2,500 individuals at variable coverage.

And there are major lacunae within the 1000 Genomes Project data set even after taking into account ethnographically and historically important groups such as the San Bushmen. There are no Middle Eastern populations in the 1000 Genomes Project. The HGDP has the Druze, Bedouin, Palestinians, and Mozabites.

The preprint requires a lot of deep reading. There is much in there that one can mull over (frankly, I’m excited about the supplementary text, but that’s just me). One thing that came to mind is that ancient DNA and other more narrow studies laid the groundwork for the interpretations that naturally fall out of this extremely rich potential set of analyses. For example, by looking at shared variants across western and central Africa the authors confirm the likely result that there is a basal human population of some sort mixed into peoples of far western Africa. And, they also confirm that the Yoruba are ~5-10% Eurasian.

These sequences generate so much data that there are lots of potential models that might conform to them. Earlier work eliminates some possibilities and highlights others.

Ancient DNA has confirmed for many that non-Africans have Neanderthal ancestry. But there have been several debates about whether there are issues with the assumption that Africans have no Neanderthal ancestry, and how it skews statistics (e.g., if Africans have some Neanderthal one will underestimate the Neanderthal fraction). Though there are still details to be hashed out, looking at coalescence patterns of haplotypes the authors seem to be able to infer the presence of deeply diverged lineages in various populations without positing a prior model of which populations did not have the introgression as a baseline. Basically, Neanderthal and Denisovan ancestry is going to result in some “long branches” in the phylogenies of the genes within non-African populations which are lacking in Africans, and that is what they see.

These researchers also confirm the model presented by others that Neanderthal contribution seems to have been from a single admixture event (I do wonder if perhaps Neanderthals were not simply extremely homogeneous, so multiple close admixture events may not be differentiable). They also find that the “Denisovan” population structure was more complex, and there were several admixture events into eastern Eurasian and Oceanian populations.

Finally, there are attempts to adduce the nature of population differentiation, and times of separation. As noted in the text all of these sorts of analyses are sensitive to assumptions within models. They used a variety of methods which came to different results, but, one thing that seems clear is that Africa had a lot of deep structure for a long time, but gene flow between regional populations meant that genetic differentiation emerged gradually, rather than in a rapid fashion due to geographic separation. Over five years ago Iain Mathieson casually told me that he viewed much of the past 200,000 years as the collapse of deep population structure, and that does not seem to have been a crazy prediction if you read through this preprint (though the collapse may be increased rates of gene flow, rather than massive pulse admixtures).

But the separation and differentiation outside of Africa, and between the archaic lineages and Africans, seems to have exhibited more punctuation. For the past twenty years John Hawks has been emphasizing that we need to remember that during the Pleistocene Africa likely had a much larger population than the rest of the world for hominins (with perhaps a caveat for lower latitude Asia). The relatively “clean” separation between the proto-modern African lineage and the Eurasian hominins, and then the quick separation between Neanderthals and the eastern group which became Denisovans, emphasizes perhaps the importance of particular geographic barriers (deserts in the Near East), as well as the lower carrying capacity in much of Eurasia. With lower population densities and patchy occupation patterns, gene flow would be sharply reduced. This would result in drift and sharply different lineages.

There are arguments out there about whether humans are a clinal species or not. These verbal descriptions really don’t tell us much. The combination of ancient DNA and whole genome data will allow us to specify, at specific times and places, the nature of population dynamics. If human population relationships can be thought of as a graph, a set of interconnected edges, in some areas the connections will be thicker (ergo, lots of continuous gene flow), and in other areas, the graphs will be easier to represent as diverging trees.

I think the last 10,000 years of the Holocene have brought to Eurasia a more African pattern, as deep structure comes crashing down due to rapid population expansion and mixing….


Selection for and against pigmentation alleles in South Asia

Deepika Padukone

Recently some British friends were asking about what we knew about South Asian historical genetics now. I explained that it does look like there was some migration in from the Central Asian steppe and West Asia into South Asia during the Holocene. To which one friend responded, “that’s obvious though, many Indians look like brown white people.” Setting aside the semantic paradox (if you are brown, you are literally not white), it is clear what he is getting at: due to shared ancestry the facial structure of many South Asians is not that different from West Eurasians.

The Bollywood actress Deepika Padukone is an example of someone who is rather brown-skinned (naturally), but whose facial features are such that if she went with 100% skin-bleaching she would pass as white without too much trouble. For the purposes of this post, I Googled Indian albino…and came up with this family. You can make your own judgments. I don’t know what to think of that!

The reason for this post is a newly accepted paper, Ancestry-specific analyses reveal differential demographic histories and opposite selective pressures in modern South Asian populations:

Genetic variation in contemporary South Asian populations follows a northwest to southeast decreasing cline of shared West Eurasian ancestry. A growing body of ancient DNA evidence is being used to build increasingly more realistic models of demographic changes in the last few thousand years. Through high quality modern genomes, these models can be tested for gene and genome level deviations. Using local ancestry deconvolution and masking, we reconstructed population-specific surrogates of the two main ancestral components for more than 500 samples from 25 South Asian populations, and showed our approach to be robust via coalescent simulations.

Our f3 and f4 statistics based estimates reveal that the reconstructed haplotypes are good proxies for the source populations that admixed in the area and point to complex inter-population relationships within the West Eurasian component, compatible with multiple waves of arrival, as opposed to a simpler one wave scenario. Our approach also provides reliable local haplotypes for future downstream analyses. As one such example, the local ancestry deconvolution in South Asians reveals opposite selective pressures on two pigmentation genes (SLC45A2 and SLC24A5) that are common or fixed in West Eurasians, suggesting post-admixture purifying and positive selection signals, respectively.
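For those who haven’t worked with f-statistics, they are just averages of products of allele frequency differences. A bare-bones sketch is below, ignoring the finite-sample bias corrections and block-jackknife standard errors that real analyses use; the function names and frequencies are hypothetical.

```python
def f4(pa, pb, pc, pd):
    """f4(A, B; C, D): mean over SNPs of (pA - pB) * (pC - pD).
    Expected to be ~0 if A,B and C,D form two clades with no gene flow between them."""
    return sum((a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)) / len(pa)

def f3(pc, pa, pb):
    """f3(C; A, B): mean over SNPs of (pC - pA) * (pC - pB).
    A clearly negative value is evidence that C is admixed between
    populations related to A and B."""
    return sum((c - a) * (c - b) for c, a, b in zip(pc, pa, pb)) / len(pc)

# hypothetical frequencies at three SNPs; C is a 50/50 mixture of A and B
pA = [0.1, 0.9, 0.5]
pB = [0.9, 0.1, 0.5]
pC = [(a + b) / 2 for a, b in zip(pA, pB)]

print(f3(pC, pA, pB))  # negative, as expected for an admixed target
```

The intuition for the negative f3 is that an admixed population’s frequencies sit between those of its sources, so the two differences have opposite signs.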



Genes, memes, and Mundas

The Munda languages of the northeastern quadrant of the Indian subcontinent are quite interesting because they are more closely related to the Austro-Asiatic languages of Southeast Asia than to the Indo-Aryan or Dravidian languages which are spoken by their neighbors. The Munda are usually classified as adivasi, which has connotations of being an ‘original inhabitant’ of the Indian subcontinent.

More concretely, the Munda have traditionally operated outside of the bounds of Sanskrit-influenced Hindu civilizations, occupying upland zones and governing themselves as tribal units, rather than being a caste population.

What the field of genetics tells us is that there are really no true aboriginal inhabitants of the Indian subcontinent in an unmixed form. That is, the vast majority of people in the Indian subcontinent have a substantial contribution of ancestry from the wave of migration out of Africa that occupied the southeast fringe of Eurasia beginning ~50-60,000 years ago. The modern adivasi generally are defined more by their social-cultural position within the landscape of Indian culture, as opposed to their long-term residence in the subcontinent.*

The term is a particular misnomer for the Munda because of the evidence that they are intrusive to the subcontinent from Southeast Asia. We have ancient DNA and archaeology which indicates that upland rice farmers, likely Austro-Asiatic, arrived in northern Vietnam ~4,000 years ago. This makes it unlikely to me that they were in India much earlier. The Y chromosomal data indicate that the paternal ancestry of the Munda derives from Southeast Asians, not the other way around.

A new genome-wide analysis of the Southeast Asian fraction of Munda ancestry suggests that it can be as high as ~30%. The paper is The genetic legacy of continental scale admixture in Indian Austroasiatic speakers:

Surrounded by speakers of Indo-European, Dravidian and Tibeto-Burman languages, around 11 million Munda (a branch of Austroasiatic language family) speakers live in the densely populated and genetically diverse South Asia. Their genetic makeup holds components characteristic of South Asians as well as Southeast Asians. The admixture time between these components has been previously estimated on the basis of archaeology, linguistics and uniparental markers. Using genome-wide genotype data of 102 Munda speakers and contextual data from South and Southeast Asia, we retrieved admixture dates between 2000–3800 years ago for different populations of Munda. The best modern proxies for the source populations for the admixture with proportions 0.29/0.71 are Lao people from Laos and Dravidian speakers from Kerala in India. The South Asian population(s), with whom the incoming Southeast Asians intermixed, had a smaller proportion of West Eurasian genetic component than contemporary proxies. Somewhat surprisingly Malaysian Peninsular tribes rather than the geographically closer Austroasiatic languages speakers like Vietnamese and Cambodians show highest sharing of IBD segments with the Munda. In addition, we affirmed that the grouping of the Munda speakers into North and South Munda based on linguistics is in concordance with genome-wide data.

The paper already came out as a preprint many months back, so I’ve already mentioned it. The big finding, to me, is that it uses genome-wide methods to estimate an admixture date in the range of ~4,000 years ago between the Southeast Asian and South Asian ancestral components of the southern Munda. It also confirms something that has been pretty evident for nearly ten years of genome-wide analysis of South Asian population genetics: the Munda have less West Eurasian ancestry even after you account for the Southeast Asian admixture than any mainland Indian population outside of the Tibeto-Burman fringe.
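What does an admixture proportion like 0.29/0.71 mean mechanically? Under the simplest possible model, the allele frequency in the mixed population at each SNP is a weighted average of the two source frequencies, and the weight can be recovered by least squares. This is only a sketch of that idea, not the method used in the paper; the function name and the frequencies are invented, and real data add drift and sampling noise.

```python
def estimate_admixture_fraction(p_mixed, p_source1, p_source2):
    """Least-squares estimate of alpha in
    p_mixed = alpha * p_source1 + (1 - alpha) * p_source2."""
    num = 0.0
    den = 0.0
    for m, a, b in zip(p_mixed, p_source1, p_source2):
        num += (m - b) * (a - b)
        den += (a - b) ** 2
    if den == 0:
        raise ValueError("source populations have identical frequencies")
    return num / den

# hypothetical frequencies at three SNPs in the two source proxies
p_southeast_asian = [0.8, 0.2, 0.6]
p_south_asian = [0.1, 0.5, 0.3]
# build a mixed population with 29% Southeast Asian ancestry, no drift or noise
p_munda_like = [0.29 * a + 0.71 * b
                for a, b in zip(p_southeast_asian, p_south_asian)]

print(round(estimate_admixture_fraction(p_munda_like, p_southeast_asian, p_south_asian), 2))  # 0.29
```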

In Narasimhan et al. the authors present a model that fits the data where:

  1. The proto-Munda mix with an “Ancient Ancestral South Indian” (AASI) population that has no West Eurasian admixture in India’s northeast
  2. Then, mix more with an “Ancestral South Indian” (ASI) population that has some West Eurasian admixture



Patterns of genetic diversity within Africa

The violin-plot above is from a new preprint, Runs of Homozygosity in sub-Saharan African populations provide insights into a complex demographic and health history. Here’s the abstract:

The study of runs of homozygosity (ROH), contiguous regions in the genome where an individual is homozygous across all sites, can shed light on the demographic history and cultural practices. We present a fine-scale ROH analysis of 1679 individuals from 28 sub-Saharan African (SSA) populations along with 1384 individuals from 17 world-wide populations. Using high-density SNP coverage, we could accurately obtain ROH as low as 300Kb using PLINK software. The analyses showed a heterogeneous distribution of autozygosity across SSA, revealing a complex demographic history. They highlight differences between African groups and can differentiate between the impact of consanguineous practices (e.g. among the Somali) and endogamy (e.g. among several Khoe-San groups). The genomic distribution of ROH was analysed through the identification of ROH islands and regions of heterozygosity (RHZ). These homozygosity cold and hotspots harbour multiple protein coding genes. Studying ROH therefore not only sheds light on population history, but can also be used to study genetic variation related to the health of extant populations.

This sort of run-of-homozygosity analysis is enabled by high-density genotyping or whole-genome sequencing. After quality control, the authors had 1 to 1.5 million SNPs for all populations.

The interesting thing about this preprint is that by looking at the violin-plots you can see exactly all the things that population geneticists have learned about the demography, structure, and history of humans in the past generation or so.

  • The rightmost panel shows the average total length of short ROH. Partly the pattern fits into the older serial bottleneck model of the settlement of the world: Amerindian > East Asian > European > African. But what about the lower fractions for mixed Latin Americans and Gujaratis? This is a consequence of admixture, as these populations are mixtures in a sense of other groups.
  • The length of the long ROH segments, the second to last panel on the right, is indicative of recent patterns of marriage. Within Africa, you see some groups have many individuals with lots of long ROH segments. This is because of consanguinity. As the authors observe, the Oromo and Somali are both Cushitic speaking groups from the Horn of Africa, but the latter are universally Muslim, while only a minority of the former are. Islamic cultures have traditionally encouraged consanguineous marriages, and you can see the difference between these groups (whose total length of short segments is similar).
  • The pattern of ROH here can be predicted by simple genetic models: the extent of random mating within populations, recombination rates across the genome, and total population size. What modern genomic technology does is provide data to test the models.
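To make the link between marriage patterns and long ROH concrete, here is a minimal sketch using Wright's path-counting formula for the inbreeding coefficient F, which is the expected fraction of the genome that is autozygous. The autosomal genome length and the choice of relationships are my own assumptions for illustration, not values from the preprint.

```python
# Wright's path-counting formula for the inbreeding coefficient F of a child
# whose parents are related through non-inbred common ancestors.
AUTOSOME_MB = 2875  # assumed approximate autosomal genome length in megabases


def inbreeding_coefficient(path_lengths):
    """Sum (1/2)**n over loops, where n is the number of individuals in the
    loop linking the two parents through one common ancestor (the parents
    and the common ancestor are all counted)."""
    return sum(0.5 ** n for n in path_lengths)


# Each pair of relatives below shares two common ancestors, hence two loops.
relationships = {
    "uncle-niece": [4, 4],
    "first cousins": [5, 5],
    "second cousins": [7, 7],
}

for name, loops in relationships.items():
    f = inbreeding_coefficient(loops)
    print(f"{name}: F = {f:.4f}, expected autozygous genome ~ {f * AUTOSOME_MB:.0f} Mb")
```

A child of first cousins is expected to have about 1/16 of the genome in long autozygous tracts, which is why a tradition of cousin marriage shows up so clearly in the long ROH panel even when the short ROH totals of two groups are similar.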



The lost 50,000 years of non-African humanity

The figure above is from Efficiently inferring the demographic history of many populations with allele count data. This preprint came out a few months ago, but I was prompted to revisit it after reading Spectrum of Neandertal introgression across modern-day humans indicates multiple episodes of human-Neandertal interbreeding.

The latter paper indicates that there were multiple waves of Neanderthal admixture into both Europeans and East Asians. The motivation to do the analysis is that East Asians are ~12 percent more Neanderthal than Europeans. The authors don’t reject the idea that there was ‘dilution’ of Neanderthal ancestry through selection and especially admixture with a “Basal Eurasian” group which didn’t have Neanderthal ancestry. I don’t want to get into the details of the results except for one thing: the preprint confirms a consistent finding over the past eight years that the Neanderthal contribution to the modern human genome is from a single population.

Perhaps it was a small population. Or perhaps it was a large population that had gone through a bottleneck and was genetically not very differentiated. But unlike Denisovans it seems that it was a particular Neanderthal lineage that interacted with modern humans.

Moving back to the “Basal Eurasians,” notice some details of the schematic above. The divergence of Basal Eurasians from other non-Africans was ~80,000 years ago, across an interval of 70 to 100 thousand years ago. The admixture of Basal Eurasians into the proto-LBK population occurred ~30,000 years ago, across an interval of 11 to 41 thousand years ago. Ancient DNA from North Africa indicates that Basal Eurasians were already admixed well before 11 thousand years ago.

The other dates make sense. 50,000 years for Europeans-Han Chinese, 96,000 years for Mbuti-Eurasians, and 696,000 years for Neanderthal-modern humans.

Ancient modern humans were highly structured. We know this from within Africa. But it seems clear that modern humans who had crossed over to the other side of the Sahara also exhibited the same tendency. Basal Eurasians did not mix with Neanderthal populations. I suspect that that might be due to the fact that they were in Northeast Africa. At some point in the Pleistocene a mixing event occurred. This may have been precipitated by drier conditions and human retreat into only a few habitable areas, and the original Basal Eurasian populations may have mixed into other Near Eastern groups, which were part of the broader Neanderthal-mixed populations.


The great bottleneck after the post-Eemian separation

I’ve been thinking about effective population size. Basically it’s the inferred breeding population you estimate in the present, or in many cases the past, based on the genetic variation you see within the population. Another way to say it is that it’s the population size that can explain the genetic drift that you see in the data.

To give a concrete example, the population of the New England states of America was ~1,000,000 during the 1790 Census. The vast majority of this was due to natural increase from a settler population of ~50,000 in 1650 (the total fertility rate of women in New England was seven children in the years between 1650 and 1700). Of these, ~23,000 were Puritans or the offspring of Puritans who migrated between 1630 and 1643 (due to religious differences with the English government of the period). One might think that a population of ~1,000,000 would be genetically diverse, but the ~50,000 in 1650 matter a lot more than the ~1,000,000 in 1790. The rate of mutation accumulation is pretty slow, so a population bottleneck or subsample has a huge long-term effect.

In fact, as you probably know one of the biggest determinants of genetic variation in New England whites of 1790 is the bottleneck that they share with all other non-Africans that dates to 50,000 years or more before 1790!

And these are just the coarse demographic considerations on the broader population/historical scale. In any normal random-mating human population, there’s some reproductive variance by chance (usually it is modeled as a Poisson distribution, with mean and variance being the same, though from what I have read the variance in mammals is usually greater than the mean).

Some people have more children, and some people have fewer children. That means that there is a census population, and a breeding population, and the breeding population is invariably smaller than the census population. Some individuals don’t reproduce to the next generation, obviously. But there are also cases where some individuals have large numbers of surviving offspring, while others have only a few.

To make it concrete I plotted the distribution of the number of children of women older than 50 years of age from the year 2000 and later in the General Social Survey (GSS). You can see that the most common number is two, but there are a fair number with three. Only about 10% of women 50 years and older have no children in the GSS.

But the curious thing is that if you weight the number by the proportion, you notice that women who have three children may not be as common as women who have two children, but they are contributing more children to the next generation than women who have the more typical two children. And, though the number of women who have five or more children is only 11% of the sample, as opposed to 14% who have one child, they contribute nearly five times as many children as those with one child to the next generation (women with six children alone contribute more than women with one child).
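The weighting described above is simple enough to sketch. The shares below are not the GSS numbers: only the three figures quoted in the text (about 10% childless, 14% with one child, 11% with five or more) are taken from this post, and every other value is a made-up placeholder chosen so the shares sum to 100%.

```python
# Hypothetical shares of women by completed family size (percent of sample).
# Only three constraints come from the text: ~10% childless, 14% with one
# child, and 11% with five or more. Every other number is made up.
share_pct = {0: 10, 1: 14, 2: 30, 3: 22, 4: 13, 5: 5, 6: 3.5, 7: 1.5, 8: 1}


def children_contributed(shares):
    """Children per 100 women contributed by each family-size class."""
    return {k: k * pct for k, pct in shares.items()}


contrib = children_contributed(share_pct)
total = sum(contrib.values())
for k, c in contrib.items():
    print(f"{k} children: {share_pct[k]:>4}% of women, {100 * c / total:4.1f}% of children")

# Compare the combined contribution of large families to one-child families.
five_plus = sum(c for k, c in contrib.items() if k >= 5)
print(f"5+ vs 1 child: {five_plus / contrib[1]:.1f}x")
```

With these placeholder shares the women with three children out-contribute the more numerous women with two, and the 11% with five or more contribute several times as many children as the 14% with one, which is the qualitative pattern in the GSS plot.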

Basically, not all the genetic variation in a given generation is created equal. Some people will contribute more to the next generation, and that has a homogenizing effect (there are models of mutation/selection/drift which establish equilibrium values of variation in a stationary state).

I’m revisiting all of this for two reasons. First, in Who We Are And How We Got Here David Reich talks about a long period of a shared population bottleneck for “Out of Africa” (all non-Africans) groups before the primary expansion ~60,000 years ago. Second, in my conversation with Matt Hahn, he was very skeptical of drawing any correspondence between effective population and some inferred census size. In hindsight I think part of it is that in most organisms census quotes are more an art than science. Not so with humans.

This made me look more into the literature for humans again. Recently Browning et al. published Ancestry-specific recent effective population size in the Americas. It’s a great paper. Basically, it uses identity by descent tracts of different ancestry to tease apart the distinctive pre-admixture effective population sizes. If you take an admixed population and assume that it was a single population random-mating indefinitely, and then work backward in time, you’re probably going to produce rather strange effective population sizes (if the two groups are about the same genetic diversity beforehand, they’ll probably show an inflated effective population, because you are assuming the two groups were a big random-mating population long before they were randomly mating!).

There are many ways to infer effective population, and the identity by descent method seems reasonable for recent time periods. And one thing about recent population size estimates for humans is that you have reasonable census estimates (you don’t just check with simulations):

Our simulations showed that biased sampling of a structured population results in underestimation of most recent effective population size. When we compare the estimated current effective sizes of HCHS/SOL country-of-origin populations to World Bank population sizes (accessed via Google Public Data Explorer) from 1995 (when the average age of the sampled individuals was around 25), we find that the ratio of current estimated effective size to 1995 population size ranges from approximately 1/60 (Ecuador) to approximately 1/4 (Cuba), with typical values around 1/10. Although estimates of effective size in the most recent generations are affected by these issues, our simulations also showed that less recent generations are not affected. Thus our estimates are useful for learning about the effective population sizes at and before admixture.

The structured part is important. For example, the paper On the importance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral population size inference? explores how structured models of gene-flow might be confused when genomic inferences assume a panmictic population. Last year a paper in PNAS, Early history of Neanderthals and Denisovans, suggested that Neanderthals were characterized by a highly structured meta-population, and that the low effective population sizes inferred from sampled genomes in this group of humans reflect this, rather than a genuinely low census size.

Browning et al. focused on recent population size inferences. I was curious about these inferences because we can compare them to real census sizes. From this I think I can tune my intuition at least to the possibility that the census size of a random-mating population is not likely to be two orders of magnitude above the inferred effective population size. Conversely, the rough mammalian value of an effective population size of ~1/3 the census size seems to be a ceiling. Population structure and bottlenecks aside, humans seem to have enough basal reproductive skew that effective population size is less than half of the census size.

To focus on ancient population growth (or lack thereof), I reread Inferring human population size and separation history from multiple genome sequences (Schiffels et al. 2014), Exploring Population Size Changes Using SNP Frequency Spectra (Liu et al. 2015) and Neutral genomic regions refine models of recent rapid human population growth (Gazave et al. 2014). The first two papers seem to suggest an “Out of Africa” population bottleneck that’s pretty long, with an effective population that’s somewhat lower than 5,000 individuals. In contrast, the last paper seems to have a sharp bottleneck of 200 individuals.

Remember, different models can produce the same empirical patterns in the genome. You can reduce genetic diversity by a modest, but long, bottleneck. Or, through a very sharp short bottleneck.
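A minimal sketch of why the two scenarios are hard to tell apart from diversity alone, using the standard Wright-Fisher expectation that heterozygosity decays by a factor of (1 - 1/(2Ne)) per generation. The effective sizes of 5,000 and 200 are the ones quoted above; the 25-year generation time, the 40,000-year duration for the long bottleneck, and the 1,600-year duration for the sharp one are my assumptions, picked so the two cases line up.

```python
def heterozygosity_retained(ne, generations):
    """Expected fraction of heterozygosity left after drift at constant Ne.

    Standard Wright-Fisher result: H_t / H_0 = (1 - 1/(2*Ne)) ** t.
    Mutation and migration are ignored.
    """
    return (1 - 1 / (2 * ne)) ** generations


GENERATION_YEARS = 25  # assumed generation time

# Long, modest bottleneck: Ne = 5,000 for 40,000 years (1,600 generations).
long_modest = heterozygosity_retained(5_000, 40_000 // GENERATION_YEARS)
# Short, sharp bottleneck: Ne = 200 for 1,600 years (64 generations).
short_sharp = heterozygosity_retained(200, 1_600 // GENERATION_YEARS)

print(f"long and modest: {long_modest:.3f}")
print(f"short and sharp: {short_sharp:.3f}")
```

Both scenarios retain roughly 85% of the starting heterozygosity, because what matters to a first approximation is the ratio of generations to effective size. Distinguishing them requires other signals, such as the site frequency spectrum or the distribution of coalescence times.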

In Who We Are and How We Got Here David Reich definitely leans toward a long, but more modest, bottleneck. For anthropological and archaeological reasons this seems more plausible now than it did ten years ago.

But perhaps it makes more sense now that we have more ancient DNA and a more elaborated model of human history seen through the lens of population genetics. In Schlebusch and Jakobsson’s Tales of Human Migration, Admixture, and Selection in Africa the authors come out and say “For our species’ deep history in Africa, both paleoanthropological and genetic evidence increasingly point to a multiregional origin of AMHs [anatomically modern humans] in Africa.”

They’re only saying what I hear other people talking about.

Instead of the “Out of Africa bottleneck” being defining for our species, it’s only a phenomenon which is important for peoples outside of Sub-Saharan Africa. Arguably for the majority of the existence of our species something closer to multi-regionalism was operative within modern humans.

In fact, isn’t that what the new ancient DNA shows? Pulses of admixture and gene-flow between distinct groups? Arguably multiregionalism might be the answer to our origins, but also characterize many of the dynamics after the “Out of Africa” event.

In any case, the best evidence now points to the likelihood that modern human lineages began to diversify and diverge before 200,000 years ago. Conversely, most of the ancestry of modern humans outside of Africa dates to an expansion around ~60,000 years before the present (ancient DNA and archaeology seem to agree here).

This is probably right before the Neanderthal admixture event with non-African humans, at least the modern lineages we have around today. But, it turns out it does not define the point when non-African humans diverged from the ancestral African population. Another group, “Basal Eurasians” (who may not have been Eurasian at all), diverged before the expansion of all eastern non-Africans, Oceanians, as well as the ancestors of Pleistocene Europeans and Siberians. It does not seem that Basal Eurasians had any Neanderthal admixture. Basal Eurasian ancestry is substantial in the Middle East today (although lower than 50%), and non-trivial across broad swaths of Europe and South Asia, due to the expansion of farming. They seem to have been well mixed in places like North Africa with other Eurasian groups ~15,000 years ago. Presumably that was a “back to Africa” migration, since these people had Neanderthal ancestry.

All of this leads to the conclusion that the ancestors of Basal Eurasians/non-Africans must have gone through their shared bottleneck well before ~60,000 years before the present. And, it may have happened on the African continent. So with that, I’ll quote Schiffels et al.:

This comparison reveals that no clean split can explain the inferred progressive decline of relative cross coalescence rate. In particular, the early beginning of the drop would be consistent with an initial formation of distinct populations prior to 150kya, while the late end of the decline would be consistent with a final split around 50kya. This suggests a long period of partial divergence with ongoing genetic exchange between Yoruban and Non-African ancestors that began beyond 150kya, with population structure within Africa, and lasted for over 100,000 years, with a median point around 60-80kya at which time there was still substantial genetic exchange, with half the coalescences between populations and half within (see Discussion). We also observe that the rate of genetic divergence is not uniform but can be roughly divided into two phases. First, up until about 100kya, the two populations separated more slowly, while after 100kya genetic exchange dropped faster.

David Reich’s group, and others, now posit the existence of a “Basal Human” population that mixed into West Africans, who can be modeled as primarily proto-East African (without Eurasian admixture), as well as this ancient outgroup. This means that estimates of divergences with non-Africans from something like MSMC may generate a composite if proto-East Africans are closer to the ancestors of non-Africans, which seems likely. One likely model is that the “Out of Africa” population emerged out of the northern edge of this proto-East African distribution of modern humans over 100,000 years ago (but after groups like the Khoisan and Basal Humans had already diverged).

Looking at Schiffels et al., they seem to posit lower divergence times than seems likely to me. Is that perhaps due to unaccounted-for admixture in lineages which fuses together groups which were earlier distinct?

In any case, with details about the divergence dates set aside, the MSMC results are actually in line with a new congealing consensus. Deep structure within Africa, but gene-flow between distinct populations, for at least ~100,000 years (possibly more). This is the period when population structure was quite fluid and indistinct along the East Africa continuum out of which non-Africans emerged.

Also, the archaeological evidence is now strongly suggestive of modern humans in places like Southeast Asia over 10,000 years before the wave which led to the ancestry of most extant populations. In fact, we know that this sort of early migration with no descendants isn’t abnormal. The first modern humans in Europe left no descendants (at least in any appreciable quantity). And the Altai Neanderthal seems to have modern-like admixture that dates to ~100,000 years before the present.

With all the evidence that modern humans were present in Africa, and expansively so, for hundreds of thousands of years, it seems unlikely that they never mixed with “archaic” Eurasian lineages (and vice versa). In fact, as we obtain more and more Neanderthal and Denisovan genomes perhaps we’ll find that a rapid expansion like the one that occurred ~60,000 years ago across Eurasia and Oceania happened before, out of and/or into Africa.

Looping back to the effective population issue, the effective population of modern non-Africans seems to have been below ~5,000 for a while. There was minimal gene-flow with other populations for many generations. Reich has a schematic of 40,000 years between 90,000 and 50,000 BP in Who We Are and How We Got Here. But that’s obviously just a ballpark figure. I have a hard time believing that the census size was around 500,000. The world population 10,000 years ago is usually estimated to be 1 to 10 million. Human populations were probably much larger at the end of the Pleistocene than 100,000 years ago. But a figure of 10% effective would give 50,000, which seems a reasonable number, especially with the likelihood that we’re talking about many tribes over a wide ecological zone. Meta-population dynamics of extinction and resettlement in inclement periods probably drove down the effective population.
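The back-of-the-envelope arithmetic here is just the effective size divided by an assumed ratio of effective to census size. A small sketch, where the three ratios are the rough values discussed in this post (two orders of magnitude, the typical ~1/10 from Browning et al., and the ~1/3 mammalian ceiling), and the function name is mine:

```python
def implied_census(ne, ne_to_census_ratio):
    """Census size implied by an effective size and an assumed Ne/N ratio."""
    return ne / ne_to_census_ratio


NE = 5_000  # rough upper bound on the non-African bottleneck Ne from the text

for label, ratio in [("1/100", 1 / 100), ("1/10", 1 / 10), ("1/3", 1 / 3)]:
    print(f"Ne/N = {label}: census ~ {implied_census(NE, ratio):,.0f}")
```

The 1/100 ratio is what gives the implausible 500,000, the 1/10 ratio gives the 50,000 I find reasonable, and the 1/3 ceiling gives a floor of about 15,000 people.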

The separation seems to be distinct from the older multiregional phase. What could explain it? The existence of the Sahara, and periods of extreme desertification seems the most likely candidate. I can’t say much with any credibility because I don’t know the archaeology and paleoclimate literature, but before domesticated animals, it was probably difficult for hunter-gatherers to make a go of it in the deep Sahara during the driest phases.

If I had to bet, the Eemian interglacial, 130 to 115 thousand years ago, is when I would assume there was:

  1. Lots of gene flow across the Sahara, perhaps in both directions
  2. A major population expansion of humans, of all sorts

This gives plenty of time for a wave of modern humans to push east, probably going through milder climates, rather than expanding north into Neanderthal or Denisovan territory. Eventually, some group must have mixed with the ancestors of the Altai Neanderthals. It seems likely that a cold and dry spell after the Eemian would have favored the well-adapted Eurasian groups, and modern populations would have withdrawn into refugia. The brutally expanding Sahara would have divided the majority of modern humans, who existed in the meta-populations to the south that dated back hundreds of thousands of years, from the groups on the northern fringe.

One can imagine that large numbers of modern humans were either absorbed or went extinct with the expansion of Neanderthals and other archaics. Though Neanderthals and Denisovans were interfertile with moderns, the lineages were still distinct enough that it looks like there was some hybrid breakdown. Just as modern humans seem to have purged many Neanderthal alleles from our genome, the opposite dynamic was probably at work.

There was clearly some structure in the relict modern human group that was separated from the African populations. Basal Eurasians did not mix with Neanderthals, but the ancestors of all other non-African humans did. Though one has to be careful about such geographical inferences, that suggests to me that the range of modern humans in the period between 60,000 and 80,000 years ago extended further back into pockets of northeast Africa, where no contact with Neanderthals would have occurred. Perhaps, in the end, we’ll end up thinking that the Basal Eurasians in some ways were a lot more like Africans south of the Sahara, as they didn’t undergo the massive range expansion of other populations during the Upper Paleolithic.

I’ll end with some predictions.

  • Ancient DNA of proto-moderns and archaics in eastern Eurasia dated to between 50,000 and 100,000 years BP will be analyzed at some point and will exhibit a fair amount of admixture. That is, the Altai Neanderthal was not exceptional, and probably relatively attenuated. I’m moderately confident of this.
  • The pre-60,000 year eastern Eurasians will be found to have left some of their genes in modern eastern Eurasians, especially in Southeast Asia and Oceania. Probably in the 1-10% range. I’m moderately confident of this.
  • The Denisovan ancestry in Oceanians is mediated by a “first wave” group “Out of Africa.” I have low confidence in this, but I really wouldn’t be surprised either way. My confidence in my confidence is low!
  • At some point we’ll obtain sequence from a 1 million year old hominin somewhere in the colder/drier climes of Eurasia (we have a 900,000 year old horse genome). This will predate Neanderthal/Denisovans. We will see from this that some of these super-archaic populations left their heritage in later archaics, and therefore our own lineage. I’m rather confident of this.
  • By hook or crook we’ll get more ancient genomes out of African samples, and confirm a lot of ancient population structure, as well as some gene-flow from archaic non-modern lineages. Probably around the same range you see in non-Africans (though some of the gene-flow may also apply to non-Africans, since they didn’t separate from eastern Africans until 100,000 to 150,000 years ago). I’m rather confident of this.
  • H. naledi will return sequence at some point. I’m very confident of this. I don’t have inside knowledge, but I know they’re going to keep trying. They are getting more samples.
  • H. naledi will be found to have contributed ancestry to modern southern African populations. I’m moderately confident of this.
  • At some point ancient genomes from the Americas will confirm the existence of an earlier group which was only distantly related to modern New World populations descended mostly from Siberians. There is indirect evidence of this group from South American populations, but we’ll get individuals who are much more distinct at some point in the future. I’m moderately confident of this.
  • Basal Eurasians will be found to have inhabited the Southern Arabia/Persian Gulf region. But the “pure” population will be found to have disappeared around the Last Glacial Maximum ~20,000 years ago, as the human populations to the north moved south, and the Near East’s southern fringe became drier. I’m moderately confident of this.

Selection is going on with SLC24A5….

The ancestral allele for rs1426654 at SLC24A5

On this week’s episode of The Insight, I talked to Matt Hahn about why he wrote his new book, his opinions on “Neutral Theory”, and what he thought about David Reich’s op-ed. Without Spencer’s supervision, I have to admit that I think I lost control and just went “full nerd”. Next week we’re dropping Carl Zimmer’s podcast, so rest assured that the world will come back into balance, and The Insight will be more welcoming to civilians!

At a certain point, Matt and I were discussing allele frequency differences between populations and he came close to saying all such differences between human populations were of modest frequency in relation to pairwise comparisons (e.g., 40% vs. 49%). Obviously, this is not true, because there is always the huge difference in SLC24A5 at SNP rs1426654 (as well as at Duffy and a few other loci). A G-to-A substitution converts the codon from alanine to threonine.

You have heard of this locus because of a paper in 2005, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. This paper came out in December of 2005, a few years after Armand Leroi wrote in Mutants that geneticists still hadn’t come to grips with normal variation in pigmentation in humans. The above publication was the first step in solving this question in the years between 2005 and 2010, at least to a good first approximation.

In the sample in the paper they explain 25-40% of the variation in melanin index between Africans and Europeans with this single genetic change (for various technical reasons it’s probably not that big an effect, though it is still big, and probably the largest effect quantitative trait locus for pigmentation in the human genome).

It turns out that this mutation, the derived variant, is almost disjoint in frequency between Europeans and Africans. That is, ~100% of Africans carry the ancestral G base while ~0% of Europeans carry the G base (as opposed to the A base). Interestingly, East Asians carry the G base at ~100% frequency as well. If you genotype an anonymous individual and their genotype is AG or GG at rs1426654 then it is highly likely that that individual is not a European.

To give an example of how this works, in 2013 I stumbled onto a paper which genotyped 101 Europeans from Cape Town in South Africa. That means there are 202 alleles (two per person) at rs1426654. Of these, 5 of the alleles were ancestral (G). From this, I immediately concluded that it was highly likely that the Afrikaner people of South Africa have non-European ancestry. I came to this conclusion because 5 copies of the ancestral allele, ~2.5%, is shockingly high for a European population, and it was long surmised that the Afrikaner people had some non-European (Khoisan, Bantu, South and Southeast Asian) ancestry. The majority of the whites sampled in Cape Town could have been Afrikaners (I’ve confirmed this with genome-wide data).

To get a sense of where my intuitions come from you need to look at allele counts within populations. Using 1000 Genomes, Yale’s Alfred, and Gnomad I assembled a representative list to give you a sense of what’s going on. Of the 126,548 counted alleles in Gnomad for individuals of European (non-Finnish) descent, 486, or 0.38% of the total, are ancestral.

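| Population | Ancestral alleles | Total alleles | Freq |
|---|---|---|---|
| Greeks (Thrace, Athens) | 0 | 184 | 0% |
| Pandit Brahmin, Kashmir | 0 | 40 | 0% |
| European (Non-Finnish) | 486 | 126548 | 0% |
| Ashkenazi Jewish | 47 | 10148 | 0% |
| European (Finnish) | 329 | 25790 | 1% |
| Iraq Kurds | 1 | 68 | 2% |
| Yemenite Jews | 2 | 78 | 3% |
| Havyaka Brahmin, Karnataka | 2 | 62 | 3% |
| Tunisian Berber | 6 | 110 | 5% |
| Uttar Pradesh Brahmin | 4 | 34 | 12% |
| Pandit Brahmin, Haryana | 13 | 78 | 17% |
| South Asian | 6921 | 30774 | 22% |
| Sri Lanka Tamil | 105 | 204 | 51% |
| Adi-Dravida, Karnataka | 21 | 34 | 62% |
| Masai Kenya | 192 | 286 | 67% |
| Austro-Asiatic tribe, Odisha | 43 | 56 | 77% |
| Luhya Kenya | 155 | 188 | 82% |
| Mende Sierra Leone | 155 | 170 | 91% |
| Austro-Asiatic tribe, Odisha | 92 | 96 | 96% |
| Esan Nigeria | 193 | 198 | 97% |
| Yoruba Nigeria | 213 | 216 | 99% |
| East Asian | 18728 | 18856 | 99% |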
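The intuition behind the Cape Town inference can be checked with a rough binomial calculation: if the sample had the same ancestral allele frequency as the Gnomad non-Finnish Europeans (486 of 126,548), how often would 202 alleles include five or more ancestral copies? This is only a sketch, and it assumes the 202 alleles are independent draws (no close relatives in the sample, no substructure), which is my simplification and not something the original paper states.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


# Baseline ancestral (G) allele frequency: Gnomad non-Finnish Europeans.
p_european = 486 / 126_548

# Cape Town sample: 101 people, 202 alleles, 5 ancestral.
n_alleles, n_ancestral = 202, 5

expected = n_alleles * p_european
p_value = binomial_tail(n_alleles, n_ancestral, p_european)
print(f"expected ancestral alleles: {expected:.2f}")
print(f"P(at least {n_ancestral}) = {p_value:.4f}")
```

Under the European baseline you would expect less than one ancestral allele in a sample this size, and the chance of seeing five or more comes out well below 1%, so admixture is the more natural explanation.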

Last fall Crawford et al. reported that rs1426654 is embedded in a haplotype that’s ~30,000 years old. Additionally, they contend that its presence within Africa is probably no earlier than the Holocene, the last ~12,000 years. Martin et al. report that KhoeSan exhibit higher frequencies of the derived allele because of Eurasian back-migration and then in situ natural selection. Of course, this does not apply to all Eurasians. Most East Asians have the ancestral variant of rs1426654.

This leaves us with West Eurasians, North Africans, and South Asians. I’ve put a few South Asian populations in the list to show you that there is a wide range of variation in allele frequencies. The South Asians in Gnomad, probably mostly Diaspora, have the ancestral variant at only 22%. In contrast, Austro-Asiatic speaking South Asian groups from northeast India have very high frequencies of the ancestral variant. There has clearly been in situ selection in some South Asian populations for the derived variant at rs1426654. Ancestral North Indian groups (ANI) probably brought the derived allele, and Ancient Ancestral South Indians (AASI) probably tended to carry the ancestral allele, like East Eurasians and Oceanians. Additionally, South Asian populations often have high drift. Some of the differences in the Alfred data seem to be impacted by this.

The situation in the Middle East, North Africa, and Europe is different. In the Middle East and North Africa, the ancestral variant is present at frequencies around 1-10%. Some of this can probably be attributed to admixture from Africa and in some cases South and East Asian populations. Ancient DNA from the Middle East and North Africa presents a mixed picture. The farmers who brought the Neolithic to Europe carried the derived variant at rs1426654, and some of the ancient Middle Eastern samples carry it. But not all. The recent Iberomaurusian samples which date to ~15,000 years ago don’t seem to have had the derived variant.

Though the hunter-gatherers of Western Europe only seem to have carried the ancestral variant at rs1426654, the hunter-gatherers of Scandinavia and Eastern Europe did exhibit the derived variant in some frequency, though lower than modern Europeans.

My own hunch is that the original genetic background against which the A mutation at rs1426654 emerged will be found increasing in frequency first somewhere in the Near East after the Last Glacial Maximum. But no ancient population shows the frequencies of the derived variant we see in modern Europeans. In isolated populations subject to drift it wouldn’t be surprising if the ancestral variant decreased to ~0%. But in European populations today in the vast majority of cases the ancestral variant is far lower than 1%, even though we know that within the last 10,000 years the ancestral population streams had several groups with very high frequencies of that ancestral variant. The low frequency is not due to a freakish bottleneck all across Europe. It has to be selection.

One thing I have pointed out is that this very low frequency of the ancestral variant indicates that the advantage at rs1426654 for the A allele in Europe is additive. In Northern Europe, the frequency of the derived variant that confers lactase persistence tops out at ~90 percent. We know this region of the genome has been targeted by natural selection, but lactase persistence also happens to be expressed dominantly. That is, one copy of the mutant allele confers the phenotype. Once you hit ~90 percent of the derived variant only ~1 percent of the population would be lactose intolerant homozygotes (two copies of the ancestral variant). In the Gnomad sample of 60,000+ Europeans, they count three homozygote genotypes for the ancestral variant at rs1426654. That’s 0.005%.
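The homozygote figures follow from Hardy-Weinberg proportions, where the expected fraction of homozygotes for an allele at frequency q is q squared. A short sketch, assuming random mating within each population (an idealization) and reusing the Gnomad non-Finnish European allele counts quoted earlier in this post:

```python
def homozygote_fraction(allele_freq):
    """Hardy-Weinberg expected fraction carrying two copies of an allele."""
    return allele_freq ** 2


# Lactase persistence: derived allele at ~90%, so ancestral allele at ~10%.
print(f"lactose intolerant homozygotes: {homozygote_fraction(0.10):.1%}")

# rs1426654 in Gnomad non-Finnish Europeans: 486 ancestral of 126,548 alleles.
n_alleles = 126_548
q = 486 / n_alleles
n_people = n_alleles // 2
print(f"expected GG homozygotes: {homozygote_fraction(q) * n_people:.2f} of {n_people:,}")
```

The contrast is the point: a dominant trait can stall with the recessive allele at ~10% because only ~1% of people are exposed to selection against it, whereas at rs1426654 the ancestral allele has been pushed so low that only about one homozygote is expected in the entire sample, the same order of magnitude as the three observed.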

Something is happening at rs1426654. Selection. But why? No one really has any explanation beyond the obvious.