Population Pairwise Fst on 250,000 SNPs

People routinely ask me about a place to find pairwise Fst values. I have a dataset with 250,000 SNPs and 200 populations, and a script using plink that generates pairwise differences across populations. Here are two files with the results:

A file with the Fst values between populations in rows

A file with the Fst values between populations as a matrix
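If you want to move between the two layouts yourself, here is a minimal Python sketch. It assumes the row file is tab-separated with three columns (population 1, population 2, Fst); that layout, the function name, and the example values are my assumptions for illustration, not a description of the actual files.

```python
import csv
import io


def rows_to_matrix(rows_text):
    """Turn 'pop1<TAB>pop2<TAB>fst' lines into a symmetric matrix.

    Returns (populations, matrix), where matrix[i][j] is the Fst between
    populations[i] and populations[j]; the diagonal is 0.0 and pairs that
    are missing from the input stay None.
    """
    pairs = {}
    pops = set()
    for pop1, pop2, fst in csv.reader(io.StringIO(rows_text), delimiter="\t"):
        pairs[(pop1, pop2)] = float(fst)
        pops.update((pop1, pop2))

    populations = sorted(pops)
    index = {name: i for i, name in enumerate(populations)}
    size = len(populations)
    matrix = [[0.0 if i == j else None for j in range(size)] for i in range(size)]
    for (pop1, pop2), fst in pairs.items():
        i, j = index[pop1], index[pop2]
        matrix[i][j] = matrix[j][i] = fst  # Fst is symmetric
    return populations, matrix


# Made-up values, purely to show the shape of the output.
example = "A\tB\t0.010\nA\tC\t0.120\nB\tC\t0.115\n"
names, fst_matrix = rows_to_matrix(example)
print(names)
for name, row in zip(names, fst_matrix):
    print(name, row)
```

With 200 populations the row layout has 19,900 unique pairs, so the matrix form is usually the easier one to feed into clustering or MDS.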

Funnel Beaker, Corded Ware, Únětice, oh my!


Since David hasn’t mentioned it, I’m going to post some notes on Dynamic changes in genomic and social structures in third millennium BCE central Europe. This is a big deal because there’s a huge data-set spanning the Neolithic (older than 3000 BC) to the Bronze Age in Bohemia, looking at Globular Amphora, Corded Ware, Bell Beaker, and Únětice. Since I’m not too familiar with European archaeology, the most surprising thing that jumped out at me is that there was structure and variability in the nature and origins of the Neolithic societies in the region. The Bohemian Funnel Beaker populations seem to have been migrants from the west, for example.

The two big takeaways:

  1. Confirms serial admixture that tends to be female-mediated from Neolithic (though some “pure” steppe women also migrated)
  2. The Corded Ware and successor cultures in the region seem to have an affinity for an unsampled population to the north of the Yamnaya zone, in the forest-steppe

The first part is highlighted by the fact that several individuals with ~0% steppe ancestry are buried early on as “Corded Ware.” These were clearly individuals who were culturally assimilated, but their ancestry was totally different. Some of these women in particular seem to have been non-local as well, though from Neolithic societies. This suggests, unsurprisingly, that the ethnogenesis of Indo-European cultures was synthetic and complex. The figure to the top/right illustrates the trend whereby the earliest Corded Ware population exhibited far greater genetic distances between individuals than is to be found in modern European pairwise comparisons. This is part of the broader trend that over the recent past there’s been a massive worldwide panmixia.

Second, the Corded Ware has always been an awkward fit with a simple Yamnaya+Neolithic admixture. The stylized model, which I’ve repeated for simplicity, is that the Yamnaya moved west and mixed with the locals. Kristian Kristiansen explicitly refers to the Corded Ware as basically Yamnaya when I pushed him on this, and who am I to disagree with him? I think the key distinction here is that archaeologically the Corded Ware seems so much like European adaptations of the Yamnaya cultural toolkit…but genetically there are subtle indications of difference. Basically, the authors argue, plausibly, that the Corded Ware is not derived from the Yamnaya as such (their Y chromosomes do not match anyway), but a Yamnaya-adjacent population in the forest-steppe. This region seems to have also contributed a second pulse of migration which resulted in increased northeastern affinity, and a higher fraction of R1a lineages.

When it comes to the Y chromosomes, the authors conclude that inter-group competition was intense, and resulted in serial replacements of paternal lineages. The reproductive fitness gain they estimate for the elite lineages is 15% per generation, which is a very large number in evolutionary genetics (2% selection coefficients are large in this field). The Bell Beaker group seems to have been reflux from the west, and it itself was replaced later on by the Únětice.
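To get a feel for how different a 15% advantage is from a 2% one, here is a small sketch of deterministic haploid selection, which is a reasonable caricature of a Y-chromosome lineage if you ignore drift. The starting and ending frequencies (1% and 99%) are my own illustrative choices, not numbers from the paper.

```python
def generations_to_reach(start, target, s):
    """Generations for a lineage with relative fitness 1 + s to rise from
    `start` to `target` frequency under deterministic haploid selection."""
    p = start
    generations = 0
    while p < target:
        # Standard haploid recursion: p' = p(1 + s) / (1 + p*s)
        p = p * (1 + s) / (1 + p * s)
        generations += 1
    return generations


for s in (0.02, 0.15):
    n = generations_to_reach(0.01, 0.99, s)
    print(f"s = {s:.2f}: about {n} generations from 1% to 99%")
```

Under these assumptions the 15% lineage takes over in a few dozen generations, while the 2% lineage needs several hundred. That is the difference between a replacement visible within one archaeological culture and one that would take most of the Holocene.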

One of the less supported, though still useful, models for the Corded Ware is a genetic influx from Pitted Ware samples, the mostly “EHG” hunter-gatherer group from Sweden. I think this supports the proposition that a group of early Yamnaya penetrated the forest-steppe, and assimilated hunter-gatherers in the southern portions of the taiga. If my read of the archaeology is correct, the overwhelmingly dominant culture of these synthetic groups was Yamnaya-like.

Finally, I have to wonder about these peoples’ association with and relationship to the Fatyanovo culture of western Russia, right in the forest-steppe. These groups seem to have been proto-Indo-Iranian judging by their R1a1a-Z93. One of the individuals in these data was clearly Z282, which is so common among Slavs (and Europe).

Complex history of archaic ancestry

On the Apportionment of Archaic Human Diversity:

The apportionment of human genetic diversity within and between populations has been measured to understand human relatedness and demographic history. Likewise, the distribution of archaic ancestry in modern populations can be leveraged to better understand the interaction between our species and its archaic relatives, and the impact of natural selection on archaic segments of the human genome. Resolving these interactions can be difficult, as archaic variants in modern populations have also been shaped by genetic drift, bottlenecks, and gene flow. Here, we investigate the apportionment of archaic variation in Eurasian populations. We find that archaic genome coverage at the individual- and population-level present unique patterns in modern human population: South Asians have an elevated count of population-unique archaic SNPs, and Europeans and East Asians have a higher degree of archaic SNP sharing, indicating that population demography and archaic admixture events had distinct effects in these populations. We confirm previous observations that East Asians have more Neanderthal ancestry than Europeans at an individual level, but surprisingly Europeans have more Neandertal ancestry at a population level. In comparing these results to our simulated models, we conclude that these patterns likely reflect a complex series of interactions between modern humans and archaic populations.

The method is pretty neat. Read this closely. Here are some takeaways:

– European Neanderthal ancestry is lower than East Asian, but more diverse

– South Asians clearly have different Denisovan ancestry than East Asians

– Population structure matters…South Asian rare allele frequency is due to admixture between divergent groups

Basically, Neanderthal and Denisovan admixture is more complex than our simple stylized models.

Natural selection caught in the act

Analysis of genomic DNA from medieval plague victims suggests long-term effect of Yersinia pestis on human immunity genes:

Pathogens and associated outbreaks of infectious disease exert selective pressure on human populations, and any changes in allele frequencies that result may be especially evident for genes involved in immunity. In this regard, the 1346-1353 Yersinia pestis-caused Black Death pandemic, with continued plague outbreaks spanning several hundred years, is one of the most devastating recorded in human history. To investigate the potential impact of Y. pestis on human immunity genes we extracted DNA from 36 plague victims buried in a mass grave in Ellwangen, Germany in the 16th century. We targeted 488 immune-related genes, including HLA, using a novel in-solution hybridization capture approach. In comparison with 50 modern native inhabitants of Ellwangen, we find differences in allele frequencies for variants of the innate immunity proteins Ficolin-2 and NLRP14 at sites involved in determining specificity. We also observed that HLA-DRB1*13 is more than twice as frequent in the modern population, whereas HLA-B alleles encoding an isoleucine at position 80 (I-80+), HLA C*06:02 and HLA-DPB1 alleles encoding histidine at position 9 are half as frequent in the modern population. Simulations show that natural selection has likely driven these allele frequency changes. Thus, our data suggests that allele frequencies of HLA genes involved in innate and adaptive immunity responsible for extracellular and intracellular responses to pathogenic bacteria, such as Y. pestis, could have been affected by the historical epidemics that occurred in Europe.

This isn’t surprising. But now that old DNA studies are getting cheap and mass-produced, I think people will be looking at changes in allele frequencies in the last 2,000 years a lot. More sophisticated methods for detecting natural selection either conclude or imply that sweeps are happening now, but this sort of study will confirm it (there’s evidence of natural selection in American Indians for obvious and unfortunate reasons).

All the Yamnaya horizon zone people looked the same

The above figure is from The Beaker Phenomenon and the Genomic Transformation of Northwest Europe. At the time I noted it because the Bell Beaker people who arrived ~2500 BC seem to have been darker than modern Britons. In particular, you can see that their frequencies are much lower at the blue/brown eye locus (HERC2/OCA2), and SLC45A2, where Europeans are 90% derived today and non-Europeans far less (less than 50% in the Middle East). In modern European populations, the Sardinians have the lowest fraction of the derived SLC45A2 SNP that I’ve seen, around 60%, with mainland Spaniards being at 80%, the rest of Southern Europe at 90%, and 95% in Northern Europe. The Bell Beakers look to be in the low 60% range.

These numbers came back to me when I was looking at some supplementary excel sheets from Genetic ancestry changes in Stone to Bronze Age transition in the East European plain. Here are the figures at these two SNPs for the Fatyanovo Culture of European Russia ~2500 BC:

OCA2/HERC2 – 50%
SLC45A2 – 62%

For the Sintashta culture from Russia/Urals ~2000 BC:

OCA2/HERC2 – 42%
SLC45A2 – 92%

For comparison, modern Estonians are 92% and 99% at these markers for the derived variant.

This reiterates something I’ve noticed in the data: Bronze Age Europeans were not as “fair” as modern Europeans. This is pretty evident in Northern Europe in particular, since these populations are so fair today. And Bell Beakers and Fatyanovo looked basically the same in terms of pigmentation despite being on opposite ends of the post/para-Corded Ware horizon. Curiously, the Sintashta, who descend in a straight line from Fatyanovo, seem to exhibit some selection on SLC45A2 (the sample size is pretty large).

Lewontin’s Paradox in the 21st century

Why do species get a thin slice of π? Revisiting Lewontin’s Paradox of Variation:

Under neutral theory, the level of polymorphism in an equilibrium population is expected to increase with population size. However, observed levels of diversity across metazoans vary only two orders of magnitude, while census population sizes (Nc) are expected to vary over several. This unexpectedly narrow range of diversity is a longstanding enigma in evolutionary genetics known as Lewontin’s Paradox of Variation (1974). Since Lewontin’s observation, it has been argued that selection constrains diversity across species, yet tests of this hypothesis seem to fall short of explaining the orders-of-magnitude reduction in diversity observed in nature. In this work, I revisit Lewontin’s Paradox and assess whether current models of linked selection are likely to constrain diversity to this extent. To quantify the discrepancy between pairwise diversity and census population sizes across species, I combine genetic data from 172 metazoan taxa with estimates of census sizes from geographic occurrence data and population densities estimated from body mass. Next, I fit the relationship between previously-published estimates of genomic diversity and these approximate census sizes to quantify Lewontin’s Paradox. While previous across-taxa population genetic studies have avoided accounting for phylogenetic non-independence, I use phylogenetic comparative methods to investigate the diversity census size relationship, estimate phylogenetic signal, and explore how diversity changes along the phylogeny. I consider whether the reduction in diversity predicted by models of recurrent hitchhiking and background selection could explain the observed pattern of diversity across species. Since the impact of linked selection is mediated by recombination map length, I also investigate how map lengths vary with census sizes. I find species with large census sizes have shorter map lengths, leading these species to experience greater reductions in diversity due to linked selection. 
Even after using high estimates of the strength of sweeps and background selection, I find linked selection likely cannot explain the shortfall between predicted and observed diversity levels across metazoan species. Furthermore, the predicted diversity under linked selection does not fit the observed diversity–census-size relationship, implying that processes other than background selection and recurrent hitchhiking must be limiting diversity.
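The neutral expectation behind the paradox fits in a few lines. In the textbook neutral model for a diploid population, expected pairwise diversity is π = 4Nμ, so diversity should scale linearly with population size. The mutation rate and the range of sizes below are illustrative assumptions of mine, not values from the preprint.

```python
import math


def expected_pi(n, mu):
    """Neutral equilibrium expectation for pairwise diversity in a diploid
    population of size n with per-site mutation rate mu: pi = 4 * n * mu."""
    return 4 * n * mu


MU = 1e-9  # assumed per-site, per-generation mutation rate (illustrative)
sizes = [10**k for k in range(4, 9)]  # hypothetical sizes: 10^4 .. 10^8

for n in sizes:
    print(f"N = {n:.0e}  expected pi = {expected_pi(n, MU):.1e}")

span = math.log10(expected_pi(sizes[-1], MU) / expected_pi(sizes[0], MU))
print(f"predicted diversity spans {span:.0f} orders of magnitude")
```

Every order of magnitude in N buys an order of magnitude in π under this model, which is exactly what the observed two-orders-of-magnitude range of diversity fails to show.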

Natural selection continues (in the Viking world)


Nature has published a new Viking genomics paper. This morning I didn’t even bother to check it out, as I had other things going on, and there’s been so much ancient DNA from Scandinavia that my thought was “what else could we learn?” Well, it turns out I should have checked it out. The sample size is large enough that it reinforces and nails home the important point that natural selection in many traits has been continuing across the world.

Population genomics of the Viking world:

The maritime expansion of Scandinavian populations during the Viking Age (about AD 750–1050) was a far-flung transformation in world history. Here we sequenced the genomes of 442 humans from archaeological sites across Europe and Greenland (to a median depth of about 1×) to understand the global influence of this expansion. We find the Viking period involved gene flow into Scandinavia from the south and east. We observe genetic structure within Scandinavia, with diversity hotspots in the south and restricted gene flow within Scandinavia. We find evidence for a major influx of Danish ancestry into England; a Swedish influx into the Baltic; and Norwegian influx into Ireland, Iceland and Greenland. Additionally, we see substantial ancestry from elsewhere in Europe entering Scandinavia during the Viking Age. Our ancient DNA analysis also revealed that a Viking expedition included close family members. By comparing with modern populations, we find that pigmentation-associated loci have undergone strong population differentiation during the past millennium, and trace positively selected loci—including the lactase-persistence allele of LCT and alleles of ANKA that are associated with the immune response—in detail. We conclude that the Viking diaspora was characterized by substantial transregional engagement: distinct populations influenced the genomic makeup of different regions of Europe, and Scandinavia experienced increased contact with the rest of the continent.

The phylogenetic patterns are not surprising at all. I’ve looked at enough Scandinavian genomes from Norway, Sweden, and Denmark to be able to intuitively figure out the sources of random genomes without a label as long as I know they’re Nordic. The Danes will be south-shifted, the Swedes will be Finn-shifted (unless they’re from the far south across from Denmark), while the Norwegians will be neither. Basically this massive ancient DNA transect just confirms that things such as geographic proximity matter, and that differential population size matters.

Gene flow from Denmark to Sweden, and from continental Europe into Denmark, is not surprising. This follows naturally from different population sizes, and after extensive Christianization of Denmark, the marriage networks of northern Germany and further south no doubt included Denmark. Perhaps of more interest is confirmation of reflux gene flow from the British Isles into Scandinavia. Some of these individuals may have been slaves, but others were likely people of mixed background, as was the norm in Iceland and Greenland, or even individuals who assimilated totally into Scandinavian culture through induction into warbands.

There are lots of details of phylogenomic note. For example, look in the supplements, and it seems that the “Picts” were pretty generic post-Bell Beaker people. Their “mystery” is somewhat solved? On the whole, most of the genomic variation of Northern Europe was established by the Bronze Age, but not all. On the margins, there are subtle and nuanced stories you can tell, and you need a sample size this large to tell that.

The most interesting aspect though is that this dataset confirms what many of us have suspected and seen in other results more tentatively: natural selection on complex traits is reshaping the human genome, in the past, and now. In 2016 Field et al. came out with a paper using pretty intense genomic methods to detect lots of sweeps in the European genome recently, and continuing. The method was persuasive, but the results were perplexing. I didn’t know if they were some strange artifact or not, and when I asked people in that lab at ASHG many of them weren’t sure either. Ancient DNA shows us that these were not artifacts or flukes, the allele frequencies have been changing over the last 2,000 years.

Last year I noticed that ancient DNA from the Baltic indicates that these people, the palest in the world by most measures, have gotten more lightly complected since the Iron Age. Noticeably so. If you look at the supplements of this paper the pigmentation loci don’t make it as clear. I think on the whole Vikings would not be visually distinctive from modern Scandinavians. But their statistical method makes it hard to refute that this ancient DNA transect is indicative of a reduction in the frequency of alleles associated with very dark hair in Scandinavia. The fact that this happened in both the western and eastern Baltic region with culturally distinctive people tells me that some underlying cultural or more likely environmental pressure was being applied.

And, it is clear we don’t know the whole story with lactase persistence. Denmark and southern Sweden have among the highest percentages in the world, and that’s clearly not a function of the deep past, but sweeps continuing down into the present.

Are Scandinavians exceptional? I doubt it. It’s just that the climate and concentration of researchers mean that there is a whole lot of study and analysis of many individuals across Holocene time periods. Rather, think of them as a “model organism.” Evolution isn’t done with our species, not by a long-shot, and though we can detect a lot of selection in the genome…there is very little clarity why the selection is occurring (i.e., what are humans adapting to?).*

* Most human population geneticists seem to be now coming to a consensus that there’s a lot of “soft sweeps” on “standing genetic variation.” Since a lot of these soft sweeps happen at a lot of genomic positions, strong selection for trait x is going to result in side effects on a lot of other traits. The “genetic correlation.”

It’s raining founder events

There’s a new preprint on bioRxiv that is very interesting, Reconstructing the history of founder events using genome-wide patterns of allele sharing across individuals:

…To learn about the frequency and evolutionary history of founder events, we introduce ASCEND (Allele Sharing Correlation for the Estimation of Non-equilibrium Demography), a flexible two-locus method to infer the age and strength of founder events. This method uses the correlation in allele sharing across the genome between pairs of individuals to recover signatures of past bottlenecks. By performing coalescent simulations, we show that ASCEND can reliably estimate the parameters of founder events under a range of demographic scenarios, with genotype or sequence data. We apply ASCEND to ~5,000 worldwide human samples (~3,500 present-day and ~1,500 ancient individuals), and ~1,000 domesticated dog samples. In both species, we find pervasive evidence of founder events in the recent past. In humans, over half of the populations surveyed in our study had evidence for a founder events in the past 10,000 years, associated with geographic isolation, modes of sustenance, and historical invasions and epidemics. We document that island populations have historically maintained lower population sizes than continental groups, ancient hunter-gatherers had stronger founder events than Neolithic Farmers or Steppe Pastoralists, and periods of epidemics such as smallpox were accompanied by major population crashes. Many present-day groups–including Central & South Americans, Oceanians and South Asians–have experienced founder events stronger than estimated in Ashkenazi Jews who have high rates of recessive diseases due to their history of founder events. In dogs, we uncovered extreme founder events in most groups, more than ten times stronger than the median strength of founder events in humans. These founder events occurred during the last 25 generations and are likely related to the establishment of dog breeds during Victorian times. 
Our results highlight a widespread history of founder events in humans and dogs, and provide insights about the demographic and cultural processes underlying these events.

This method is pretty cool because it scales and works on non-phased data (good luck phasing a lot of low-coverage ancient DNA!). Through simulation and comparison to earlier results, the authors show that ASCEND does a good job estimating:

1) the timing of a founder event

2) the intensity of a founding event
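Those two quantities map naturally onto the two parameters of an exponential decay curve. As a rough sketch (the functional form is my assumption for illustration, not code or a formula taken from the preprint), suppose the allele-sharing correlation at genetic distance d, in Morgans, behaves like A·exp(−2·d·t), where t is the age of the founder event in generations and A grows with its intensity. A log-linear least-squares fit then recovers both from a correlation-versus-distance curve:

```python
import math


def fit_decay(distances, correlations):
    """Least-squares fit of log(corr) = log(A) - 2 * t * d.

    Returns (A, t). Assumes every correlation is positive and that there is
    no constant offset, which real data would not guarantee.
    """
    logs = [math.log(c) for c in correlations]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_d
    return math.exp(intercept), -slope / 2


# Synthetic, noise-free curve: founder event 30 generations ago, amplitude 0.02.
true_a, true_t = 0.02, 30
d_grid = [i / 1000 for i in range(1, 201)]  # 0.1 cM to 20 cM, in Morgans
z = [true_a * math.exp(-2 * d * true_t) for d in d_grid]

est_a, est_t = fit_decay(d_grid, z)
print(f"amplitude ~ {est_a:.4f}, age ~ {est_t:.1f} generations")
```

The intuition is that recombination breaks up shared segments at a steady rate, so a recent founder event leaves correlation that persists over long genetic distances and an old one leaves correlation that dies off quickly. Real data are noisy and need a proper nonlinear fit with standard errors, which is what the released software is for.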

One of my hobby-horses is that Ashkenazi Jews aren’t really that inbred or bottlenecked a group. They’ve been extensively studied, so there’s a laser-like focus on their population and medical genetics. Importantly, they also have a recessive disease load, usually attributed to their endogamy and small effective population size. Studying Ashkenazi Jewish genetics is easy if you think of it in grant terms since there are diseases that are well known you can focus on.

But one of the results in this preprint, which aligns with other earlier published work, is that there are many groups far more homogeneous due to extreme founder events/endogamy than Ashkenazi Jews. Some of the outcomes are not surprising. Lots of South Asian groups seem to be extremely homogeneous due to endogamy and small founding populations, though today many of them number in the millions. The strong implication from these results is that they carry a lot of deleterious recessive allele load.

The other groups are not surprising. Islanders, hunter-gatherers in marginal habitats. Basically, populations artificially prevented from gene flow, or those subject to strong cultural barriers.

The method not only estimates the intensity of the founder events but also the period. Many of the results are totally explicable. Many Northern Europeans seem to have founding populations that go back to the Corded Ware expansion. The founding of the Basque dates to the Roman Empire. Why? I think a reasonable hypothesis is that for whatever reason this is when the ancient Aquitani emerged as an exclusive ethnocultural group, as opposed to Romanizing like their Iberian and Celtiberian neighbors.

Probably the most interesting result for me is one that is obvious in much of the data, but hasn’t been analyzed as thoroughly before: ancient European hunter-gatherers had very small effective populations due to narrow founder events. The question is: is this true in general for pre-agricultural people? Many anthropologists have argued that large agglomerations of sedentary populations were more common before the Holocene than we might think, and modern hunter-gatherers are biased samples (they occupy marginal territory).

As we obtain more ancient DNA that question will be answered in general. Over ten years ago Hawks et al. argued that large populations resulted in faster adaptation. Whatever details one might quibble with in their model, I think the results from ancient DNA raise the possibility of the greater relative efficacy of selection (due to weaker drift) and more population connectedness allowing for easier flow of beneficial alleles.

The software is already available. I’m going to take it for a test drive…

Selection for pigmentation loci…but not pigmentation?


About a year and a half ago at ASHG, I had a discussion with Dan Ju and Iain Mathieson about their work on ancient pigmentation. Or, more precisely, ancient pigmentation related genes. Now it’s out in a preprint, The evolution of skin pigmentation associated variation in West Eurasia:

…It is unclear whether selection has operated on all the genetic variation associated with skin pigmentation as opposed to just a small number of large-effect variants. Here, we address this question using ancient DNA from 1158 individuals from West Eurasia covering a period of 40,000 years combined with genome-wide association summary statistics from the UK Biobank. We find a robust signal of directional selection in ancient West Eurasians on skin pigmentation variants ascertained in the UK Biobank, but find this signal is driven mostly by a limited number of large-effect variants. Consistent with this observation, we find that a polygenic selection test in present-day populations fails to detect selection with the full set of variants; rather, only the top five show strong evidence of selection. Our data allow us to disentangle the effects of admixture and selection. Most notably, a large-effect variant at SLC24A5 was introduced to Europe by migrations of Neolithic farming populations but continued to be under selection post-admixture. This study shows that the response to selection for light skin pigmentation in West Eurasia was driven by a relatively small proportion of the variants that are associated with present-day phenotypic variation.

There are a lot of moving parts in this preprint. Look closely, and you will notice that the authors are careful to stipulate that they can’t really infer the pigmentation of ancient peoples, only the alleles ascertained in modern populations. This matters, because naive deployments of polygenic risk score models trained on modern populations projected on ancient ones seem highly suspect. I’m thinking here mostly of the “Cheddar Man is black” meme. It is true that using modern SNP batteries Mesolithic Europeans are predicted to be rather dark-skinned, but higher latitude humans tend to be paler, on average, than lower latitude humans (albeit, not as pale as the typical Northern European!). But, we can be sure about the alleles we do know about, and, their likely effect (the functional understanding of these pathways is pretty good).

The best modern genetic analyses of pigmentation suggest that variation is dominated by some large-effect loci, but that there is a large residual of smaller-effect loci segregating within the population (I’ve seen 50% accounted for with SNPs, and 50% as “ancestry”, which really masks small-effect QTLs). This is in contrast with the architecture in height, where there are few large-effect loci, and almost all of the variance is small-effect loci. What Ju et al. confirm is that selection “for pigmentation” is due to the large-effect loci; there’s no polygenic selection detectable on the smaller-effect loci for the ancient populations. Importantly, the change in allele frequency isn’t just due to admixture. It’s also due to selection after admixture.

I use quotes above because honestly, I think these sorts of results make it unclear what the selection was for. The general prior is conditioned on the fact that even after a few decades we still think of EDAR as a hair-thickness gene, but it’s one of the strongest signals of selection in the human genome. The “light” allele in SLC24A5 is at an incredibly high frequency in Europe today, and has increased in the last 4,000 years. Though this SNP is impactful for the complexion, it’s hard to imagine how strong selection must be to drive it from 95% to 99.5% (as per the 2005 paper on this SNP, the “light” allele exhibits some phenotypic dominance).
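To put a rough number on that, here is a sketch of deterministic single-locus selection with a dominance parameter h. It rests on several assumptions of mine: the 95% to 99.5% change is taken to span roughly 4,000 years, a generation is 28 years, dominance for fitness mirrors dominance for the phenotype, and drift and migration are ignored.

```python
def next_freq(p, s, h):
    """One generation of viability selection on a favored allele A.

    Genotype fitnesses: AA = 1 + s, Aa = 1 + h*s, aa = 1.
    """
    q = 1 - p
    mean_fitness = 1 + s * (p * p + 2 * p * q * h)
    return p * (1 + s * (p + h * q)) / mean_fitness


def final_freq(p0, s, h, generations):
    p = p0
    for _ in range(generations):
        p = next_freq(p, s, h)
    return p


def required_s(p0, target, h, generations):
    """Bisect on s in [0, 1]; returns None if even s = 1 falls short."""
    lo, hi = 0.0, 1.0
    if final_freq(p0, hi, h, generations) < target:
        return None
    for _ in range(50):
        mid = (lo + hi) / 2
        if final_freq(p0, mid, h, generations) < target:
            lo = mid
        else:
            hi = mid
    return hi


GENERATIONS = 4000 // 28  # assumes 28-year generations: 142 generations

for h in (0.5, 1.0):
    s = required_s(0.95, 0.995, h, GENERATIONS)
    if s is None:
        print(f"h = {h}: not reachable even with s = 1")
    else:
        print(f"h = {h}: s of about {s:.3f}")
```

In this toy model the additive case (h = 0.5) needs a selection coefficient of a few percent, and the fully dominant case cannot get there in that window at all, because selection barely sees a rare recessive allele hiding in heterozygotes. The more dominant the “light” allele is for whatever trait is under selection, the more striking the last stretch toward fixation becomes.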

As noted in the preprint, there’s not enough data on other regions of the world. It’s hard to assess what’s going on in Europe without assessing other regions. The authors do present an intriguing suggestion: that lighter pigmentation in East Asia is driven by smaller-effect genes shifted through polygenic selection.

I’ll present a strange hypothesis: selection for lighter skin at high latitudes through polygenic selection on standing variation naturally takes populations to the coloring of Northeast Asians. But very light complexion, as you see in Northern Europe, could be due to strong selection on the large-effect pigmentation genes, and pigmentation itself may simply be a side effect due to a genetic correlation with the true target of selection.

Assessing the utility of models in ancient DNA admixture analyses

Assessing the Performance of qpAdm: A Statistical Tool for Studying Population Admixture:

qpAdm is a statistical tool for studying the ancestry of populations with histories that involve admixture between two or more source populations. Using qpAdm, it is possible to identify plausible models of admixture that fit the population history of a group of interest and to calculate the relative proportion of ancestry that can be ascribed to each source population in the model. Although qpAdm is widely used in studies of population history of human (and non-human) groups, relatively little has been done to assess its performance. We performed a simulation study to assess the behavior of qpAdm under various scenarios in order to identify areas of potential weakness and establish recommended best practices for use. We find that qpAdm is a robust tool that yields accurate results in many cases, including when data coverage is low, there are high rates of missing data or ancient DNA damage, or when diploid calls cannot be made. However, we caution against co-analyzing ancient and present-day data, the inclusion of an extremely large number of reference populations in a single model, and analyzing population histories involving extended periods of gene flow. We provide a user guide suggesting best practices for the use of qpAdm.
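For readers who have never touched these tools, the basic idea of assigning ancestry proportions can be shown with a toy. To be clear, this is not qpAdm, which works from f4-statistics against a set of reference populations; it is just a least-squares fit of a target's allele frequencies as a mixture of two sources, with invented frequencies.

```python
def admixture_fraction(target, source_a, source_b):
    """Least-squares alpha in target ~ alpha*source_a + (1 - alpha)*source_b,
    where each argument is a list of allele frequencies at the same SNPs."""
    num = sum((t - b) * (a - b) for t, a, b in zip(target, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    return num / den


# Invented frequencies at five SNPs, purely illustrative.
freqs_a = [0.10, 0.80, 0.45, 0.30, 0.95]
freqs_b = [0.60, 0.20, 0.50, 0.90, 0.40]
mixed = [0.7 * a + 0.3 * b for a, b in zip(freqs_a, freqs_b)]  # 70% A, 30% B

alpha = admixture_fraction(mixed, freqs_a, freqs_b)
print(f"estimated ancestry from source A: {alpha:.2f}")
```

The toy ignores drift since admixture and assumes the true sources were sampled, which is exactly where the real method earns its keep.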

The Reich lab provides its software and data. It’s really not that hard to replicate and tweak some of the analyses they do in their papers (check the supplements for the detailed specifications of the parameters). I’ve done so many times when I got curious about a detail they hadn’t explored.

The preprint above is a valuable addition to the intuitions one can develop through using the packages.