Tutorial to run supervised admixture analyses

ID Dai Gujrati Lithuanians Sardinian Tamil
razib_23andMe 0.14 0.26 0.02 0.00 0.58
razib_ancestry 0.14 0.26 0.02 0.00 0.58
razib_ftdna 0.14 0.26 0.02 0.00 0.57
razib_daughter 0.05 0.14 0.29 0.18 0.34
razib_son 0.07 0.17 0.28 0.19 0.30
razib_son_2 0.06 0.19 0.29 0.19 0.27
razib_wife 0.00 0.07 0.55 0.38 0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully, you’ll be able to run supervised admixture analysis with less hassle after reading this. Here I’m pretty much aiming for laypeople. If you are a trainee you need to write your own scripts. The main goal here is to allow people to run a lot of tests to develop an intuition for this stuff.

The above results are from a supervised admixture analysis of my family and myself. There are three replicates of me because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into plink files separately, once per company. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipelines, not data quality.

The results are not surprising. I’m about 14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At a low fraction, you expect some variance in the F1.
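
To make the "somewhere in between" concrete, here is a minimal sketch of the mid-parent expectation: each component in a child is expected to be the simple average of the two parents. The midparent function is a hypothetical helper written for this illustration, not part of the scripts; the numbers are the razib_23andMe and razib_wife rows from the table above.

```bash
# Mid-parent expectation per component: the average of the two parents.
midparent() {
    # $1 and $2 are space-separated fraction lists in the same column order.
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= NF; i++) a[i] = $i; n = NF }
        NR == 2 {
            for (i = 1; i <= n; i++)
                printf "%s%.3f", (i > 1 ? " " : ""), (a[i] + $i) / 2
            print ""
        }'
}

# Columns: Dai Gujrati Lithuanians Sardinian Tamil
midparent "0.14 0.26 0.02 0.00 0.58" "0.00 0.07 0.55 0.38 0.00"
```

The expected “Dai” fraction comes out at 0.07, and the three children in the table sit at 0.05, 0.07 and 0.06, which is the kind of scatter around the expectation the paragraph above describes.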

Now below are results for three Swedes with the same reference panel:

Group ID Dai Gujrati Lithuanians Sardinian Tamil
Sweden Sweden17 0.00 0.09 0.63 0.28 0.00
Sweden Sweden18 0.00 0.08 0.62 0.31 0.00
Sweden Sweden20 0.00 0.05 0.72 0.23 0.00

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file. Decompress and put it somewhere you can find it.

Here is what the scripts do. First, imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such because I wanted to recognize the downstream files produced from each raw genotype. Leave the extensions as they are. You need to make sure they are not compressed, obviously. Then place them all in RAWINPUT/.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.
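
As an illustration of that naming rule, here is a small sketch of how an ID can be derived from a file name by dropping the directory and the extension. I am assuming this is roughly what happens; id_from_filename is a hypothetical helper, not a function taken from the scripts.

```bash
# Hypothetical helper: the ID is the text before the extension of the file name.
id_from_filename() {
    local name
    name=$(basename "$1")   # drop any leading directory such as RAWINPUT/
    echo "${name%.*}"       # drop the last extension only
}

id_from_filename "RAWINPUT/razib_23andMe.txt"
```

This is why the rename step matters: whatever sits before the extension is what you will see as the family ID, the individual ID, and in the output file names.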

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP. It is the same file from the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this:
grep "Dai\|Guj\|Lithua\|Sardi\|Tamil" Est1000HGDP.fam > ../keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files, and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script supplies the correct K by looking for unique population numbers.
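
For intuition, here is a sketch of what that bookkeeping could look like. ADMIXTURE's supervised mode expects a .pop file with one line per .fam row: a population label for reference individuals and "-" for individuals to be estimated. The functions make_pop and count_k are hypothetical, the toy rows are made up, and I am assuming reference family IDs look like AA_Ref_<population>; the real scripts may do this differently.

```bash
# Sketch only: build a .pop file from a .fam file and count K.
make_pop() {
    # Column 1 is the family ID; reference rows carry the AA_Ref_ prefix.
    awk '{ if ($1 ~ /^AA_Ref_/) { sub(/^AA_Ref_/, "", $1); print $1 } else print "-" }' "$1"
}
count_k() {
    # K = number of distinct reference labels.
    make_pop "$1" | grep -v '^-$' | sort -u | wc -l | tr -d ' '
}

# Toy .fam with made-up rows (not copied from Est1000HGDP).
popdir=$(mktemp -d)
printf '%s\n' \
  'AA_Ref_Dai d1 0 0 0 -9' \
  'AA_Ref_Tamil t1 0 0 0 -9' \
  'AA_Ref_Tamil t2 0 0 0 -9' \
  'razib_23andMe razib_23andMe 0 0 0 -9' > "$popdir/toy.fam"

make_pop "$popdir/toy.fam" > "$popdir/toy.pop"
cat "$popdir/toy.pop"
echo "K=$(count_k "$popdir/toy.fam")"
```

In the toy file the test individual gets a "-" and the two reference labels give K=2, which is the same logic that yields K=5 for the Dai, Gujrati, Lithuanian, Sardinian and Tamil panel.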

The admixture is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter so it can go multi-threaded (ADMIXTURE takes this as -jN, e.g. -j4 for four threads).

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original file name you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided as a source for a population to test? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified using the vim editor:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
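
If you would rather not open an editor, the same change can be made non-interactively. This is a sketch under the assumption that the prefix sits at the very start of each row (that is, on the family ID); strip_ref_prefix is a hypothetical helper and the two toy rows are made up.

```bash
# Sketch: remove a leading AA_Ref_ from every row of a .fam file.
strip_ref_prefix() {
    # Write to a temporary copy, then replace the original. This avoids
    # sed -i, whose syntax differs between GNU sed and the sed on a Mac.
    sed 's/^AA_Ref_//' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Toy file standing in for INDIVPLINKFILES/Sweden.fam.
famdir=$(mktemp -d)
printf '%s\n' \
  'AA_Ref_Sweden Sweden17 0 0 0 -9' \
  'AA_Ref_Sweden Sweden18 0 0 0 -9' > "$famdir/Sweden.fam"

strip_ref_prefix "$famdir/Sweden.fam"
cat "$famdir/Sweden.fam"
```

Only the .fam file is touched; the .bed and .bim files stay as they are.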

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file (the individuals are output to different plink files and merged in; you don’t need to know the details). If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember that for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

Again, here is the download to the zip files. And, remember, this only works on Ubuntu for sure (though now I hear it’s easy to run Ubuntu in Windows).

The fault in our parameters

Of the books I own, Elements of Evolutionary Genetics is one I consult frequently because of its range and comprehensiveness. The authors, Brian Charlesworth and Deborah Charlesworth, have an encyclopedic knowledge of the literature. To truly understand the evolutionary process in all its texture and nuance it is important to absorb a fair amount of theory, and Elements of Evolutionary Genetics does do that (though it’s not as abstruse as something like An Introduction to Population Genetics Theory).

When I see a paper by one of the Charlesworths, I try to read it. Not because I have a love of Drosophila or Daphnia, but because to develop strong population-genetics intuitions it always helps to stand on the shoulders of giants. So with that, I pass on this preprint, Mutational load, inbreeding depression and heterosis in subdivided populations:

This paper examines the extent to which empirical estimates of inbreeding depression and inter-population heterosis in subdivided populations, as well as the effects of local population size on mean fitness, can be explained in terms of estimates of mutation rates, and the distribution of selection coefficients against deleterious mutations provided by population genomics data. Using results from population genetics models, numerical predictions of the genetic load, inbreeding depression and heterosis were obtained for a broad range of selection coefficients and mutation rates. The models allowed for the possibility of very high mutation rates per nucleotide site, as is sometimes observed for epiallelic mutations. There was fairly good quantitative agreement between the theoretical predictions and empirical estimates of heterosis and the effects of population size on genetic load, on the assumption that the deleterious mutation rate per individual per generation is approximately one, but there was less good agreement for inbreeding depression. Weak selection, of the order of magnitude suggested by population genomic analyses, is required to explain the observed patterns. Possible caveats concerning the applicability of the models are discussed.

Burmese are a bit Bengali

About ten years ago I read the book The River of Lost Footsteps: Histories of Burma. Though I have read books where Burma figures prominently (e.g., Strange Parallels), this is the only history of Burma I have read. The author is Burmese, and provides something much more than a travelogue, as might have been the case if he were of Western background. By chance over the past month or so I’ve been in contact with the author, who made a few inquiries as to the genetics of his own family (he came with genotypes in hand). But this brought us to the issue of the genetics of the Burmese people, and their position in the historical-genetic landscape.

The author of The River of Lost Footsteps reminded me of something that’s curious about Southeast Asia: its Indic influences tend to be from the south of the subcontinent. In particular, the native scripts derive from a South Indian parent. Could genetics confirm this connection as well? Also, could genetics give some insights as to the timing of admixture/gene-flow?

In theory, yes.

I had a lot of Southeast Asian datasets to play with, and did a lot of pruning to remove outliers (e.g., people with obvious recent Chinese ancestry). First, comparing them to Bangladeshis, it seems even without local ancestry tract analysis that Burmese and Malays have more varied, and so likely more recent, exogenous ancestry than Bangladeshis. At least this is evident on the PCA plot, where these two groups exhibit strong admixture clines toward South Asians.

But what about the question of Southeast Asian affinities? This needs deeper analysis. Three-population tests, which measure admixture with outgroups when compared to a dyad of populations which are modeled as a clade, can be informative.

Outgroup Pop1 Pop2 f3 z
Bangladeshi Telugu Cambodians -0.00183999 -46.3322
Bangladeshi Telugu Han -0.00220121 -46.046
Burma Telugu Han -0.00406071 -51.0018
Burma Han Bangladeshi -0.00348186 -49.1398
Burma Han Punjabi_ANI_2 -0.00418193 -47.2351
Cambodians Telugu Viet -0.00126923 -16.91
Cambodians Punjabi_ANI_2 Viet -0.00129881 -15.6039
Cambodians Bangladeshi Viet -0.000970022 -14.5642
Malay Igorot Telugu -0.00249795 -18.758
Malay Igorot Bangladeshi -0.00223454 -18.5212
Malay Igorot Punjabi_ANI_2 -0.00250732 -18.3027
Malay Igorot Cambodians -0.00107817 -16.6214
Viet Han Cambodians -0.000569337 -13.1139
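
For readers who want to see what sits behind the f3 column, here is a naive sketch of the three-population statistic: for a tested population C and two candidate sources A and B, f3(C; A, B) is the average over SNPs of (c - a)(c - b), where a, b and c are allele frequencies. A clearly negative value is the admixture signal. I am assuming the first column of the table plays the role of C. The function f3_naive is hypothetical, the frequencies are made up, and real software adds a bias correction plus a block jackknife to get the z-score, none of which is attempted here.

```bash
# Naive f3(C; A, B) from allele frequencies; no bias correction, no z-score.
f3_naive() {
    # Input file: one SNP per line, columns = frequency in C, in A, in B.
    awk '{ s += ($1 - $2) * ($1 - $3); n++ }
         END { if (n) printf "%.4f\n", s / n }' "$1"
}

f3dir=$(mktemp -d)
# Made-up frequencies: C sits between A and B at every SNP,
# which is what a mixture of the two would tend to look like.
printf '%s\n' '0.5 0.1 0.9' '0.4 0.8 0.2' '0.5 0.3 0.7' > "$f3dir/toy_freqs.txt"

f3_naive "$f3dir/toy_freqs.txt"
```

Because C lies between A and B at each toy SNP, every product is negative and so is the average. The values in the table are small in absolute terms; the large negative z-scores are what make them convincing.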

Bangladeshis show strong signatures with both Cambodians and Han. This is in accordance with earlier analysis which suggests Austro-Asiatic and Tibeto-Burman contributions to the “East Asian” element of Bengali ancestry. The Burmese always have Han ancestry, with a South Asian donor as well. This aligns with other PCA analysis which shows the Burmese samples skewed toward Han Chinese. Burma is a compound of different ethnic groups. Some are Austro-Asiatic. The Bamar, the core “Burman” group, have some affinities to Tibetans. And the Shan are a Thai people who are relatively late arrivals.

Cambodians have a weaker admixture signature and are paired with a South Asian group and their geographic neighbors the Vietnamese. The Malays are similar to Cambodians but have the Igorot people from the Philippines as one of their donors. And finally, not surprisingly the Vietnamese show some mixture between Han-like and Cambodian-like ancestors.

Further PCA analysis shows that while Cambodians and Malays tend to skew somewhat neutrally to South Asians (the recent Indian migration to Malaysia is mostly Tamil), the Burmese are shifted toward Bangladeshis.

Finally, I ran some admixture analyses.

First, I partitioned the samples with an unsupervised set of runs (K = 4 and K = 5). In this way I obtained reified reference groups as follows:

“Austronesians” (Igorot tribesmen from the Philippines)
“Austro-Asiatic” (a subset of Cambodians with the least exogenous admixture)
“North Indians” (Punjabis)
“South Indians” (A subset of middle-caste Telugus highest on the modal element in South Indians)
“Han” (a proxy for “northern” East Asian)

The results are mostly as you’d expect. In line with three-population tests, the Vietnamese are Han and Austro-Asiatic. More of the former than latter. There is a minor Austronesian component. Notice there is no South Asian ancestry in this group.

In contrast, Cambodians have low levels of both North and South Indian. These out-of-sample Cambodians are still highly modal for Austro-Asiatic though.

Malays are more Austro-Asiatic than Austronesian, which might surprise. But the Igorot samples are highly drifted and distinct. I think these runs are underestimating Austronesian in the Malays. Notice that some of the Malays have South Asian ancestry, but a substantial number do not. This large range in admixture is what you see in PCA as well. I think this strongly points to the fact that Malays have been receiving gene-flow from India recently, as it is not well mixed into the population.

The Bangladeshi outgroup is mostly a mix of North and South Indian, with a slight bias toward the latter. No surprise. As I suggested earlier you can see that the Bangladeshi samples are hard to model as just a mix of Burmese with South Asians. The Austro-Asiatic component is higher in them than the Burmese. This could be because Burma had recent waves of northern migration (true), and, eastern India prior to the Indo-Aryan expansion was mostly inhabited by Austro-Asiatic Munda (probably true). That being said, the earlier analysis suggested that the Munda cannot be the sole source of East Asian ancestry in Bengalis.

Finally, every single Burmese sample has South Asian ancestry. Much higher than Cambodians. And, there is variance. I think that leads us to the likely conclusion that Burma has been subject to continuous gene-flow as well as recent pulses of admixture from South Asia. The variation in South Asian ancestry in the Burmese is greater than East Asian ancestry in Bengalis. I believe this is due to more recent admixture in Burmese due to British colonial Indian settlement in that country.

The cultural and historical context of this discussion is the nature of South Asian, Indic, influence on Southeast Asia. One cannot deny that there has been some gene-flow between Southeast Asia and South Asia. In prehistoric times it seems that Austro-Asiatic languages moved from mainland Southeast Asia to India. More recently there are historically attested, and genetically confirmed, instances of colonial Indian migration. But the evidence from Cambodia suggests that this gene-flow is likely also ancient, as unlike Malaysia or Burma, Cambodia did not have any major flow of Indian migrants during the colonial period. One could posit that perhaps the Cambodian Indian affinity is a function of “Ancestral South Indian.” But the Cambodians are not skewed toward ASI-enriched groups in particular. And, I know for a fact that appreciable frequencies of R1a1a exist within the male Khmer population (this lineage is common in South Asia, especially the north and upper castes).

As far as Burma goes, I think an older period of South Indian cultural influence, and some gene-flow seems likely. But, with the expansion of Bengali settlement to the east over the past 2,000 years, more recent South Asian ancestry is probably enriched for that ethnolinguistic group.

I’m going to try and follow-up with some ancestry tract analysis….

Soft & hard selection vs. soft & hard sweeps

When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics rather than population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level because I had read about them as I was getting confused with hard and soft sweeps! But this was more than ten years ago now, and since then I haven’t given the matter enough thought obviously, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype is the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And, the region of the genome around the mutation is also the same, because regions of the genome adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive to extinction most other competitive variants.

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous alleles and those without it are not so extreme, and obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.

The mutation accumulation controversy continues….

Every few years I check to see if the great mutation accumulation controversy has resolved itself. I don’t know if anyone calls it that, but that’s what I think of it as. There are two major issues that matter here: mutation rates are a critical parameter in evolutionary models, and, mutation accumulation over time matters for parental age effects when it comes to disease (speaking as an older father!).

In the latter case, I’m talking about the reasons that people freeze their eggs or sperm. In the former case, I’m talking about whether we can easily extrapolate mutation rates over evolutionary time as semi-fixed, so we can infer dates of last common ancestry and such. To give a concrete example of what I’m talking about, if mutation rates varied a lot over the evolutionary history of our hominin lineage, then we might need to rethink some of the inferred timings.

Today two preprints came out on mutation accumulation. First, Overlooked roles of DNA damage and maternal age in generating human germline mutations. Second, Reproductive longevity predicts mutation rates in primates. What a coincidence in synchronicity!

Additionally, the last author on the second preprint, Matt Hahn, is someone I’ll be doing a podcast with this week. So aside from talking about neutral theory, and his book Molecular Population Genetics, I’m going to have to bring up this mutation business.

The figure above from the first preprint shows that the proportion of mutations derived from the father doesn’t increase over time, as textbooks generally state. Why would we expect this? Sperm keeps replicating after puberty so you should be gaining more mutations. In contrast, the eggs are arrested in meiosis. There are various mechanistic reasons that the authors of the first preprint give for why the ratio does not change between paternal and maternal mutations (e.g., non-replicative mutations seem to be the primary one). The authors are using a very “pedigree” strategy, rather than an “evolutionary” one. They’re looking at sequenced trios, and noticing patterns. I think in the near future they’ll be far more sure of what’s going on because they’ll have bigger sample sizes. They admit the effects are subtle (also, some of the p-values are getting close to 0.05).

Instead of focusing on a human pedigree, the second preprint does some sequencing on owl monkeys (I had no idea there were “owl monkeys” before this paper). They find that the mutation rate is ~32% lower in owl monkeys than in humans. Why is this?

The plot to the left shows that mutations increase with age across species (though the number of data points is pretty small). The authors contend that:

The association between mutation rates and reproductive longevity implies that changes in life history traits rather than changes to the mutational machinery are responsible for the evolution of these rates. Species that have evolved greater reproductive longevity will have a higher mutation rate per generation without any underlying change to the replication, repair, or proofreading proteins.

If I read this right: owl monkeys reproduce fast and don’t have as much reproductive longevity. Ergo, lower mutation rates (less mutational build-up from paternal side).

After all these years I’m still not convinced about anything. I assume that eventually bigger data sets will come online and we’ll resolve this. Someone has to be right!

(not too many people on Twitter get what’s going on either)

The peoples of the Maghreb have some Pleistocene roots

Moroccan Berber man

The Maghreb is an important and interesting place. In the history of Western civilization, the tension between Carthage, the ancient port city based out of modern-day Tunisia, and Rome, is one of the more dramatic and tragic rivalries that has resonances down through the ages. Read Adrian Goldsworthy’s chapter on the Battle of Cannae in The Punic Wars for what I’m alluding to (and of course there were Cato the Elder’s dramatic remonstrations).

Later Roman Africa, which really encompassed northern Morocco, coastal Algeria, and Tunisia and Tripolitania, became a major social and economic pillar of the Imperium. Not only did men such as the emperor Septimius Severus and St. Augustine have roots in the region, but these provinces were a major economic bulwark for the Western Empire in its last century. The wealthy Senators of the 4th and 5th century were often absentee landlords of vast estates in North Africa. The fall of these provinces to the Vandals and Alans in the 430s began the transformation of the Western Empire based in Rome into a more regional player, rather than a true hegemon (perhaps an analogy here can be made to the loss of Anatolia by the Byzantines in the 11th century).

Another important aspect of North Africa is that it is the westernmost extension of the region possibly settled by Near Eastern farmers in Africa. The native Afro-Asiatic Berber languages seem to have been dominant in the region despite the influence and prestige of Punic and Latin in the cities when Muslim Arabs conquered the region in the late 7th century. The genetic-demographic characteristics of the region are relevant to attempts to understand the origins of the Afro-Asiatic languages more generally since Berber is part of the clade with the Semitic languages.

A preprint and a paper utilizing ancient DNA have shed a great deal of light on these questions recently. The paper is in Science, Pleistocene North African genomes link Near Eastern and sub-Saharan African human populations. The preprint is Ancient genomes from North Africa evidence prehistoric migrations to the Maghreb from both the Levant and Europe. They are in broad agreement, though they cover somewhat different periods.

The figure below is the big finding of the Science paper:

They retrieved some genotypes from a site in northern Morocco, Taforalt, which dates to ~15,000 years before the present. This is a Pleistocene site, before the rise of agriculture. The Taforalt individuals are about 65% Eurasian in affinity, and 35% Sub-Saharan African. This confirms that the Eurasian back-migration to northern Africa predates the Holocene, just as many archaeologists and geneticists have reported earlier.

The samples from the preprint date to a later time. The IAM samples date to ~7,200 years before the present, and KEB to ~5,000 years before the present. It seems pretty clear that the IAM samples in the preprint exhibit continuity with the Taforalt samples. Though it is not too emphasized in the preprint, the lower K’s seem to strongly suggest that the IAM samples have Sub-Saharan African ancestry, just like the Taforalt samples which are nearly 8,000 years older. In the KEB samples, the fraction drops, probably diluted in part by ancestry related to what we elsewhere term “Early European Farmer” (EEF), related to the Anatolian farming expansion.

Both the Taforalt and IAM samples, in particular, seem to exhibit strong affinities to Natufian/Levantine peoples. Additionally, many of these samples carry Y chromosome haplogroup E1b, just like some of the Natufians. These results indicate that the Natufian and North African populations were exchanging genes, or formed one cline, rather deep in the Pleistocene.

Though various methods have suggested that there is a lot of recent Sub-Saharan African admixture, dating to the Arab period, in North Africa, these results suggest that much of it is far older. The Mozabites, as an isolated Berber group, reflect this tendency. Though some individuals have inflated African ancestry due to recent admixture, much of it is older and more evenly distributed. And yet the Mozabites seem to have less Sub-Saharan African ancestry on average than the IAM sample.

There aren’t enough data points to make a strong inference about the temporal transect, but these few results imply a decline in Sub-Saharan ancestral component after the Pleistocene with further farming migration, and then a rise again with the trans-Saharan slave trade during the Muslim period. Another issue, highlighted in the preprint, is likely heterogeneity within the Maghreb in ancestry (lowland populations in modern North Africa tend to have more Sub-Saharan ancestry due to where slaves were settled).

In the Science paper the authors make an attempt to adduce the origin of the Sub-Saharan contribution to the Taforalt individuals. The result is that there is no modern or ancient proxy that totally fits the bill. These individuals have affinities to many Sub-Saharan African populations. The Sub-Saharan component is likely heterogeneous, but attempts to model European genetic variation during the Ice Age ran into the trouble that divergence from modern populations was quite great. Until we get more ancient DNA there probably won’t be too much more clarity.

On the issue of the Eurasian ancestry, it’s clearly quite like the Natufians. But curiously the authors find that the Neanderthal ancestry in these samples is greater than that found in early Holocene Iran samples. From this, the authors conclude that they may have had a lower fraction of “Basal Eurasian” (BEu) than those populations further to the east. But already 15,000 years ago BEu populations were mixed with more generic West Eurasians to generate the back-migration to Africa. If BEu diverged from other Eurasians >50,000 years ago, then it may have merged back into the “Out-of-Africa” populations around or before the Last Glacial Maximum, ~20,000 years ago.

Finally, the authors looked at some pigmentation genes. Curiously the Taforalt and IAM individuals did not carry the derived variants for pigmentation found in many West and South Eurasians, but the KEB did. This confirms results from Europe, and population genomic inference in modern samples, that selection for derived pigmentation variants is relatively recent in the Holocene.

I do want to add that one possibility about the Sub-Saharan ancestry in the Taforalt, and probably all modern North Africans to a lesser extent, is that it is ancient and local. We now know proto-modern humans were present in the region >300,000 years ago. Northwest Africa may have been part of the multi-regional metapopulation of H. sapiens, as opposed to the Eurasian biogeographic zone in which it is often placed, before a post-LGM back migration of Eurasians.

Are Turks Armenians under the hood?

Benedict Anderson’s Imagined Communities: Reflections on the Origin and Spread of Nationalism is one of those books I haven’t read, but should. In contrast, I have read Azar Gat’s Nations, which is a book-length counterpoint to Imagined Communities. To take a stylized and extreme caricature, Imagined Communities posits nations to be recent social and historical constructions, while Nations sees them as primordial, and at least originally founded on ties of kinship and blood.

The above doesn’t capture the subtlety of Gat’s book, and I’m pretty sure it doesn’t capture that of Anderson’s either. But, those are the caricatures that people take away and project in public, especially Anderson’s (since Gat’s is not as famous).

When it comes to “imagined communities” I recently have been thinking about how well that of modern Turks fits into the framework. Though forms of pan-Turkic nationalism can be found as early as 9th-century Baghdad, the ideology truly emerges in force in the late 19th century, concomitantly with the development of a Turkish identity in Anatolia which is distinct from the Ottoman one.

The curious thing is that though Turkic and Turkish identity is fundamentally one of language and secondarily of religion (the vast majority of Turkic peoples are Muslim, and there are periods, such as the 17th century when the vast majority of Muslims lived in polities ruled by people of Turkic origin*), there are some attempts to engage in biologism. This despite the fact that the physical dissimilarity of Turks from Turkey and groups like the Kirghiz and Yakut is manifestly clear.

Several years ago this was borne out in the paper The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia. This paper clearly shows that Turkic peoples across Eurasia have been impacted by the local genetic substrate. In plainer language, the people of modern-day Turkey mostly resemble the people who lived in Turkey before the battle of Manzikert and the migration of Turkic nomads into the interior of the peninsula in the 11th century A.D. Of course, there is some genetic element which shows that there was a migration of an East Asian people into modern-day Anatolia, but this component is the minority one.**

Sometimes the Turkish fascination with the biological comes out in strange ways, as in the article Turkish genealogy database fascinates, frightens Turks. Much of the discussion has to do with prejudice against Armenians and Jews. But the reality is that most Turks at some level do understand that they are descended from Greeks, Armenians, Georgians, etc.

To interrogate this further I decided to look at a data set of Greeks, Turks, Armenians, Georgians, and a few other groups, including Yakuts, who are the most northeastern of Turkic peoples. The SNP panel was >200,000, and I did some outlier pruning. Additionally, I didn’t have provenance on a lot of the Greeks, except some labeled as from Thessaly. I therefore just split those up with “1” being closest to the Thessaly sample and “3” the farthest.

First, let’s look at the PCA.

Genetic distances across Eurasia

I feel that, for whatever reason, over the past few years many people on this weblog have started to exhibit weak intuitions about the magnitude of between-population differences. Two suggestions for why this might occur.

* First, the proliferation of PCA plots with individuals can make it hard to discern averages

* Second, model-based admixture plots don’t explicitly quantify the differences between the different clusters

To get a better sense of between-group differences I decided to take a step back and look at Fst. Fst basically looks at all the genetic variance across a set of groups and quantifies the proportion that can be attributed to differences between groups, as opposed to differences among individuals within them.

The plot at the top of this post is from an Fst matrix I generated with Plink (I wrote a script to do the pairwise comparison). I did some PCA pruning of the populations to be clear (e.g., with both Cambodians and Filipinos I made them more distinct than they would otherwise be). The goal was to give people a sense of genetic distances within regions and between them.
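For readers who want to see what a pairwise Fst script boils down to, here is a minimal sketch in Python. It is not the script I ran and it is not Plink's estimator: it assumes you already have per-SNP allele frequencies for each population, and it uses the simple heterozygosity form of Fst, (H_T - H_S) / H_T, summed over SNPs, with no correction for sample size. The function names and the toy frequencies are made up for illustration.

```python
from itertools import combinations

import numpy as np


def fst(p1, p2):
    """Fst between two populations from per-SNP allele frequencies.

    Ratio-of-sums form: sum(H_T - H_S) / sum(H_T).
    Assumes at least one SNP is polymorphic in the pooled sample.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two groups were one pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Average of the two within-population expected heterozygosities.
    h_within = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return float((h_total - h_within).sum() / h_total.sum())


def pairwise_fst(freqs):
    """Build a symmetric Fst matrix from {population name: frequency array}."""
    names = sorted(freqs)
    matrix = np.zeros((len(names), len(names)))
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        matrix[i, j] = matrix[j, i] = fst(freqs[a], freqs[b])
    return names, matrix


# Toy allele frequencies at three SNPs (hypothetical, not real data).
toy = {
    "A": [0.10, 0.50, 0.90],
    "B": [0.10, 0.50, 0.90],  # identical to A, so Fst(A, B) is 0
    "C": [0.90, 0.50, 0.10],  # strongly diverged from A at two SNPs
}
names, matrix = pairwise_fst(toy)
for name, row in zip(names, matrix):
    print(name, np.round(row, 4))
```

Two populations with identical frequencies give 0, and two populations fixed for different alleles at every SNP give 1. Real values between human groups, like the ones in this post, sit near the bottom of that range.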

I also generated a PCA plot and a Treemix plot, for the sake of comparison.

It’s also useful to look at a few group comparisons and judge them in a global context.

Tamil Telugu 0.0011
Tamil Tamil Scheduled Caste 0.0016
Tamil Bangladeshi 0.0024
Tamil South Indian Brahmin 0.0031
Tamil Uttar Pradesh Brahmin 0.0041
Tamil Sindhi 0.0087
Tamil Vietnamese 0.0668
Southern Chinese Northern Chinese 0.0033
Southern Chinese Vietnamese 0.0034
Southern Chinese Korea 0.0045
Southern Chinese Japanese 0.0087
Southern Chinese Tamil 0.0711
Southern Chinese Polish 0.1141
Gujurati_Patel Telugu 0.0062
Gujurati_Patel Uttar Pradesh Brahmin 0.0065
Gujurati_Patel Bangladeshi 0.0069
Gujurati_Patel Velama 0.0094
Gujurati_Patel Sindhi 0.0104
Gujurati_Patel Polish 0.0405
Gujurati_Patel Japanese 0.0781
GreatBritain Ireland 0.0015
GreatBritain Polish 0.0043
GreatBritain Sicily 0.0077
GreatBritain Uttar Pradesh Brahmin 0.0264
GreatBritain Tamil 0.0430
GreatBritain Korea 0.1130

The non-Brahmin and non-Dalit samples in the 1000 Genomes are not partitioned much by geography. The Tamil vs. Telugu difference is smaller than that between the British and Irish. Within Tamil Nadu, Brahmins though are nearly as different from typical Tamils as Poles are from the English (most of the British sample is English). The biggest differences in Europe are between Sicilians and Northern European groups, which are similar in degree to those between South Indians and Pakistanis. The South Chinese sample is nearly as close to Vietnamese as it is to a North Chinese group, while the difference between Koreans and Chinese is relatively small when compared to the variance you see in South Asia and Europe.

Note: Drift tends to inflate Fst.

Natural selection in humans (OK, 375,000 British people)


The above figure is from Evidence of directional and stabilizing selection in contemporary humans. I’ll be entirely honest with you: I don’t read every UK Biobank paper, but I do read those where Peter Visscher is a co-author. It’s in PNAS, and a draft which is not open access. But it’s a pretty interesting read. Nothing too revolutionary, but confirms some intuitions one might have.

The abstract:

Modern molecular genetic datasets, primarily collected to study the biology of human health and disease, can be used to directly measure the action of natural selection and reveal important features of contemporary human evolution. Here we leverage the UK Biobank data to test for the presence of linear and nonlinear natural selection in a contemporary population of the United Kingdom. We obtain phenotypic and genetic evidence consistent with the action of linear/directional selection. Phenotypic evidence suggests that stabilizing selection, which acts to reduce variance in the population without necessarily modifying the population mean, is widespread and relatively weak in comparison with estimates from other species.

The stabilizing selection part is probably the most interesting part for me. But let’s hold up for a moment, and review some of the major findings. The authors focused on ~375,000 samples which matched their criteria (white British individuals old enough that they are well past their reproductive peak), and the genotyping platforms had 500,000 markers. The dependent variable they’re looking at is reproductive fitness. In this case specifically, “rLRS”, or relative lifetime reproductive success.

With these huge data sets and the large number of measured phenotypes they first used the classical Lande and Arnold method to detect selection gradients, which leveraged regression to measure directional and stabilizing dynamics. Basically, how does change in the phenotype impact reproductive fitness? So, it is notable that shorter women have higher reproductive fitness than taller women (shorter than the median). This seems like a robust result. We’ve seen it before on much smaller sample sizes.
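To make the regression idea concrete, here is a minimal sketch in Python of the textbook Lande and Arnold setup, run on made-up numbers. It is not the authors' pipeline and has none of their covariates. The assumptions: the trait is standardized to mean 0 and variance 1, fitness is divided by its mean to get relative fitness, β is the slope from a linear-only regression, and γ is the coefficient on half the squared trait in a quadratic regression. The function name and the "heights" and "children" values are hypothetical.

```python
import numpy as np


def selection_gradients(trait, fitness):
    """Return (beta, gamma) for one trait, Lande and Arnold style.

    beta  < 0 : lower trait values have higher relative fitness
    gamma < 0 : stabilizing selection (extremes do worse)
    """
    trait = np.asarray(trait, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    z = (trait - trait.mean()) / trait.std()  # standardized trait
    w = fitness / fitness.mean()              # relative fitness, mean 1

    # Directional gradient: slope of w on z in a linear-only model.
    linear = np.column_stack([np.ones_like(z), z])
    beta = np.linalg.lstsq(linear, w, rcond=None)[0][1]

    # Quadratic gradient: coefficient on 0.5 * z**2 in the full model,
    # so no doubling of the fitted coefficient is needed afterwards.
    quadratic = np.column_stack([np.ones_like(z), z, 0.5 * z**2])
    gamma = np.linalg.lstsq(quadratic, w, rcond=None)[0][2]
    return float(beta), float(gamma)


# Synthetic, noise-free example (hypothetical values, not UK Biobank data):
# fitness falls with height and also falls off at both extremes.
heights = np.linspace(150.0, 190.0, 201)
z = (heights - heights.mean()) / heights.std()
children = 2.0 - 0.2 * z - 0.05 * z**2
beta, gamma = selection_gradients(heights, children)
print(f"beta = {beta:.4f}, gamma = {gamma:.4f}")
```

In this toy case both gradients come out negative: a negative β is the pattern described for height in women, and a negative γ is what stabilizing selection looks like. With real data you would also want standard errors and the confounder adjustments discussed further down.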

The results using phenotypic correlations for directional (β) and stabilizing (γ) selection are shown below separated by sex. The abbreviations are the same as above.


There are many cases where directional selection seems to operate in females, but not in males. But they note that that is often due to near zero non-significant results in males, not because there were opposing directions in selection. Height was the exception, with regression coefficients in opposite directions. For stabilizing selection there was no antagonistic trait.

A major finding was that compared to other organisms stabilizing selection was very weak in humans. There’s just not that much pressure against extreme phenotypes. This isn’t entirely surprising. First, you have the issue of the weirdness of a lot of studies in animal models, with inbred lines, or wild populations selected for their salience. Second, prior theory suggests that a trait with lots of heritable quantitative variation, like height, shouldn’t be subject to that much selection. If it had been, the genetic variation which was the raw material of the trait’s distribution wouldn’t be there.

Using more complex regression methods that take into account confounds, they pruned the list of significant hits. But, it is important to note that even at ~375,000, this sample size might be underpowered to detect really subtle dynamics. Additionally, the beauty of this study is that it added modern genomic analysis to the mix. Detecting selection through phenotypic analysis goes back decades, but interrogating the genetic basis of complex traits and their evolutionary dynamics is new.

To a first approximation, the results were broadly consonant across the two methods. But, there are interesting details where they differ. There is selection on height in females, but not in males. This implies that though empirically you see taller males with higher rLRS, the genetic variance that is affecting height isn’t correlated with rLRS, so selection isn’t occurring in this sex.

~375,000 may seem like a lot, but from talking to people who work in polygenic selection there is still statistical power to be gained by going into the millions (perhaps tens of millions?). These sorts of results are very preliminary but show the power of synthesizing classical quantitative genetic models and ways of thinking with modern genomics. And, it does have me wondering about how these methods will align with the sort of stuff I wrote about last year which detects recent selection on time depths of a few thousand years. The SDS method, for example, seems to be detecting selection for increasing height the world over…which I wonder is some artifact, because there’s a robust pattern of shorter women having higher fertility in studies going back decades.

A genetic map of the world

The above map is from a new preprint on the patterns of genetic variation as a function of geography for humans, Genetic landscapes reveal how human genetic diversity aligns with geography. The authors assemble an incredibly large dataset to generate these figures. The orange zones are “troughs” of gene flow. Basically barriers to gene flow. It is no great surprise that so many of the barriers correlate with rivers, mountains, and deserts. But the aim of this sort of work seems to be to make precise and quantitative intuitions which are normally expressed verbally.

To me, it is curious how the borders of the People’s Republic of China are evident on this map (an artifact of sampling?). Additionally, one can see Weber’s line in Indonesia. There are the usual important caveats of sampling, and caution about interpreting present variation and dynamics back to the past. But I believe that these sorts of models and visualizations are important nulls against which we can judge perturbations.

As I said, these methods can confirm rigorously what is already clear intuitively. For example:

Several large-scale corridors are inferred that represent long-range genetic similarity, for example: India is connected by two corridors to Europe (a southern one through Anatolia and Persia ‘SC’, and a northern one through the Eurasian Steppe ‘NC’)

We still don’t have enough ancient DNA to be totally sure, but it’s hard to ignore the likelihood that “Ancestral North Indians” (ANI) actually represent two different migrations.

India also illustrates the contingency of these barriers. Before the ANI migration, driven by the rise in agricultural lifestyles, there would likely have been a major trough of gene flow on India’s western border. In fact, a deeper one than the one on the eastern border. And if the high genetic structure statistics from ancient DNA are further confirmed then the rate of gene flow was possibly much lower between demes in the past. Perhaps that would simply re-standardize equally so that the map itself would not be changed, but I suspect that we’d see many more “troughs” during the Pleistocene and early Holocene.

Because there are so many geographically distributed samples for humans, and frankly some of the best methods developers work with human data (thank you NIH), it is no surprise that our species would be mapped first. But I think some of the biggest insights may be with understanding the dynamics of gene flow of non-human species, and perhaps the nature and origin of speciation as it relates to isolation (or lack thereof).