Long-time readers of this weblog know that about fifteen years ago I dabbled in a little worm-work. At that time I read In the Beginning Was the Worm: Finding the Secrets of Life in a Tiny Hermaphrodite. As Brenner was involved in promoting C. elegans as a model he occupies a lot of this book. I recommend it. It’s short and packed with historical nuggets that make the 21st-century trajectory of science more comprehensible.
I’ve been looking at the data from the recent Munda paper. Standard stuff: admixture, Treemix, and f-statistics. The northern Munda samples were collected in Bangladesh. So I thought: I can test the hypothesis that the East Asian ancestry in Bangladesh is in large part Santhal. After looking at it every which way, I think that in fact the Munda may never have been very populous in much of northeast India. The Santhal are just not a good donor population for Bengalis, at least not when compared with mixes such as Dai + Tamil.
Additionally, the Santhal are really not that well modeled by mixing South Asians with any particular Southeast Asian group, though it works. I think that’s suggestive of the possibility that the Austro-Asiatic group which gave rise to the Munda doesn’t exist in its current form anywhere in Southeast Asia. I also think the Lao samples provided in the new paper may have Indian ancestry via admixture from Austro-Asiatic Mon or Khmer groups.
Basically, there is so much bidirectional gene flow that I think it’s really hard to get a grip on what’s going on. Additionally, the Burmese and northeast Indian populations (e.g., the Mizos) clearly have a strand of ancestry that derives from relatively recent migrants that came down from the region of eastern Tibet, and perhaps Sichuan or even further north. And this component shows up in Bengalis as well.
On top of this, there is the “Australo-Melanesian” substrate that is present all across Southeast Asia, and probably was present in modern southern China in the early Holocene, which has distant affinities with the “Ancient Ancestral South Indians” (AASI).
At this point, I keep my own counsel. But there may be an interesting story to tell related to how efficient and effective different forms of agriculture were, and how that interplayed with genes and language.
One of the reasons that I don’t post AdmixTools results too much is that the framework requires more statistical “deep thought” than just popping out a PCA or even running some model-based clustering. Read the methods supplements of one of the Reich lab’s papers, and you’ll see what I’m getting at. But a more prosaic reason is that I generally work in the plink format, and format conversion, as well as editing parameter files, is a pain. In general, I don’t do much “exploratory AdmixTools” stuff for a reason.
Martin Petr has made the second excuse a lot less of an excuse. His admixr package gives one an easy interface into AdmixTools. In particular, it allows one not to have to edit parameter files so much. It took me about 15 minutes to get it downloaded and running. I’m on a Mac and use RStudio for R.
– Remember to install wget if you are on a Mac (this matters if you want to use online datasets)
– You need to make sure to set the path to AdmixTools. In the RStudio console, I just entered:
If you can get AdmixTools installed in the first place, admixr should be very easy.
Like many Americans in the year 2018 I’ve got a whole pedigree plugged into personal genomic services. I’m talking from grandchild to grandparent to great-aunt/uncles. A non-trivial pedigree. So we as a family look closely at these patterns, and we’re not surprised at this point to see really high correlations in some cases compared to what you’d expect (or low).
This means that you can see empirically the variation between relatives of the same nominal degree of separation from a person of interest. For example, absent any other information, each of my children’s grandparents is expected to contribute 25% of their autosomal genome. But I actually know the variation in contribution empirically. For example, my father is enriched in my daughter. My mother is enriched in my sons.
The same principle applies to siblings. Though they should be 50% related on their autosomal genome, it turns out there is variation. I’ve seen some papers with large data sets (e.g., 20,000 sibling pairs) which give a standard deviation of 3.7% in relatedness. But what about other degrees of relation?
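To get a feel for why the grandparental share scatters around 25%, here is a minimal simulation sketch. It rests on a strong simplifying assumption that is not from the personal genomics services: the half of the genome a child inherits from one parent is treated as 50 independent blocks, each coming from that parent's father or mother with equal probability. The block count and the function name are hypothetical choices for illustration; real recombination does not produce equal, independent blocks.

```python
import random
import statistics


def simulate_grandparent_shares(n_blocks=50, n_grandchildren=10_000, seed=1):
    """Simulate the share of a grandchild's autosomal genome that comes
    from one particular grandparent (e.g. the paternal grandfather).

    n_blocks is an assumed number of independently inherited blocks in the
    half of the genome passed on by one parent. It is an illustrative
    value, not a measured one.
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(n_grandchildren):
        # Each block of the paternal half traces back to the paternal
        # grandfather with probability 0.5, otherwise to the grandmother.
        from_grandfather = sum(rng.random() < 0.5 for _ in range(n_blocks))
        # The paternal half is 50% of the whole autosomal genome.
        shares.append(0.5 * from_grandfather / n_blocks)
    return shares


shares = simulate_grandparent_shares()
mean_share = statistics.mean(shares)
sd_share = statistics.pstdev(shares)
print(f"mean share: {mean_share:.3f}")
print(f"standard deviation: {sd_share:.3f}")
```

Under this toy model the standard deviation is 0.25 divided by the square root of the number of blocks, so fewer, larger blocks mean more scatter between grandchildren.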
About ten years ago I reviewed Bryan Sykes’ book Saxons, Vikings, and Celts: The Genetic Roots of Britain and Ireland. It was what it was, a product of the Y/mtDNA era. Therefore, there were a fair number of conclusions which in hindsight turned out to be wrong. Sykes, and other genetic historians, such as Stephen Oppenheimer, have annoyed historians for years with their genetic imperialism. More frequently, genetic research has been an accent or inflection on historical work. Peter Heather has integrated some genetic results in his earlier books, though you can ignore those and still obtain the general conclusions.
The recent work on near antiquity is a hint that that is going to be blown apart. Ancient DNA in the historical period has been a slow simmer for a while now. The reason is simple: ancient DNA returns more on the investment for prehistory, where there aren’t historical documents. Until recently ancient DNA techniques were expensive in a variety of ways. The industrial process described in Who We Are and How We Got There is going to change that.
In the near future, a large number of projects are going to surface which test hypotheses and conjectures offered by historians.
You would think that testing hypotheses, generally with demographic predictions, would be something that historians would welcome. The problem is that the test will mean some scholars are going to turn out to be wrong. People who spent decades building up a particular model or understanding of the past are going to have that torn away from them.
The normal human reaction is to get defensive. But the problem is that many historians are not well trained in genetic methods. In fact, many geneticists are not well trained in the abstruse statistical methods developed by scholars in ancient DNA.
We’ve seen some of the same from archaeologists. But archaeologists had models which were, to be frank, more speculative than those historians cling to. Even if a particular historical model may be wrong, it is likely there were reasonable grounds to have held on to that position. If ancient DNA falsifies it, the reaction will be even more strident, I suspect.
Of course, geneticists need the help of historians. So when the bad feelings clear I think the synthesis will get us to a better understanding of the past.
In the 1000 Genomes, there is a Punjabi dataset. Here is the description:
These cell lines and DNA samples were prepared from blood samples collected in Lahore, Pakistan. The samples are from a mix of parent- adult child trios and unrelated individuals who identified themselves and their parents as Punjabi.
A few years ago I did an analysis of the population structure in the 1000 Genomes dataset. In the Chinese data, there seemed to be some curious structure (there were two clusters of South Chinese). But the biggest issues predictably were in the South Asians. To give concrete examples, there were a few Brahmins in the Telugu data. A subset of Tamils and Telugus were highly ASI shifted. The Gujarati were highly heterogeneous, and one subcluster was almost certainly Patels (the samples were collected in Houston). The ASI shifted groups were almost certainly Scheduled Castes (Dalits) because I could see that they clustered with those samples from the Estonian Biocentre dataset.
There was something curious about the samples from Pakistan and Bangladesh. Aside from a small number of individuals, whose samples were collected at the same time judging by their IDs (these individuals cluster with Scheduled Castes), the Bangladeshi sample didn’t have much South Asian style structure. That is, there wasn’t a cline or lots of substructure within the ethnicity.
As noted by some commenters, the Punjabi samples were very different. Like the Gujarati samples, there was a huge variance along the ANI-ASI cline. To me, this was somewhat surprising. To make the 1000 Genomes more useful I used PCA and divided both Gujaratis and Punjabis into groups based on their position on the ANI-ASI cline, so that ANI_1 is the subpopulation with the most ANI and ANI_4 the one with the least.
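The grouping step can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used for the post: it assumes PC1 tracks the ANI-ASI cline with larger values meaning more ANI, and it cuts the ranking into four equal-sized bins, whereas the real cutoffs are not given here. The function name, sample IDs, and PC1 values are all made up.

```python
def split_along_cline(pc1_by_sample, n_groups=4, prefix="ANI"):
    """Rank samples by PC1 and cut the ranking into n_groups bins.

    Assumes larger PC1 means more ANI ancestry; flip the sign of the
    coordinates first if the PCA came out oriented the other way.
    """
    ranked = sorted(pc1_by_sample, key=pc1_by_sample.get, reverse=True)
    labels = {}
    for rank, sample in enumerate(ranked):
        # Integer arithmetic maps ranks 0..n-1 onto groups 1..n_groups.
        group = rank * n_groups // len(ranked) + 1
        labels[sample] = f"{prefix}_{group}"
    return labels


# Hypothetical PC1 coordinates for eight samples.
pc1 = {
    "S1": 0.031, "S2": 0.012, "S3": -0.004, "S4": -0.027,
    "S5": 0.025, "S6": 0.008, "S7": -0.010, "S8": -0.035,
}
groups = split_along_cline(pc1)
for sample in sorted(groups):
    print(sample, groups[sample])
```

Equal-sized bins are the simplest choice; cutting at visible gaps along the cline would be just as defensible if the samples form discrete clusters.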
Using Treemix produced some weird results. As you can see above Punjabi_ANI_1 looks like an Iranian population with gene flow from Punjabi_ANI_3. Punjabi_ANI_2 looks like a North Indian population with Iranian gene flow (so it is more ASI). Punjabi_ANI_3 are less ANI shifted than Uttar Pradesh Brahmins, but more than Uttar Pradesh Kshatriya. Finally, Punjabi_ANI_4 actually is very similar to Punjabi_ANI_2, except it has gene flow from a Dalit-like population.
I don’t know what’s going on here. Is this really caste-like structure in Punjab? Or are we seeing lots of admixture among people who are called “Punjabi” today? For example, the gene flow edges suggest lots of mixing between quite South Asian types of groups and an Iranian sort. Perhaps this is the absorption of Pathans into South Asian groups? Could it be Muhajir people who mixed with local Punjabis and identified as such?
I was curious to see if I could find something similar in relation to the three Jatts. As you can see with Treemix, no. Jatts are just very ANI-shifted. I added Lithuanians and Georgians, and you can see that Uttar Pradesh Brahmins get gene flow from a Lithuanian shifted group, while South Indian Brahmins have a more Georgian gene flow. This is just an artifact I suspect of the fact that South Indian Brahmins have a lot of admixture from non-Brahmin South Indians, who are more Georgian than Lithuanian (Iran_N as opposed to Yamnaya).
Finally, going back to the Bengali (Bangladeshi) vs. Punjabi contrast, it is really interesting. If Punjab has such deep caste-like structures it really goes to show how within South Asia caste is a very very powerful institution, and ~1,000 years of Muslim rule and in western Punjab a majority Muslim population did not break down the institution. In contrast, in Bangladesh, there doesn’t seem to be much caste structure. I am routinely the most East Asian shifted Bengali in datasets, but my family is also from the eastern edge of eastern Bengal. Why the difference?
In The Rise of Islam and the Bengal Frontier the author posits that the Islamicization of eastern Bengal was to a great extent a function of the opening up of lands for cultivation under the supervision of Muslim elites under the rule of Afghans and later Mughals. This would explain the lack of caste structure because presumably, caste structure would be difficult to maintain in a frontier landscape, where the cultural elite does not promote or accept caste (though the elite West Asian Muslims were racially exclusive, they were also a very small minority).
In contrast, the Punjab has long been settled by Indo-Aryan peoples, and despite its long history of Islam, it was not recently a frontier society.
Anyway, that’s all I got to say for that. I’m sure readers will have more insight on this pattern than I do….
The geographic structure of these population transformations gave rise to population structure of present-day Europe. For example Anatolian Neolithic ancestry is highest in southern European populations like Sardinians, and lowest in northern European populations (38). Steppe ancestry is at high frequency in north-central Europeans and low in the south. Isolation-by-distance may have contributed to these patterns to some extent, but the contribution must have been small. In much of Europe, extreme population discontinuity was the norm.
Basically, they are contrasting pulse admixtures with continuous gene flow. One stylized model of the settling of the world after the “Out of Africa” migration is that most of the extant population structure was established by about 20,000 years ago, and much of what has occurred since then has been divergence due to barriers to gene flow, as well as homogenization due to continuous gene flow.
Ancient DNA has basically overthrown that model. There is just too much turnover in some parts of the world in rapid succession for variation to have been patterned exclusively by continuous gene flow. On the other hand, some researchers have felt that pulse admixture is a little overemphasized in the current narrative, in part because it’s a good simplifying model for explaining the origins of daughter populations with roots in two or more parental groups (e.g., model-based clustering and Treemix both assume pulse admixture). That doesn’t mean that this is a correct description of reality, just that it is a tractable one. This sort of concern motivated papers such as A Spatial Framework for Understanding Population Structure and Admixture.
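The distinction between the two models can be made concrete with a toy calculation. The sketch below is a textbook-style simplification, not the model used in any of the papers mentioned: a recipient population either receives a single pulse of source ancestry, or replaces a small constant fraction of itself with migrants every generation. The function names and parameter values are hypothetical.

```python
def ancestry_after_pulse(pulse_fraction, generations):
    """Source ancestry per generation after one admixture pulse.

    Generation 0 is before the pulse; afterwards the fraction stays
    constant because no further gene flow is assumed.
    """
    return [0.0] + [pulse_fraction] * generations


def ancestry_under_continuous_flow(rate, generations):
    """Source ancestry per generation when a fraction `rate` of the
    population is replaced by migrants from the source each generation."""
    ancestry = [0.0]
    for _ in range(generations):
        previous = ancestry[-1]
        # Migrants add source ancestry to the non-source remainder.
        ancestry.append(previous + rate * (1.0 - previous))
    return ancestry


generations = 100
pulse = ancestry_after_pulse(0.30, generations)
continuous = ancestry_under_continuous_flow(0.0036, generations)

for g in (1, 10, 50, 100):
    print(f"generation {g}: pulse={pulse[g]:.3f} continuous={continuous[g]:.3f}")
```

With these made-up parameters the two scenarios end at similar ancestry proportions while following very different trajectories, which is one way to see why a pulse can be a tractable stand-in even when the underlying process was more gradual.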
Of course, the “conflict” between people who accept pulse admixture and those who accept continuous gene flow is not a conflict at all. Really it is simply people as a whole attempting to get a better sense of how frequent pulse admixtures are in the context of a demographic landscape of continuous gene flow. This isn’t the 1970s when selectionists and neutralists argued over small crumbs of data. There’s enough data to test a lot of alternatives and slowly but surely converge upon a consensus.
Which brings me to the question: are these dynamics relevant outside of humans? It strikes me that for plants and other sessile organisms we’d assume that continuous gene flow dominates. At the other extreme, you have birds…who are so mobile that I believe continuous gene flow dominates there as well. In contrast, land-based tetrapods are much more mobile than plants, but often stymied by temporary barriers such as rivers or rising sea levels. So there would be more pulse admixtures, because continuous gene flow would be interrupted, and then perhaps the barrier would disappear, in which case rapid admixture would occur.
Humans are a curious case because I believe one reason that pulse admixture might be more prevalent is that we create our own barrier: culture.
Mel Green co-taught a “history of genetics” course that I took as a first-year grad student at UC Davis. It was fitting because Mel Green was a living embodiment of the history of genetics. Mine was one of the last years that Mel co-taught that class, so I feel quite privileged.
Unlike some of my friends who have gone through Davis I only had a few conversations with Mel. But he gave us the wisdom of a life of learning and seeing genetics evolve as a discipline over the 20th century. It isn’t often that you talk to someone who could dismiss Charles Davenport because he had talked to the man and judged that he had a poor grasp of Mendelian theory!
Most everyone has a “Mel Green story.” So let me recount mine, though it doesn’t have to do with me as such. Mel lived 101 years, and was active in science by the 1940s. In our history of genetics course we had to give a presentation on a particular topic (mine was on polytene chromosomes). The student who was giving the presentation on Drosophila research was not a genetics student. I had assumed she would be a bit nervous because Mel was a renowned Drosophilist, and he was sitting right there listening to everything.
At some point she began to refer to a researcher, “M Green.” She went on about “M Green” and his work for about five minutes, at one point pausing to note that “M Green” even worked at Davis! At this point the co-instructor had to stop her and tell her that “M Green” was sitting in the room, right next to her. Because the research was published in the 1940s the student had assumed that this was from someone who could never have been alive in the present. But there it was, Mel Green was still with us, a witness to all that history that had come and gone.
One of the major issues that confuses people is that the distribution of a trait or gene is often only weakly correlated with overall phylogeny and the rest of the genome.
To give a strange but classic example, the MHC loci are subject to strong balancing selection. This means that novel alleles do not substitute and replace ancestral alleles. Substitution of this sort results in “lineage sorting,” so that when you look at chimpanzees and humans you can see many polymorphic loci where all humans carry one variant and all chimpanzees the other. In contrast at the MHC loci there is frequency-dependent selection for rare variants, so the normal cycling process does not occur. Humans and chimpanzees overlap quite a bit on MHC, and any given human may have a more similar profile to a given chimpanzee than another human.
There are 19,000 human genes. Out of 3 billion base pairs, only about 100 million are polymorphic on a worldwide scale (using some liberal definitions). There are lots of unique stories to tell here.
A new preprint, Inferring adaptive gene-flow in recent African history, illustrates how certain genes with functional significance may differ from the genome-wide background. The authors find that among the Fula (Fulani) people of West Africa there has been introgression of a Eurasian mutation that confers lactase persistence. The area of the genome around this gene is much more Eurasian than the rest of the genome. In contrast, the area around the Duffy allele is much less Eurasian. The variation in this locus is related to malaria resistance. Finally, in other African populations, they found gene flow of MHC variants.
None of this is entirely surprising, though the authors apply novel haplotype-based methods which should have wider utility.