Punjabi genetic variation in 1000 Genomes: Hindu caste in the Land of the Pure?

In the 1000 Genomes, there is a Punjabi dataset. Here is the description:

These cell lines and DNA samples were prepared from blood samples collected in Lahore, Pakistan. The samples are from a mix of parent-adult child trios and unrelated individuals who identified themselves and their parents as Punjabi.

A few years ago I did an analysis of the population structure in the 1000 Genomes dataset. In the Chinese data, there seemed to be some curious structure (there were two clusters of South Chinese). But the biggest issues predictably were in the South Asians. To give concrete examples, there were a few Brahmins in the Telugu data. A subset of Tamils and Telugus were highly ASI shifted. The Gujarati were highly heterogeneous, and one subcluster was almost certainly Patels (the samples were collected in Houston). The ASI shifted groups were almost certainly Scheduled Castes (Dalits) because I could see that they clustered with those samples from the Estonian Biocentre dataset.

There was something curious about the samples from Pakistan and Bangladesh. Aside from a small number of individuals, whose samples were collected at the same time judging by their IDs (these individuals cluster with Scheduled Castes), the Bangladeshi sample didn’t have much South Asian style structure. That is, there wasn’t a cline or lots of substructure within the ethnicity.

As noted by some commenters, the Punjabi samples were very different. Like the Gujarati samples, there was a huge variance along the ANI-ASI cline. To me, this was somewhat surprising. To make the 1000 Genomes more useful I used PCA to divide both Gujaratis and Punjabis into groups based on their position on the ANI-ASI cline, so that ANI_1 is the subpopulation with the most ANI and ANI_4 the least.
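A minimal sketch of this kind of binning, assuming equal-sized groups along a single principal component on which higher scores mean more ANI. The post does not say where the cutoffs were drawn, so the function name, the equal-size rule, and the example scores are all illustrative assumptions.

```python
import numpy as np


def assign_cline_groups(pc_scores, n_groups=4):
    """Split samples into near-equal groups by rank along one PCA axis.

    Assumption: a higher score means more ANI, so group 1 is the most
    ANI-shifted and group n_groups the least.
    """
    scores = np.asarray(pc_scores, dtype=float)
    # Rank samples from the highest score (rank 0) to the lowest.
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))
    # Map ranks onto group labels 1..n_groups.
    return ranks * n_groups // len(scores) + 1


# Hypothetical PC scores for eight samples (not real data).
example_scores = [0.09, 0.01, 0.05, -0.03, 0.07, -0.01, 0.03, -0.05]
example_groups = assign_cline_groups(example_scores)
```

Rank-based bins keep the group sizes balanced, but they can split a tight cluster in two. Cutting at visible gaps in the PCA plot is an equally reasonable choice.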

Using Treemix produced some weird results. As you can see above Punjabi_ANI_1 looks like an Iranian population with gene flow from Punjabi_ANI_3. Punjabi_ANI_2 looks like a North Indian population with Iranian gene flow (so it is more ASI). Punjabi_ANI_3 are less ANI shifted than Uttar Pradesh Brahmins, but more than Uttar Pradesh Kshatriya. Finally, Punjabi_ANI_4 actually is very similar to Punjabi_ANI_2, except it has gene flow from a Dalit-like population.

With the South Asian Genotype Project I have a few Punjabi samples. All of them are within Punjabi_ANI_1.

I don’t know what’s going on here. Is this really caste-like structure in Punjab? Or are we seeing lots of admixture of people who are called “Punjabi” today? For example, the gene flow edges suggest lots of mixing between quite South Asian types of groups and an Iranian sort. Perhaps this is the absorption of Pathans into South Asian groups? Could it be Muhajir people who mixed with local Punjabis and identified as such?

I was curious to see if I could find something similar in relation to the three Jatts. As you can see with Treemix, no. Jatts are just very ANI-shifted. I added Lithuanians and Georgians, and you can see that Uttar Pradesh Brahmins get gene flow from a Lithuanian shifted group, while South Indian Brahmins have a more Georgian gene flow. This is just an artifact I suspect of the fact that South Indian Brahmins have a lot of admixture from non-Brahmin South Indians, who are more Georgian than Lithuanian (Iran_N as opposed to Yamnaya).

Finally, going back to the Bengali (Bangladeshi) vs. Punjabi contrast, it is really interesting. If Punjab has such deep caste-like structures it really goes to show how within South Asia caste is a very very powerful institution, and ~1,000 years of Muslim rule and in western Punjab a majority Muslim population did not break down the institution. In contrast, in Bangladesh, there doesn’t seem to be much caste structure. I am routinely the most East Asian shifted Bengali in datasets, but my family is also from the eastern edge of eastern Bengal. Why the difference?

In The Rise of Islam and the Bengal Frontier the author posits that the Islamicization of eastern Bengal was to a great extent the function of the opening up of lands for cultivation under the supervision of Muslim elites under the rule of Afghans and later Mughals. This would explain the lack of caste structure because presumably, caste structure would be difficult to maintain in a frontier landscape, where the cultural elite does not promote or accept caste (though the elite West Asian Muslims were racially exclusive, they were also a very small minority).

In contrast, the Punjab has long been settled by Indo-Aryan peoples, and despite its long history of Islam, it was not recently a frontier society.

Anyway, that’s all I got to say for that. I’m sure readers will have more insight on this pattern than I do….

Continuous gene flow vs. pulse admixture

In the new preprint Ancient genomics: a new view into human prehistory and evolution the authors write:

The geographic structure of these population transformations gave rise to population structure of present-day Europe. For example Anatolian Neolithic ancestry is highest in southern European populations like Sardinians, and lowest in northern European populations (38). Steppe ancestry is at high frequency in north-central Europeans and low in the south. Isolation-by-distance may have contributed to these patterns to some extent, but the contribution must have been small. In much of Europe, extreme population discontinuity was the norm.

Basically, they are contrasting pulse admixtures with continuous gene flow. One stylized model of the settling of the world after the “Out of Africa” migration is that most of the extant population structure was established by ~20,000 years ago, and much of what has occurred since then has been divergence due to barriers to gene flow, as well as homogenization due to continuous gene flow.

Ancient DNA has basically overthrown that model. There is just too much turnover in some parts of the world in rapid succession for variation to have been patterned exclusively by continuous gene flow. On the other hand, some researchers have felt that pulse admixture is a little overemphasized in the current narrative, in part because it’s a good simplifying model for explaining the origins of daughter populations with roots in two or more parental groups (e.g., model-based clustering and Treemix both assume pulse admixture). That doesn’t mean that this is a correct description of reality, just that it is a tractable one. This sort of concern motivated papers such as A Spatial Framework for Understanding Population Structure and Admixture.
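To make the contrast concrete, here is a toy sketch, not taken from any of the papers above. It assumes a single recipient population, a one-time pulse of size `alpha`, and a constant per-generation migrant fraction `m`. The function names and parameter values are illustrative assumptions.

```python
def pulse_ancestry(alpha, generations):
    """Source ancestry per generation after one pulse of size alpha.

    Generation 0 is the moment before the pulse; no gene flow follows it.
    """
    return [0.0] + [alpha] * generations


def continuous_ancestry(m, generations):
    """Source ancestry when a fraction m of every generation is new migrants."""
    fractions = [0.0]
    for _ in range(generations):
        # Existing ancestry is diluted by (1 - m), then m of fresh migrants is added.
        fractions.append(fractions[-1] * (1 - m) + m)
    return fractions


# 1% migration per generation for 100 generations...
continuous = continuous_ancestry(0.01, 100)
# ...ends at the same proportion as a single pulse of this size.
matching_pulse = pulse_ancestry(continuous[-1], 100)
```

Both histories end at the same ancestry proportion (the closed form for the continuous case is 1 - (1 - m)^t), so a method that looks only at present-day proportions cannot tell them apart.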

Of course, the “conflict” between people who accept pulse admixture and those who accept continuous gene flow is not a conflict at all. Really it is simply people as a whole attempting to get a better sense of how frequent pulse admixtures are in the context of a demographic landscape of continuous gene flow. This isn’t the 1970s when selectionists and neutralists argued over small crumbs of data. There’s enough data to test a lot of alternatives and slowly but surely converge upon a consensus.

Which brings me to the question: are these dynamics relevant outside of humans? It strikes me that for plants and other sessile organisms we’d assume that continuous gene flow dominates. At the other extreme, you have birds…who are so mobile that I believe continuous gene flow dominates there as well. In contrast, land-based tetrapods are much more mobile than plants, but often stymied by temporary barriers such as rivers or rising sea levels. So there would be more pulse admixtures, because continuous gene flow would be interrupted, and then perhaps the barrier would disappear, in which case rapid admixture would occur.

Humans are a curious case because I believe one reason that pulse admixture might be more prevalent is that we create our own barrier. Culture.

Recollections of Mel Green

Mel Green co-taught a “history of genetics” course that I took as a first-year grad student at UC Davis. It was fitting because Mel Green was a living embodiment of the history of genetics. Mine was one of the last years that Mel co-taught that class, so I feel quite privileged.

Unlike some of my friends who have gone through Davis I only had a few conversations with Mel. But he gave us the wisdom of a life of learning and seeing genetics evolve as a discipline over the 20th century. It isn’t often that you talk to someone who could dismiss Charles Davenport because he had talked to the man and judged that he had a poor grasp of Mendelian theory!

Most everyone has a “Mel Green story.” So let me recount mine, though it doesn’t have to do with me as such. Mel lived 101 years, and was active in science by the 1940s. In our history of genetics course we had to give a presentation on a particular topic (mine was on polytene chromosomes). The student who was giving the presentation on Drosophila research was not a genetics student. I had assumed she would be a bit nervous because Mel was a renowned Drosophilist, and he was sitting right there listening to everything.

At some point she began to refer to a researcher, “M Green.” She went on about “M Green” and his work for about five minutes, at one point pausing to note that “M Green” even worked at Davis! At this point the co-instructor had to stop her and tell her that “M Green” was sitting in the room, right next to her. Because the research was published in the 1940s the student had assumed that this was from someone who could never have been alive in the present. But there it was, Mel Green was still with us, a witness to all that history that had come and gone.

Selection swimming against the genomic tide

One of the major issues that confuses people is that the distribution of a trait or gene is often only weakly correlated with overall phylogeny and the rest of the genome.

To give a strange but classic example, the MHC loci are subject to strong balancing selection. This means that novel alleles do not substitute and replace ancestral alleles. Substitution of this sort results in “lineage sorting,” so that when you look at chimpanzees and humans you can see many polymorphic loci where all humans carry one variant and all chimpanzees the other. In contrast at the MHC loci there is frequency-dependent selection for rare variants, so the normal cycling process does not occur. Humans and chimpanzees overlap quite a bit on MHC, and any given human may have a more similar profile to a given chimpanzee than another human.

There are 19,000 human genes. Of 3 billion base pairs, only ~100 million are polymorphic on a worldwide scale (using some liberal definitions). There are lots of unique stories to tell here.

A new preprint, Inferring adaptive gene-flow in recent African history, illustrates how certain genes with functional significance may differ from genome-wide background. The authors find that among the Fula (Fulani) people of West Africa there has been introgression from a Eurasian mutation that confers lactase persistence. The area of the genome around this gene is much more Eurasian than the rest of the genome. In contrast, the area around the Duffy allele is much less Eurasian. The variation in this locus is related to malaria resistance. Finally, in other African populations, they found gene flow of MHC variants.

None of this is entirely surprising, though the authors apply novel haplotype-based methods which should have wider utility.

The non-European ancestry of Afrikaners

A few years ago I got some South African genotypes. Some of the individuals were clearly African. A few mapped perfectly upon Northern Europeans. But many of the samples consistently were European but shifted toward non-European populations.

Based on the history of the assimilation of slaves into the European population of Cape Colony in the 18th century, my assumption is that these individuals are Afrikaners.

Recently I realized that Brenna Henn had released some more Khoisan samples, so I decided to look at this question of admixture again. The two Khoisan populations are the Nama and the Khomani. I removed those with lots of Bantu and European admixture and combined them together into one population.

Running unsupervised Admixture shows how distinct the South African whites are.

The average Utah white in this sample (this population is a mix of British, German, and Scandinavian in ancestry) is 99% European modal cluster, and 1% South Asian. The average for the white South Africans in this data set is 94% European modal cluster. The residual is 1% East Asian (Dai modal), 1% Khoisan, 1% non-Khoisan African, and 2% South Asian.

I ran Treemix a bunch of times, and every single plot came out like this when I ran it for three migrations:


The gene flow from the Utah whites to the Gujaratis is simply an artifact of the fact that the Gujarati sample is mixed caste, and some of the Brahmins or Lohanas have more “Ancestral North Indian.” The gene flow from the Europeans to the Khoisan is probably real, or might be due to pastoralist admixture via East Africans. The last migration arrow goes from the African populations to the South African whites, with a shift toward the Khoisan.

I also ran three-population tests of the form f3(A; B, C), where A is the target population and B and C are the two reference populations. A significantly negative f3-statistic indicates admixture in population A, between populations related to B and C. The negative values are listed below:

A             B             C          f3            f3-error     Z-score
Gujrati       Dai           UtahWhite  -0.00121718   0.000140141  -8.68539
South_Africa  EsanNigeria   UtahWhite  -0.00127718   0.000147982  -8.63059
South_Africa  Khoisan_SA    UtahWhite  -0.0012928    0.000151416  -8.53802
Gujrati       South_Africa  Dai        -0.000778791  0.000155656  -5.00329
South_Africa  Dai           UtahWhite  -0.000541974  0.000133262  -4.06699
South_Africa  UtahWhite     Gujrati    -0.000103581  8.46193e-05  -1.22408

This aligns well with the Admixture results. Afrikaners have both African and Asian ancestry.
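As a quick check on the table, the Z-score is just the f3 estimate divided by its standard error. The sketch below recomputes it from the reported values. The cutoff of Z < -3 is a commonly used convention that I am assuming here, not one stated above, and the function names are illustrative.

```python
# Rows copied from the table above: (A, B, C, f3, standard error, reported Z).
F3_ROWS = [
    ("Gujrati", "Dai", "UtahWhite", -0.00121718, 0.000140141, -8.68539),
    ("South_Africa", "EsanNigeria", "UtahWhite", -0.00127718, 0.000147982, -8.63059),
    ("South_Africa", "Khoisan_SA", "UtahWhite", -0.0012928, 0.000151416, -8.53802),
    ("Gujrati", "South_Africa", "Dai", -0.000778791, 0.000155656, -5.00329),
    ("South_Africa", "Dai", "UtahWhite", -0.000541974, 0.000133262, -4.06699),
    ("South_Africa", "UtahWhite", "Gujrati", -0.000103581, 8.46193e-05, -1.22408),
]


def z_score(f3, std_err):
    """Z-score of an f3 estimate: the estimate in units of its standard error."""
    return f3 / std_err


def significant_rows(rows, threshold=-3.0):
    """Keep rows whose recomputed Z-score falls below the (assumed) threshold."""
    return [row for row in rows if z_score(row[3], row[4]) < threshold]


for a, b, c, f3, err, reported in F3_ROWS:
    print(f"f3({a}; {b}, {c}): Z = {z_score(f3, err):.3f} (reported {reported})")
```

By that cutoff the first five rows count as evidence of admixture, while the last row (South_Africa; UtahWhite, Gujrati) is negative but not significantly so.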

In James Michener’s The Covenant one of the plot lines alludes to mixed ancestry in one of the Afrikaner families. The results above suggest that mixed ancestry is very common, and perhaps ubiquitous, in this population. True, there are some Afrikaners such as Hendrik Verwoerd who migrated to South Africa from the Netherlands in the past century or so, but these are uncommon to my knowledge.

After agriculture, before bronze


The above plot shows genetic distance/variation between highland and lowland populations in Papua New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).
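For intuition, here is a minimal sketch of Fst as "the share of variation that sits between groups," using expected heterozygosity at biallelic loci. It assumes two equally weighted populations and known allele frequencies, and it ignores sample-size corrections. It is not the estimator used in the paper, and the function name and example frequencies are illustrative.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Ratio-of-averages Fst across biallelic loci for two populations.

    freqs_a and freqs_b hold the frequency of the same allele at each locus.
    Fst = (H_T - H_S) / H_T, where H_S is the mean within-population
    heterozygosity and H_T the heterozygosity of the pooled population.
    """
    within = 0.0
    total = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        within += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        total += 2 * p_bar * (1 - p_bar)
    if total == 0:
        raise ValueError("no polymorphic loci to compare")
    return (total - within) / total


# Hypothetical frequencies at a single locus (not real data).
identical = fst_two_populations([0.5], [0.5])      # no differentiation
moderate = fst_two_populations([0.3], [0.7])       # some differentiation
fixed = fst_two_populations([0.0], [1.0])          # fixed difference
```

With frequencies of 0.3 and 0.7 the within-group heterozygosity is 0.42 against 0.50 for the pooled population, so 16% of the variation at that locus lies between the groups.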

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceed this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, we may be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has argued that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: violence has not decreased monotonically, but rather peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

Quantitative genomics, adaptation, and cognitive phenotypes

The human brain utilizes about 20% of the calories you take in per day. It’s a large and metabolically expensive organ. Because of this fact there are lots of evolutionary models which focus on the brain. In Catching Fire: How Cooking Made Us Human Richard Wrangham suggests that our need for calories to feed our brain is one reason we started to use fire to pre-digest our food. In The Mating Mind Geoffrey Miller seems to suggest that all the things our big complex brain does allows for a signaling of mutational load. And in Grooming, Gossip, and the Evolution of Language Robin Dunbar suggests that it’s social complexity which is driving our encephalization.

These are all theories. Interesting hypotheses and models. But how do we test them? A new preprint on bioRxiv is useful because it shows how cutting-edge methods from evolutionary genomics can be used to explore questions relating to cognitive neuroscience and psychopathology, Polygenic selection underlies evolution of human brain structure and behavioral traits:

…Leveraging publicly available data of unprecedented sample size, we studied twenty-five traits (i.e., ten neuropsychiatric disorders, three personality traits, total intracranial volume, seven subcortical brain structure volume traits, and four complex traits without neuropsychiatric associations) for evidence of several different signatures of selection over a range of evolutionary time scales. Consistent with the largely polygenic architecture of neuropsychiatric traits, we found no enrichment of trait-associated single-nucleotide polymorphisms (SNPs) in regions of the genome that underwent classical selective sweeps (i.e., events which would have driven selected alleles to near fixation). However, we discovered that SNPs associated with some, but not all, behaviors and brain structure volumes are enriched in genomic regions under selection since divergence from Neanderthals ~600,000 years ago, and show further evidence for signatures of ancient and recent polygenic adaptation. Individual subcortical brain structure volumes demonstrate genome-wide evidence in support of a mosaic theory of brain evolution while total intracranial volume and height appear to share evolutionary constraints consistent with concerted evolution…our results suggest that alleles associated with neuropsychiatric, behavioral, and brain volume phenotypes have experienced both ancient and recent polygenic adaptation in human evolution, acting through neurodevelopmental and immune-mediated pathways.

The preprint takes a kitchen-sink approach, throwing a lot of methods of selection at the phenotype of interest. Also, there is always the issue of cryptic population structure generating false positive associations, but they try to address it in the preprint. I am somewhat confused by this passage though:

Paleobiological evidence indicates that the size of the human skull has expanded massively over the last 200,000 years, likely mirroring increases in brain size.

From what I know human cranial sizes leveled off in growth ~200,000 years ago, peaked ~30,000 years ago, and have declined ever since then. That being said, they find signatures of selection around genes associated with ‘intracranial volume.’

There are loads of results using different methods in the paper, but I was curious to note that schizophrenia had hits for ancient and recent adaptation. A friend who is a psychologist pointed out to me that when you look within families “unaffected” siblings of schizophrenics often exhibit deviation from the norm in various ways too; so even if they are not impacted by the disease, they are somewhere along a spectrum of ‘wild type’ to schizophrenic. In any case in this paper they found recent selection for alleles ‘protective’ of schizophrenia.

There are lots of theories one could spin out of that singular result. But I’ll just leave you with the fact that when you have a quantitative trait with lots of heritable variation it seems unlikely it’s been subject to a long period of unidirectional selection. Various forms of balancing selection seem to be at work here, and we’re only in the early stages of understanding what’s going on. Genuine comprehension will require:

– attention to population genetic theory
– large genomic data sets from a wide array of populations
– novel methods developed by population genomicists
– and functional insights which neuroscientists can bring to the table

South Asian gene flow into Burmese and Malays?

I happen to have a data set merged from the 1000 Genomes and Estonian Biocentre which has Malays, Burmans, and other assorted Southeast Asians, East Asians, and South Asians. In light of recent posts I thought I would throw out something in relation to this data set (you can download the data here). Above you can see the populations in the data. You see Bangladeshis consistently are shifted toward Southeast Asians in comparison to other South Asians. But both Burmans and Malays exhibit some shift toward South Asians.

I ran ADMIXTURE at K = 4. Click the image for the larger file which shows the populations, but I will tell you what’s going on.

The yellow to green represent a north-south axis in East Asia. The Han sample is mostly yellow, but there is a green component in varying degrees. This almost certainly represents heterogeneity in the Han sample of north to south Chinese. The green component is nearly ~100% in some individuals from indigenous tribes in Borneo, and balanced with the yellow among peninsular Malays. It is at a higher frequency in Cambodia than in Vietnam or Burma, indicating the older roots of Khmers and their relative insulation from later migrations of Sino-Tibetan and Tai peoples.

The red South Asian component is found in many Southeast Asians, but curiously, in the Burmans and Malays there is a lot of variation within the population. That indicates admixture over time that has not homogenized throughout the population.

I ran Treemix with 5 migration edges and French rooted (1000 SNP blocks out of 225,000 SNPs) and they all looked like this. Commentary I will leave to readers….
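For readers who want to try a comparable run, the sketch below only assembles a Treemix command line with the settings described above: 5 migration edges, blocks of 1000 SNPs, and French as the root. The input path and output stem are hypothetical placeholders. The flags `-i`, `-o`, `-m`, `-k`, and `-root` are the ones I understand the Treemix manual to document, but check them against your installed version.

```python
import shlex


def treemix_command(input_path, out_stem, migrations=5, block_size=1000, root="French"):
    """Build (but do not run) a Treemix invocation as an argument list."""
    return [
        "treemix",
        "-i", input_path,        # gzipped allele-count file (hypothetical path)
        "-o", out_stem,          # prefix for output files (hypothetical name)
        "-m", str(migrations),   # number of migration edges
        "-k", str(block_size),   # SNPs per block for standard errors
        "-root", root,           # population used to root the tree
    ]


command = treemix_command("merged_asia.treemix.gz", "asia_m5")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it, provided Treemix is installed and the input file exists.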

Genetics books for the masses!

Since I’ve become professionally immersed in genetics I haven’t read many books on the topic. I read papers. And I do genetics. But back in the day I did enjoy a good book. The standard recommendation would be to read Matt Ridley’s Genome. It’s a bit dated now (it was published around when the Human Genome Project was being completed), but I’d still recommend it.

But when in the mid-2000s I dabbled a little bit in the world of worm (C. elegans) genetics I read Andrew Brown’s In the Beginning Was the Worm: Finding the Secrets of Life in a Tiny Hermaphrodite. It’s pretty far from my current concerns and fixations, with more of a focus on developmental processes, but it is pretty cool to read about the race to “map” every cell in C. elegans.

The second book I’d recommend readers of this blog is the late Will Provine’s The Origins of Theoretical Population Genetics. Modern population genomics is a massive edifice built atop the foundations of the early 20th century fusion of Mendelism and the biometrical heirs of Darwin. Provine outlines how primitive genetics eventually seeded the birth of the Neo-Darwinian Synthesis.