Pakistani British are very much like Indians genetically

I talked to Joe Henrich this week for The Insight about his book, The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous (the episode is going live next week). Obviously much of the discussion hinged on relatedness, kinship, and how they shaped the arc of history (we also talked about other issues, such as the status of the “big gods” debate, so most definitely tune in!).

So I was very curious when I saw a new preprint, Fine-scale population structure and demographic history of British Pakistanis:

Previous genetic and public health research in the Pakistani population has focused on the role of consanguinity in increasing recessive disease risk, but little is known about its recent population history or the effects of endogamy. Here, we investigate fine-scale population structure, history and consanguinity patterns using genetic and questionnaire data from >4,000 British Pakistani individuals, mostly with roots in Azad Kashmir and Punjab. We reveal strong recent population structure driven by the biraderi social stratification system. We find that all subgroups have had low effective population sizes (Ne) over the last 50 generations, with some showing a decrease in Ne 15-20 generations ago that has resulted in extensive identity-by-descent sharing and increased homozygosity. Using new theory, we show that the footprint of regions of homozygosity in the two largest subgroups is about twice that expected naively based on the self-reported consanguinity rates and the inferred historical Ne trajectory. These results demonstrate the impact of the cultural practices of endogamy and consanguinity on population structure and genomic diversity in British Pakistanis, and have important implications for medical genetic studies.

None of this is entirely surprising. The media in the UK has written about recessive disease load because of cousin-marriage amongst Pakistani Britons. But there are also things in the preprint that need to be made explicit. The “biraderi” social system is apparently a paternal lineage system in the northwest of the Indian subcontinent which transcends religion (i.e., it is present across the border in Indian Punjab). These are “tribal” or “clan” societies in a way that is not present across much of the Indian subcontinent. For example, my family is from eastern Bengal. Before the partition between India and Pakistan, the far northwest and northeast of the subcontinent had the highest proportions of Muslims. But that did not mean that the two regions were culturally very similar, explaining in part the war in 1971 that resulted in Bangladesh. In Bangladesh, biraderi is not known, and the rates of cousin-marriage are much lower than in Pakistan.

One of the things I immediately noticed in the 1000 Genomes data is that Bangladeshis exhibit a lot less structure and stratification than Indians and the samples from Pakistani Punjab. In many ways, the patterns in the Bangladeshi genomes resemble the type of patterns in non-South Asian genomes: an outbreeding population without much internal structure.

This is not typical in South Asia. Rather, Indian populations tend to have lots of differences between jati/caste groups due to endogamy. To my surprise, Pakistani samples from Lahore were similar, though I attributed some of that to the migration of people from India after 1947 (a similar pattern does not hold for Bangladesh, as only a small number of people migrated from India). Additionally, the runs of homozygosity among Pakistani populations indicated lots of consanguineous marriages. While some South Indians marry cousins, the practice is very rare among North Indian Hindus. Rather, the genetic homogeneity of North Indian Hindus is due to the very high endogamy rates. They do not marry outside of their caste.

The results from the British Pakistanis are roughly in line with the 1000 Genomes Pakistanis, but in this case, the researchers had much more granular ethnic data, as well as information on whether individuals were or were not the product of cousin-marriages. In terms of worldwide population affinity, there isn’t a great surprise. The Pathans, who are Iranian speaking, were distinct. The groups with putative Arab ancestry (Syeds) did not seem to have much of that (really, any).

The figure above shows the long-term effective population size patterns. Within the preprint the authors note that these northwest Indian populations began to diverge ~2,000 years ago. That is roughly in line with what Moorjani et al. found for their Indian samples. This tells us that these Pakistani populations were part of the same cultural milieu as Hindu populations in India itself, whose caste endogamy did not seem to crystallize until about that time. This also seems to run against the thesis presented by some Pakistani nationalists that the northwestern populations were very distinctive “non-Hindu” mlecchas. Al-Biruni and earlier observers identified caste as distinctively Indian, and the likelihood of population structure emerging at the same period in the northwest indicates that these people are broadly part of that milieu.

But I want to focus on the more recent period. Using various methods the authors estimate that the effective population sizes of many of these groups dropped 10-20 generations ago. If you assume 10 generations with generation times of 15 years, that’s 150 years. If you assume 20 generations with generation times of 25 years, that gives you 500 years. So let’s take that as our interval. What’s going on here? I think what this may illustrate is the spread of Muslim practices among Islamicized peoples of the northwest.
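The bracketing arithmetic above can be written out as a quick sketch. The generation counts and generation times are the ones used in the text; the function name is only illustrative.

```python
def years_ago(generations, years_per_generation):
    """Convert a number of generations into calendar years."""
    return generations * years_per_generation


# Lower bound: fewest generations, shortest generation time.
lower = years_ago(10, 15)
# Upper bound: most generations, longest generation time.
upper = years_ago(20, 25)

print(f"Ne decline dated to roughly {lower}-{upper} years ago")
```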

In my podcast with Henrich he mentions that Islamic societies are peculiar in their ubiquitous practice of “parallel-cousin-marriage.” This means that brothers will marry their children off to each other (a contrast with “cross-cousin-marriage”, common in South India, where brothers and sisters marry their children to each other). The ubiquity of cousin-marriage among Pakistani Muslims is a contrast with genetically and culturally similar populations across the border in India (Indian Punjabis do not marry cousins if Sikh or Hindu).

The fact that this practice occurred among an endogamous group for many generations has consequences. The figure to the right illustrates just how homogeneous some of these groups are against a generic European reference population, and that even nominally unrelated individuals from the same biraderi group are often quite related. As you can see, even people whose parents are unrelated still exhibit excess runs of homozygosity. This is simply a function of pedigrees being narrow: just as in Indian castes, these individuals share many not-so-recent ancestors.
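One way to get a feel for why narrow pedigrees matter is the textbook expectation for drift-driven inbreeding in an idealized population of constant effective size, F_t = 1 - (1 - 1/(2Ne))^t. The sketch below compares that background value with the 1/16 expected for the offspring of first cousins. The Ne values are made-up round numbers for illustration, not estimates from the preprint, and the model ignores the changing Ne trajectories the authors actually infer.

```python
def background_inbreeding(ne, generations):
    """Expected inbreeding from drift alone in an idealized population
    of constant effective size `ne` after `generations` generations."""
    return 1.0 - (1.0 - 1.0 / (2.0 * ne)) ** generations


# Expected inbreeding coefficient for the child of first cousins
# in an otherwise outbred pedigree.
FIRST_COUSIN_F = 1.0 / 16.0

for ne in (100, 500, 5000):  # hypothetical effective sizes
    f = background_inbreeding(ne, 20)
    print(f"Ne={ne:>5}: background F after 20 generations = {f:.4f} "
          f"(first-cousin offspring: {FIRST_COUSIN_F:.4f})")
```

With a small enough Ne, the background term alone can rival a cousin marriage, which is the sense in which "unrelated" parents from the same group can still produce long runs of homozygosity.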

A positive note is that this high level of inbreeding does not apply to Pakistani Britons where both parents were born in the country. That means that biraderi dynamics are maintained due to continuous migration from Pakistan. They’re not perpetuating themselves in the UK.

I started this post with Joe Henrich for a reason: if Henrich is correct that the differences in social structure and relatedness matter for development and economics, then Pakistan and Bangladesh might have different trajectories. Bangladesh is a corrupt and familialist society, just like Pakistan. But, that familialism is not as robust and articulated as is the norm in Pakistan. A transition to a more high-trust and non-familial society is more viable and an easier lift for a non-tribal culture where clans do not extend much beyond first cousins.

20th century genetics as basic science and 21st century genetics as basic and applied

There was an offhand comment on Twitter that in the 1970s genetics was barely a field because we’ve made so much progress since then. For obvious reasons, many scientists took umbrage at this. I think it’s wrong and gives the lay public the incorrect impression. But, the reality is that I do think that the way the media and some geneticists have presented the development of the field since the understanding of DNA as the substrate of inheritance in the 1950s and the explosion of genomics in the 2000s has fed into this misimpression.

What’s the truth? Genetics predates genomics by a century or more, and DNA by decades. The basics of the field were elucidated by Gregor Mendel in the 1860s. He originated the “laws of inheritance”, though unfortunately his work was ignored by contemporaries. By “laws of inheritance” I mean that Mendel formulated an analytic model that allowed for discrete inheritance and predictions of the outcome of that inheritance. Naive human understanding of heritability usually relies on an intuitive “blended theory”. It works, after a fashion, but it does not explain many patterns we see around us (e.g., recessive expression).

Charles Darwin famously relied upon blended inheritance (in part) as a basis for the heritability which was essential to his theory of natural selection. But, a major problem with blended inheritance is that blending removes variation as everyone becomes a similar “mix”. This is not an issue with Mendelian inheritance, which is discrete. Alleles do not “mix”, but reconfigure every generation. Variation is retained. The “math” of evolution “works” in this manner.
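The contrast is easy to see in a toy simulation. This is a minimal sketch under assumptions of my own choosing: random mating, one additive locus with two alleles, a population of 2,000, and ten generations. Under blending each child is the average of its parents; under Mendelian inheritance each child draws one allele from each parent.

```python
import random
import statistics


def blending_generation(pop, rng):
    """Each child's trait value is the average of two random parents."""
    return [(rng.choice(pop) + rng.choice(pop)) / 2 for _ in pop]


def mendelian_generation(pop, rng):
    """Each child inherits one random allele from each of two random parents."""
    return [(rng.choice(rng.choice(pop)), rng.choice(rng.choice(pop)))
            for _ in pop]


def simulate(n=2000, generations=10, seed=1):
    rng = random.Random(seed)
    # Diploid genotypes: pairs of alleles coded 0 or 1, starting at p = 0.5.
    mendel = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)]
    # The blending population starts from the same trait values.
    blend = [float(a + b) for a, b in mendel]
    history = []
    for _ in range(generations + 1):
        history.append((statistics.pvariance(blend),
                        statistics.pvariance([a + b for a, b in mendel])))
        blend = blending_generation(blend, rng)
        mendel = mendelian_generation(mendel, rng)
    return history


history = simulate()
for gen, (v_blend, v_mendel) in enumerate(history):
    print(f"gen {gen:>2}: blending var = {v_blend:.5f}, "
          f"Mendelian var = {v_mendel:.5f}")
```

The blending variance roughly halves every generation, while the Mendelian variance hovers around its starting value apart from sampling noise.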

The utility of Mendelian genetics is why the field exploded in the first two decades of the 20th century. Read A. H. Sturtevant’s 1913 paper on the first genetic map. I think it gives you a flavor of the rate of advancement. Genetics was definitely a field. In The Origins of Theoretical Population Genetics Will Provine outlines how this particular field of genetics developed between 1920 and 1940 to become the core of evolutionary biology. Again, this suggests that even before DNA, genetics was an important field.

But, I do think it is fair to say that before 1950 genetics was very much “basic” science, and remained mostly so to the last decades of the 20th century.* DNA was interesting because it opened up the molecular biology revolution, but that had a very long fuse in terms of applications. PCR made it easier to do DNA testing, while new computing technologies made it much easier to generate and analyze data.

No one needs to be told about how genomics revolutionized genetics. But its major impact has been transforming an often theoretical field into a massively empirical one. Modern genomics is still underpinned by the logic of Mendelian genetics.

* The main exception here I’m going to make is for agricultural genetics, but much of this work doesn’t need “genes” as such.

The great Chinese genetic database

China Is Collecting DNA From Tens of Millions of Men and Boys, Using U.S. Equipment:

The police in China are collecting blood samples from men and boys from across the country to build a genetic map of its roughly 700 million males, giving the authorities a powerful new tool for their emerging high-tech surveillance state.

They have swept across the country since late 2017 to collect enough samples to build a vast DNA database, according to a new study published on Wednesday by the Australian Strategic Policy Institute, a research organization, based on documents also reviewed by The New York Times. With this database, the authorities would be able to track down a man’s male relatives using only that man’s blood, saliva or other genetic material.

An American company, Thermo Fisher, is helping: The Massachusetts company has sold testing kits to the Chinese police tailored to their specifications. American lawmakers have criticized Thermo Fisher for selling equipment to the Chinese authorities, but the company has defended its business.

I don’t have much to say, though you should read the piece. This is a vision of a particular future. I am obviously concerned, but I think I have to frankly “grade on a curve” here. The Chinese state already has an incredible amount of power and control over its citizens. The genetic angle is much more of a movement on the margins than a qualitative change in anything. Genes are not magic, but phone tracking is.

As for the involvement of American companies, I don’t know what to say. Have you stopped buying Chinese products?

The Facts About Elizabeth Warren’s DNA test

With Warren dropping out of the race for the Democratic nomination a lot of people on podcasts I listen to are making fun of her DNA test. Unfortunately, there are some falsehoods being promoted. It’s kind of scary for me because this is a field I know well, and it’s disturbing to watch falsehoods becoming accepted truths because people repeat them over and over again.

– First, the DNA test was not done through 23andMe, etc., or any standard commercial service. Rather, it was done by the Bustamante group at Stanford. This group has a lot of experience with the genetics of indigenous peoples of the Americas, so that is presumably why they were approached.

– Second, Warren surely has more than the expected amount of ancestry (for a white American) derived from people who were resident in the New World prior to 1492. The Bustamante group used relatively stringent criteria that are not comparable in an apples-to-apples manner with the inferences of 23andMe.

I am not here addressing the issue of whether she is or isn’t a Cherokee, or descended from Cherokees. The tests can’t answer those questions for both scientific and socio-political reasons. I’m also not addressing whether she used her identification with that tribe in furthering her career.

My only point in putting this post up is that it gets really disturbing to see “pundits” repeating “facts” you know are totally wrong without any malice because the information ecosystem is such that false facts rapidly transmute into conventional wisdom. Basically, when you see this happening you start to disbelieve everything…

Here is an old post, Elizabeth Warren Carries Native American DNA – She’s Running!.

Note: I have a piece about personal genomics that should be in the print edition of National Review in early April. Update, it’s up.

Jon Snow and Daenerys Targaryen are genetically similar to full-siblings or mother and son

I’ve posted on this before. So I will post again just to reiterate something: in terms of genes, Jon Snow and Daenerys Targaryen are much closer to being full-siblings than they are to being aunt and nephew.

You get different numbers depending on how deeply you look at the pedigree of the two. But their relatedness is probably above 40% and below 50%. Others have confirmed:

The verbal reason without math and genealogies is simple: Daenerys Targaryen comes from an inbred lineage, and more importantly, two generations of brother-sister marriages. This means that Daenerys Targaryen and Rhaegar Targaryen were genetically much more similar than typical full-siblings. Because of this, Jon Snow and Daenerys Targaryen are much more genetically similar than typical aunts and nephews, because Jon Snow’s father was to a first approximation genetically a male and older version of Daenerys Targaryen.
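For those who do want the math, here is a minimal sketch of the calculation on a deliberately simplified pedigree: two unrelated founders, then two generations of brother-sister marriage, then one sibling having a child with an unrelated partner. The single-letter labels are placeholders (R and Y are the final sibling pair, J is R's child with the outsider L), founders are assumed unrelated and non-inbred, and anything deeper in the family tree is ignored. It uses the standard recursive kinship coefficient and Wright's coefficient of relationship, which is only one of several ways to define "relatedness".

```python
from functools import lru_cache
from math import sqrt

# child -> (parent, parent); founders map to None.
# Parents must be listed before their children.
PEDIGREE = {
    "A": None, "B": None,                  # unrelated founders
    "C": ("A", "B"), "D": ("A", "B"),      # siblings who marry
    "E": ("C", "D"), "F": ("C", "D"),      # siblings who marry
    "R": ("E", "F"), "Y": ("E", "F"),      # final sibling pair
    "L": None,                             # unrelated partner of R
    "J": ("R", "L"),                       # R's child, so Y is J's aunt
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that random alleles drawn from a and b are identical by descent."""
    if ORDER[a] < ORDER[b]:
        a, b = b, a  # make `a` the later-listed one, so it is never b's ancestor
    parents = PEDIGREE[a]
    if a == b:
        return 0.5 if parents is None else 0.5 * (1.0 + kinship(*parents))
    if parents is None:
        return 0.0  # a founder listed after b cannot share ancestry with b
    return 0.5 * (kinship(parents[0], b) + kinship(parents[1], b))


def inbreeding(x):
    """Inbreeding coefficient: kinship between x's parents."""
    return 2.0 * kinship(x, x) - 1.0


def relationship(a, b):
    """Wright's coefficient of relationship."""
    return 2.0 * kinship(a, b) / sqrt((1.0 + inbreeding(a)) * (1.0 + inbreeding(b)))


print(f"nephew J vs aunt Y : {relationship('J', 'Y'):.3f}")
print(f"siblings R vs Y    : {relationship('R', 'Y'):.3f}")
print("outbred references : full siblings 0.500, aunt-nephew 0.250")
```

On this toy pedigree the aunt-nephew value lands between 0.4 and 0.5, and the inbred sibling pair comes out well above the 0.5 expected for ordinary full siblings. A deeper or different pedigree would shift the exact numbers.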

Sydney Brenner: the passing of a giant

Nobel laureate Sydney Brenner, who helped place Singapore on biotech world stage, dies at 92. If you are in genetics and development you know who Brenner is, and what he meant to these fields. I happened to be in the room with Brenner once, in Berkeley in 2008 I believe. He was already quite an old curmudgeon, and I will say his comments were amusing and awkward!

Long-time readers of this weblog know that about fifteen years ago I dabbled in a little worm-work. At that time I read In the Beginning Was the Worm: Finding the Secrets of Life in a Tiny Hermaphrodite. As Brenner was involved in promoting C. elegans as a model he occupies a lot of this book. I recommend it. It’s short and packed with historical nuggets that make the 21st-century trajectory of science more comprehensible.

Swidden rice farming does not lead to high population density

Admixture on K = 5

I’ve been looking at the data from the recent Munda paper. Standard stuff: admixture, treemix, and f-statistics. The northern Munda samples were collected in Bangladesh. So I thought: I can test the hypothesis that the East Asian ancestry in Bangladesh is to a large part Santhal. After looking at it every which way, I think that in fact, the Munda may not have ever been very populous in much of northeast India. The Santhal is just not a good donor population to Bengalis, at least not when comparing mixes such as Dai + Tamil.
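To give a sense of what a donor-population test looks like, here is a bare-bones sketch of the f3 admixture statistic, f3(target; A, B), computed as the mean over SNPs of (c - a)(c - b) on allele frequencies. A clearly negative value suggests the target is a mix of populations related to A and B. The frequencies below are invented, the function is not the AdmixTools implementation, and it leaves out the finite-sample correction and the standard errors that a real analysis needs.

```python
def f3(target, source_a, source_b):
    """Naive f3(target; A, B): mean over SNPs of (c - a) * (c - b)."""
    products = [(c - a) * (c - b)
                for c, a, b in zip(target, source_a, source_b)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at five SNPs in two source populations.
source_a = [0.10, 0.80, 0.35, 0.60, 0.05]
source_b = [0.70, 0.20, 0.90, 0.15, 0.55]

# A target constructed as a 50/50 mix of the two sources.
mixed = [0.5 * a + 0.5 * b for a, b in zip(source_a, source_b)]
# A target that has merely drifted away from source A, with no input from B.
drifted = [0.20, 0.90, 0.25, 0.70, 0.02]

print(f"f3(mixed; A, B)   = {f3(mixed, source_a, source_b):+.4f}")
print(f"f3(drifted; A, B) = {f3(drifted, source_a, source_b):+.4f}")
```

The mixed target gives a negative value and the drifted one a positive value, which is the qualitative pattern one looks for when comparing candidate source pairs.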

Additionally, the Santhal are really not that well modeled by mixing South Asians with any particular Southeast Asian group, though it works. I think that’s suggestive of the possibility that the Austro-Asiatic group which gave rise to the Munda doesn’t exist in its current form anywhere in Southeast Asia. Additionally, the Lao samples that are provided in the new paper I think may have Indian ancestry via admixture from Austro-Asiatic Mon or Khmer groups.

Basically, there is so much bidirectional gene flow that I think it’s really hard to get a grip on what’s going on. Additionally, the Burmese and northeast Indian populations (e.g., the Mizos) clearly have a strand of ancestry that derives from relatively recent migrants that came down from the region of eastern Tibet, and perhaps Sichuan or even further north. And this component shows up in Bengalis as well.

On top of this, there is the “Australo-Melanesian” substrate that is present all across Southeast Asia, and probably was present in modern southern China in the early Holocene, which has distant affinities with the “Ancient Ancestral South Indians” (AASI).

At this point, I keep my own counsel. But there may be an interesting story to tell related to how efficient and effective different forms of agriculture were, and how that interplayed with genes and language.

Running AdmixTools through R – admixr

One of the reasons that I don’t post AdmixTools results too much is that the framework requires more statistical “deep thought” than just popping out a PCA or even running some model-based clustering. Read the methods supplements of one of the Reich lab’s papers, and you’ll see what I’m getting at. But a more prosaic reason is that I generally work in the plink format, and format conversion, as well as editing parameter files, is a pain. In general, I don’t do much “exploratory AdmixTools” stuff for a reason.

Martin Petr has made the second excuse a lot less of an excuse. His admixr package gives one an easy interface into AdmixTools. In particular, it allows one not to have to edit parameter files so much. It took me about 15 minutes to get it downloaded and running. I’m on a Mac and for R use RStudio.

– remember to install wget if you are on a Mac (this will show up if you want to use online datasets)

– You need to make sure to set the path to AdmixTools. In the RStudio console, I just entered:

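# Prepend the AdmixTools binaries to PATH instead of overwriting it, so tools like wget stay reachable
Sys.setenv(PATH = paste(path.expand("~/MyPath/To/AdmixTools/bin"), Sys.getenv("PATH"), sep = ":"))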

If you can get AdmixTools installed in the first place, admixr should be very easy.

Continuous gene flow vs. pulse admixture

In the new preprint Ancient genomics: a new view into human prehistory and evolution the authors write:

The geographic structure of these population transformations gave rise to population structure of present-day Europe. For example Anatolian Neolithic ancestry is highest in southern European populations like Sardinians, and lowest in northern European populations (38). Steppe ancestry is at high frequency in north-central Europeans and low in the south. Isolation-by-distance may have contributed to these patterns to some extent, but the contribution must have been small. In much of Europe, extreme population discontinuity was the norm.

Basically, they are contrasting pulse admixtures with continuous gene flow. One stylized model of the settling of the world after the “Out of Africa” migration is that most of the extant population structure was established by about 20,000 years ago, and much of what has occurred since then has been divergence due to barriers to gene flow, as well as homogenization due to continuous gene flow.

Ancient DNA has basically overthrown that model. There is just too much turnover in some parts of the world in rapid succession for variation to have been patterned exclusively by continuous gene flow. On the other hand, some researchers have felt that pulse admixture is a little overemphasized in the current narrative, in part because it’s a good simplifying model for explaining the origins of daughter populations with roots in two or more parental groups (e.g., model-based clustering and Treemix both assume pulse admixture). That doesn’t mean that this is a correct description of reality, just that it is a tractable one. This sort of concern motivated papers such as A Spatial Framework for Understanding Population Structure and Admixture.
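The difference between the two stylized models can be sketched with a toy calculation of how ancestry from a source population accumulates in a recipient. In the pulse case the whole fraction arrives in a single generation; in the continuous case a constant per-generation migration rate m gives 1 - (1 - m)^t after t generations. The target fraction (50%) and time span (100 generations) are arbitrary illustrative choices, and this deterministic sketch ignores drift, selection, and the ancestry-tract-length signals that real analyses lean on to tell the scenarios apart.

```python
def continuous_ancestry(m, generations):
    """Source ancestry fraction after `generations` of constant migration at rate m."""
    return 1.0 - (1.0 - m) ** generations


def rate_for_target(alpha, generations):
    """Per-generation migration rate that accumulates to `alpha` over `generations`."""
    return 1.0 - (1.0 - alpha) ** (1.0 / generations)


ALPHA = 0.5          # hypothetical final ancestry fraction
GENERATIONS = 100    # hypothetical time span
m = rate_for_target(ALPHA, GENERATIONS)

print(f"continuous migration rate needed: {m:.4%} per generation")
for t in (0, 1, 25, 50, 100):
    pulse = ALPHA if t >= 1 else 0.0      # everything arrives in generation 1
    gradual = continuous_ancestry(m, t)
    print(f"gen {t:>3}: pulse = {pulse:.3f}, continuous = {gradual:.3f}")
```

Both scenarios end at the same ancestry proportion, which is why present-day admixture fractions alone cannot separate them.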

Of course, the “conflict” between people who accept pulse admixture and those who accept continuous gene flow is not a conflict at all. Really it is simply people as a whole attempting to get a better sense of how frequent pulse admixtures are in the context of a demographic landscape of continuous gene flow. This isn’t the 1970s when selectionists and neutralists argued over small crumbs of data. There’s enough data to test a lot of alternatives and slowly but surely converge upon a consensus.

Which brings me to the question: are these dynamics relevant outside of humans? It strikes me that for plants and other sessile organisms we’d assume that continuous gene flow dominates. At the other extreme, you have birds…who are so mobile that I believe continuous gene flow dominates here as well. In contrast, land-based tetrapods are much more mobile than plants, but often stymied by temporary barriers such as rivers or rising sea levels. So there would be more pulse admixtures, because continuous gene flow would be interrupted, and then perhaps the barrier would disappear, in which case rapid admixture would occur.

Humans are a curious case because I believe one reason that pulse admixture might be more prevalent is that we create our own barrier. Culture.