Why my Substack posts are better and worse than ancestry calculators


Of all my Substack posts, Ashkenazi Jewish genetics: a match made in the Mediterranean has been the most popular of the paid posts. It prompted this response from a reader:

The issue here is that my Substack is doing something different from what personal genomics companies are trying to do. My Substack post gives a survey of a whole population and its history; a personal genomics test tries to give an individual estimate that is intelligible. When 23andMe or the other companies tell you that you are 99% “Ashkenazi Jewish” they are simply giving you confirmation that you’re within the range of variation typical for Ashkenazi Jews (there are some suspicions among genealogy enthusiasts that 23andMe smooths out differences between Galicianers and Litvaks, for example).

Imagine that 23andMe told its Jewish customers that they were 45.3% Northern Levantine, 40% Southwest European, and 9.7% Northern European. How would they interpret it? Sophisticated users would understand this points to a deep history of admixture, but most users are not sophisticated. They want to know that they’re Ashkenazi Jewish, and how Jewish they are (most will be nearly 100%, but some people may have non-Ashkenazi cryptic ancestry).

When I worked for Embark Vet, one of the issues the canine DNA test had was that we were looking for wolf ancestry in dogs, as some customers with F1s or backcrosses wanted to test their pooch. But it turned out that you had to be careful, because some northern Arctic dog breeds kept coming back with “wolf” ancestry at low fractions because they did have ancient wolf ancestry. But that’s not what the wolf test was designed to pick up.

Tibetans as the compound of two populations

A new paper looks at some ancient Tibetan genomes:

Present-day Tibetans have adapted both genetically and culturally to the high altitude environment of the Tibetan Plateau, but fundamental questions about their origins remain unanswered. Recent archaeological and genetic research suggests the presence of an early population on the Plateau within the past 40 thousand years, followed by the arrival of subsequent groups within the past 10 thousand years. Here, we obtain new genome-wide data for 33 ancient individuals from high elevation sites on the southern fringe of the Tibetan Plateau in Nepal, who we show are most closely related to present-day Tibetans. They derive most of their ancestry from groups related to Late Neolithic populations at the northeastern edge of the Tibetan Plateau but also harbor a minor genetic component from a distinct and deep Paleolithic Eurasian ancestry. In contrast to their Tibetan neighbors, present-day non-Tibetan Tibeto-Burman speakers living at mid-elevations along the southern and eastern margins of the Plateau form a genetic cline that reflects a distinct genetic history. Finally, a comparison between ancient and present-day highlanders confirms ongoing positive selection of high altitude adaptive alleles.

Y haplogroup D is found at high frequencies in Japan, Tibet, and the Andaman Islands. It strikes me this is evidence of a Paleolithic substrate, though the graph above shows that it diverged really deep in Eurasia, and in the text they say it’s only in Tibet.

The Tibetan ancestry seems to have been found in the Himalayan zone by 1450 BC, so rather early.

What Sahul tells us about world genetic history


The paper Papua New Guinean Genomes Reveal the Complex Settlement of North Sahul came out a few months ago. It’s fine. But one thing jumped out at me:

All estimated effective population size curves show a bottleneck around 60–70 ka, as found for European and Asian populations (supplementary fig. 16, Supplementary Material online). Around 50 ka, estimated effective population sizes for populations from Wallacea and Sahul are between 1,671 and 2,100 individuals, and between 2,540 and 2,999 individuals in Eurasia and 9,506 individuals in Africa. All divergence dates for Wallacea/Sahul groups from Africa, represented by Yoruba genomes, show similar results (74 ± 4.2 ka, supplementary fig. 17, Supplementary Material online). We note here that this date is older than that obtained between Eurasian and African genomes (66.5 ± 3.7 ka), a result previously reported and interpreted as a potential methodological bias or the signal of an even earlier human migration from Africa (Malaspinas et al. 2016; Mallick et al. 2016; Pagani et al. 2016; Bergstrom et al. 2017).

First, the bottleneck precedes the massive expansion after 60 ka. Second, perhaps there was Eurasian back-migration into Africa and that’s impacting the different coalescence dates? Basically, the Yoruba can be thought of as a population strongly influenced by “out-of-Africa” gene flow.

Pausing research on autism (for now)

High-profile autism genetics project paused amid backlash:

But soon after the study’s high-profile launch on 24 August, autistic people and some ASD researchers expressed concern that it had gone ahead without meaningfully consulting the autism community. Fears about the sharing of genetic data and an alleged failure to properly explain the benefits of the research have been raised by a group called Boycott Spectrum 10K, which is led by autistic people. The group plans to protest outside the ARC premises in Cambridge in October. A separate petition against the study gathered more than 5,000 signatures.

Damian Milton, a researcher in intellectual and developmental disabilities at the University of Kent in Canterbury, UK, is one of those who signed the Boycott Spectrum 10K petition. Milton has been diagnosed with Asperger’s syndrome, a form of ASD. He says it is not clear how the study will improve participants’ well-being, and its “aim seems to be more about collecting DNA samples and data sharing”.

As a result of the backlash, the Spectrum 10K team paused the study on 10 September, apologized for causing distress, and promised a deeper consultation with autistic people and their families.

I assume they’ll restart, but this sort of research will happen somewhere. Autism is a reasonably heritable trait, and many of the people with autism are not “high functioning.”

G allele at Rs10774671 protects against severe COVID-19

A new paper digs into OAS1, A prenylated dsRNA sensor protects against severe COVID-19:

Inherited genetic factors can influence the severity of COVID-19, but the molecular explanation underpinning a genetic association is often unclear. Intracellular antiviral defenses can inhibit the replication of viruses and reduce disease severity. To better understand the antiviral defenses relevant to COVID-19, we used interferon-stimulated gene (ISG) expression screening to reveal that OAS1, through RNase L, potently inhibits SARS-CoV-2. We show that a common splice-acceptor SNP (Rs10774671) governs whether people express prenylated OAS1 isoforms that are membrane-associated and sense specific regions of SARS-CoV-2 RNAs, or only express cytosolic, nonprenylated OAS1 that does not efficiently detect SARS-CoV-2. Importantly, in hospitalized patients, expression of prenylated OAS1 was associated with protection from severe COVID-19, suggesting this antiviral defense is a major component of a protective antiviral response.

You can find the SNP in your 23andMe raw data (unless you are on the recent chip; I looked for a tag variant but found none). If I’m reading the paper correctly, having the AA genotype increases your risk of severe COVID-19 with an odds ratio of 1.58, all things equal. Not crazy bad, but not great either. The haplotype that carries the G allele in non-Africans seems to come from Neanderthals. In Africa, the ancestral G is the majority allele, though a minority of individuals carry A, and that was passed on to Eurasians.

Here is a plot for the 1000 Genomes populations.

One thing I immediately noticed is that Peruvians have the highest frequency of the A allele in the dataset. Peru has had the highest COVID-19 death rate in the world, and its frequency of A means that a great number of people will be AA (the frequency of A squared).

I looked in Anders Bergstrom’s HGDP whole-genome data and found an interesting pattern in the frequencies of the G allele:

Population   Freq     Count   2N
Karitiana    0        0       22
Pima         0        0       26
Surui        0        0       16
Yakut        0        0       50
Maya         0.0476   2       42
Oroqen       0.1111   2       18
Tujia        0.1111   2       18
Peruvian     0.1118   19      170
She          0.15     3       20
Cambodian    0.1667   3       18

Three of the four populations with no copies of the protective G allele are indigenous to the Americas. The Maya, who are known to have European admixture, also have a very low frequency of the G allele. Now, it is true that East Asians also have low frequencies of the G allele (the Yakuts also lack it, so perhaps this was ancestral to Siberians?), but they may have other protective variants (or suffered through an earlier coronavirus epidemic). I think OAS1 may turn out to be one of the loci associated with a higher risk of severe COVID-19 in the New World.
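
The “frequency of A squared” point can be made concrete with the counts in the table. This is only a rough sketch: it assumes Hardy-Weinberg proportions (random mating, which I have not checked for any of these samples), takes the A frequency as one minus the G frequency, and uses a function name that is mine for illustration.

```python
# Expected share of AA homozygotes under Hardy-Weinberg proportions.
# Each entry is (copies of G observed, chromosomes sampled = 2N), taken
# from the table of G-allele frequencies.
g_counts = {
    "Karitiana": (0, 22),
    "Maya": (2, 42),
    "Peruvian": (19, 170),
    "Cambodian": (3, 18),
}


def expected_aa(g_copies, chromosomes):
    """Return (freq_A, expected AA fraction), assuming random mating."""
    freq_a = 1 - g_copies / chromosomes
    return freq_a, freq_a ** 2


for pop, (g, two_n) in g_counts.items():
    freq_a, aa = expected_aa(g, two_n)
    print(f"{pop}: freq(A) = {freq_a:.3f}, expected AA = {aa:.3f}")
```

For the Peruvian sample this comes out to an A frequency of about 0.89 and an expected AA fraction of about 0.79, so under these assumptions roughly four in five people would carry the higher-risk genotype.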

Not all causes are treated equal

Over on Twitter the eminent population geneticist Molly Przeworski has an important and lauded thread up:

The thread has been widely re-tweeted and quote-tweeted by biologists. This prompted a response by a prominent sociologist, who quoted this from Kathryn Paige Harden’s discussion with Sam Harris:

What Harden is alluding to is that heritability within populations is not necessarily portable to differences between populations. In less sophisticated hands, this is almost used as an incantation. In my review of Harden’s book I said the following:

The biological reason that this extrapolation founders is that human populations differ, and those differences matter. The genetic architecture of intelligence may vary between populations so that predictions from the markers in one population are poorly predictive of variation in another, in line with the general concerns for GWAS portability…Harden points out correctly that population structure exhibits different layers of granularity and continuity. Perhaps a prediction trained on British samples is poorly predictive in Pakistanis. But what about Iranians? If it is poorly predictive in Iranians, what about in Bulgarians? The ability to infer within and between-group heritability is conditional on what you mean by “group,” and that is to some extent a subjective choice guided more by heuristics and instrumental utility than idealistic differences between races.

To be entirely frank I think Harden was on solid ground as a behavior geneticist with psychological training who relied on what population geneticists say publicly all the time about heritability and group differences. The issue is that I do not believe population geneticists were entirely candid about the deep texture of their assumptions, beliefs, and expectations. They wanted to be left alone to do their research, and so relied on a mantra to make people leave them alone, and now that mantra, taken so literally, is coming back to haunt them. One reason Przeworski’s thread got a lot of attention is that privately this is the sort of intuition that’s widely understood, but the issues are subtle, so with outsiders people just leave it at quick quips about portability. A friend told me “Molly doing this is like a goddess descending to Earth to speak to mere mortals so it will get a lot of attention.”

The real issue though is that some are now rather perturbed that Harden and behavior geneticists are trying to shield their study of psychological trait heritability from charges of racism by separating the discussion of between and within-group differences by implicitly reifying “population.” Additionally, some geneticists are quite unhappy at the discussion of heritability when it comes to psychological characteristics, so what was a convenient mantra to have people leave them alone is now coming back to haunt them, as it’s opening up avenues for research that they’re not comfortable with, are not interested in, and believe are possibly dangerous. To be candid, if I were Harden I’d be a bit peeved, since all she’s doing is repeating what a lot of authorities in the field have been writing and saying for decades.

Nevertheless, if you take a look at the people who re-tweeted and commented on Przeworski’s thread it’s pretty much everyone. The high and mighty, all the way to the low. It was positively re-tweeted by people who are very skeptical of the study of heritability in psychological characteristics in humans (to be charitable). And, it was positively re-tweeted by me. Since so many people liked it and re-tweeted it, I can tell you it was re-tweeted by people who are actually quite open to and interested in the study of psychological characteristics in humans, within and between groups, without divulging confidence (I checked who commented and re-tweeted and liked).

So what’s going on? Przeworski’s group has published several papers in this area (for example, The evolution of group differences in changing environments), and one of the upshots for many is that there’s a lot less certainty about the heritability of many traits and its utility for polygenic risk scores even within groups because of uncorrected confounds. Some people took from this that polygenic risk scores are useless (not necessarily Przeworski and her group!). But when I talked about these findings with Amit Khera, who works on polygenic risk scores relating to cardiovascular disease, he was actually happy about these results. Why? Because he wanted to correct any confounds there were. He viewed these results not as a death knell for polygenic risk scores, but as a way to make them better, more accurate, more precise. He’s a medical doctor who is trying to help people in their health decisions. All he cares about is greater effectiveness. He’s not invested in a particular result, he’s invested in outcomes (OK, at least ideally, but I talked to him and his enthusiasm seemed genuine).

This is almost certainly why people who think that polygenic risk scores are useful, and that heritability in psychological characteristics is real and varies widely in human populations, re-tweeted the Przeworski explainer. I myself did for this reason. My own current belief is there’s good evidence for heritability for a lot of behavioral traits, and that polygenic risk scores can be useful, at least on the margin. But we need to get better, and to do that, we need to explore all the subtle distinctions and details in relation to environmental and genetic variation. This is no guarantee. Perhaps the skeptics of polygenic risk scores will be correct (I doubt it, but who knows). But we’re not at the point where we can settle that question right now. More science needs to be done.

Finally, we need to address the magic of genes. People put a lot of stock in genes for various ideological reasons. But the reality is a lot of environmental factors taken for granted by many (e.g., shared home environment) are a lot less clear and well understood than genes are. And yet the skeptical takes don’t rain down on social science inferences and correlations. Mostly because they’re not seen as insidious because they’re environmental. But causes are causes. When there is a great deal of environmental variation in an outcome that doesn’t mean that you can control it, or you even know what it is. A lot of what is in the “E” in the ACE model is mysterious. Many focus on genes because they’re clear and distinct.

Men of the North

Region      I1     I2*/I2a   I2b   R1a    R1b    G     J2    J*/J1   E1b1b   T     Q     N
Russia      5      10.5      0     46     6      1     3     0       2.5     1.5   1.5   23
Lithuania   6      6         1     38     5      0     0     0       1       0.5   0.5   42
Latvia      6      1         1     40     12     0     0.5   0       0.5     0.5   0.5   38
Estonia     15     3         0.5   32     8      0     1     0       2.5     3.5   0.5   34
Finland     28     0         0.5   5      3.5    0     0     0       0.5     0     0     61.5
Sweden      37     1.5       3.5   16     21.5   1     2.5   0       3       0     2.5   7
Norway      31.5   0         4.5   25.5   32     1     0.5   0       1       0     1     2.5
Denmark     34     2         5.5   15     33     2.5   3     0       2.5     0     1     1

(Y chromosomal haplogroups)

A few weeks ago I saw the Y chromosomal haplogroup distribution in Finland and Sweden. I’d known about this disjunction for a while, but it really struck me. I got the numbers above from Eupedia, but you can find them elsewhere. Most of you probably know that Finland has a high fraction of N (they keep changing the nomenclature, so I’ll leave the number off). What’s curious to me is how low the fraction of N in the rest of Scandinavia is. Much of the N we see in Sweden may even be historical era migration of Finns into Sweden when the two nations were in political union (Finland was basically a Swedish colony). Another notable fact is that N is very common among Baltic people, whether Finnic in language (Estonian) or Indo-European (Latvian and Lithuanian).

Another strange thing is that while the Indo-European lineages of R1 are both at very low frequency in Finland, I1, which is common to the west in Scandinavia, is not. The latest ancient DNA makes it clear that Finnic languages seem to have arrived in the Baltic in the period between 1000 and 500 BC. Before then Corded Ware/Battle Axe people seem to have been dominant in the East Baltic. These people usually carried Y chromosome R1a.

The fact that N is so high in the Baltic nations shows that newcomers arrived, and in the northern region language shift happened, but in the south, it did not. Meanwhile, further north in Finland almost all the R1a lineages disappeared. Not so with I1. There are all sorts of tortured explanations for this pattern, so I won’t offer one.

Genome-wide the Finns aren’t that different. The largest proportion of their ancestry is still Yamnaya/steppe:

I only post this to illustrate how strong “male-mediated” dynamics can be. The proportion of Siberian ancestry in Finns is rather low, but > 50% of their Y chromosomes are N. I think it is plausible that one of the reasons for the massive reduction in R1 in Finland might be due to climate change and massive population collapse among the Battle Axe people of southern Finland, and the later arrival of Siberians.

Pakistani British are very much like Indians genetically

I talked to Joe Henrich this week for The Insight about his book, The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous (the episode is going live next week). Obviously much of the discussion hinged on relatedness, kinship, and how that impacted the arc of history (we also talked about other issues, such as the status of the “big gods” debate, so most definitely tune in!).

So I was very curious when I saw a new preprint, Fine-scale population structure and demographic history of British Pakistanis:

Previous genetic and public health research in the Pakistani population has focused on the role of consanguinity in increasing recessive disease risk, but little is known about its recent population history or the effects of endogamy. Here, we investigate fine-scale population structure, history and consanguinity patterns using genetic and questionnaire data from >4,000 British Pakistani individuals, mostly with roots in Azad Kashmir and Punjab. We reveal strong recent population structure driven by the biraderi social stratification system. We find that all subgroups have had low effective population sizes (Ne) over the last 50 generations, with some showing a decrease in Ne 15-20 generations ago that has resulted in extensive identity-by-descent sharing and increased homozygosity. Using new theory, we show that the footprint of regions of homozygosity in the two largest subgroups is about twice that expected naively based on the self-reported consanguinity rates and the inferred historical Ne trajectory. These results demonstrate the impact of the cultural practices of endogamy and consanguinity on population structure and genomic diversity in British Pakistanis, and have important implications for medical genetic studies.

None of this is entirely surprising. The media in the UK has written about recessive disease load because of cousin-marriage amongst Pakistani Britons. But there are also things in the preprint that need to be made explicit. The “biraderi” social system is apparently a paternal lineage system in the northwest of the Indian subcontinent which transcends religion (i.e., it is present across the border in Indian Punjab). These are “tribal” or “clan” societies in a way that is not present across much of the Indian subcontinent. For example, my family is from eastern Bengal. Before the partition between India and Pakistan, the far northwest and northeast of the subcontinent had the highest proportions of Muslims. But that did not mean that the two regions were culturally very similar, explaining in part the war in 1971 that resulted in Bangladesh. In Bangladesh, biraderi is not known, and the rates of cousin-marriage are much lower than in Pakistan.

One of the things I immediately noticed in the 1000 Genomes data is that Bangladeshis exhibit a lot less structure and stratification than Indians and the samples from Pakistani Punjab. In many ways, the patterns in the Bangladeshi genomes resemble the type of patterns in non-South Asian genomes: an outbreeding population without much internal structure.

This is not typical in South Asia. Rather, Indian populations tend to have lots of differences between jati/caste groups due to endogamy. To my surprise, Pakistani samples from Lahore were similar, though I attributed some of that to the migration of people from India after 1947 (a similar pattern does not hold for Bangladesh, as only a small number of people migrated from India). Additionally, the runs of homozygosity among Pakistani populations indicated lots of consanguineous marriages. While some South Indians marry cousins, the practice is very rare among North Indian Hindus. Rather, the genetic homogeneity of North Indian Hindus is due to the very high endogamy rates. They do not marry outside of their caste.

The results from the British Pakistanis are roughly in line with the 1000 Genomes Pakistanis, but in this case, the researchers had much more granular ethnic data, as well as information on whether individuals were or were not the product of cousin-marriages. In terms of worldwide population affinity, there isn’t a great surprise. The Pathans, who are Iranian-speaking, were distinct. The groups with putative Arab ancestry (Syeds) did not seem to have much of that (really, any).

The figure above shows the long-term effective population size patterns. Within the preprint the authors note that these northwest Indian populations began to diverge ~2,000 years ago. That is roughly in line with what Moorjani et al. found for their Indian samples. This tells us that these Pakistani populations were part of the same cultural milieu as Hindu populations in India itself, whose caste endogamy did not seem to crystallize until about that time. This also seems to run against the thesis presented by some Pakistani nationalists that the northwestern populations were very distinctive “non-Hindu” mlecchas. Al-Biruni and earlier observers identified caste as distinctively Indian, and the likelihood of population structure emerging at the same period in the northwest indicates that these people are broadly part of that milieu.

But I want to focus on the more recent period. Using various methods the authors estimate that the effective population sizes of many of these groups dropped 10-20 generations ago. If you assume 10 generations with generation times of 15 years, that’s 150 years. If you assume 20 generations with generation times of 25 years, that gives you 500 years. So let’s take that as our interval. What’s going on here? I think what this may illustrate is the spread of Muslim practices among Islamicized peoples of the northwest.
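
The interval arithmetic is simple enough to do in your head, but here it is as a tiny Python sketch so the assumptions are explicit. The generation counts and the 15- and 25-year generation times are the ones used above; the function name is just illustrative.

```python
# Convert a number of generations into calendar years for an assumed
# generation time. The bounds pair the fewest generations with the shortest
# generation time and the most generations with the longest one.
def years_ago(generations, years_per_generation):
    return generations * years_per_generation


low = years_ago(10, 15)   # 10 generations at 15 years each
high = years_ago(20, 25)  # 20 generations at 25 years each
print(f"Ne decline roughly {low} to {high} years ago")
```

Other pairings (say, 20 generations at 15 years) fall inside this range, which is why the two extremes are enough to bracket it.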

In my podcast with Henrich he mentions that Islamic societies are peculiar in their ubiquitous practice of “parallel-cousin-marriage.” This means that brothers will marry their children off to each other (a contrast with “cross-cousin-marriage”, common in South India, where brothers and sisters marry their children to each other). The ubiquity of cousin-marriage among Pakistani Muslims is a contrast with genetically and culturally similar populations across the border in India (Indian Punjabis do not marry cousins if Sikh or Hindu).

The fact that this practice occurred among an endogamous group for many generations has consequences. The figure to the right illustrates just how homogeneous some of these groups are against a generic European reference population. It also shows that even nominally unrelated individuals from the same biraderi group are often quite related. As you can see, even people whose parents are unrelated still exhibit excess runs of homozygosity. This is simply a function of pedigrees being narrow: just as in Indian castes, these individuals share many not-so-recent ancestors.

A positive note is that this high level of inbreeding does not apply to Pakistani Britons where both parents were born in the country. That means that biraderi dynamics are maintained due to continuous migration from Pakistan. They’re not perpetuating themselves in the UK.

I started this post with Joe Henrich for a reason: if Henrich is correct that the differences in social structure and relatedness matter for development and economics, then Pakistan and Bangladesh might have different trajectories. Bangladesh is a corrupt and familialist society, just like Pakistan. But that familialism is not as robust and articulated as is the norm in Pakistan. A transition to a more high-trust and non-familial society is more viable and an easier lift for a non-tribal culture where clans do not extend much beyond first cousins.

20th century genetics as basic science and 21st century genetics as basic and applied

There was an offhand comment on Twitter that in the 1970s genetics was barely a field because we’ve made so much progress since then. For obvious reasons, many scientists took umbrage at this. I think it’s wrong and gives the lay public the incorrect impression. But I do think that the way the media and some geneticists have presented the development of the field, between the understanding of DNA as the substrate of inheritance in the 1950s and the explosion of genomics in the 2000s, has fed into this misimpression.

What’s the truth? Genetics predates genomics by a century or more, and DNA by decades. The basics of the field were elucidated by Gregor Mendel in the 1860s. He originated the “laws of inheritance”, though unfortunately his work was ignored by contemporaries. By “laws of inheritance” I mean that Mendel formulated an analytic model that allowed for discrete inheritance and predictions of the outcome of that inheritance. Naive human understanding of heritability usually relies on an intuitive “blended theory”. It works, after a fashion, but it does not explain many patterns we see around us (e.g., recessive expression).

Charles Darwin famously relied upon blended inheritance (in part) as a basis for the heritability which was essential to his theory of natural selection. But, a major problem with blended inheritance is that blending removes variation as everyone becomes a similar “mix”. This is not an issue with Mendelian inheritance, which is discrete. Alleles do not “mix”, but reconfigure every generation. Variation is retained. The “math” of evolution “works” in this manner.
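
The contrast can be shown with a few lines of arithmetic. The Python sketch below is the standard textbook argument, not anything taken from Darwin or Mendel, and it rests on simplifying assumptions that I’m imposing for illustration: an infinite randomly mating population, no selection or mutation, and a single additive locus with two alleles for the Mendelian case.

```python
# Blending vs. particulate (Mendelian) inheritance under random mating.
# Assumptions (illustrative only): infinite population, no selection,
# no mutation, one additive biallelic locus in the Mendelian case.

def blending_variance(var0, generations):
    """Each offspring is the average of two randomly chosen parents, so
    the trait variance is halved every generation."""
    return [var0 / 2 ** t for t in range(generations + 1)]


def mendelian_variance(p, effect, generations):
    """Random mating leaves the allele frequency p unchanged, so the
    additive genetic variance 2*p*(1-p)*effect**2 stays constant."""
    var = 2 * p * (1 - p) * effect ** 2
    return [var for _ in range(generations + 1)]


blend = blending_variance(1.0, 10)
mendel = mendelian_variance(0.5, 1.0, 10)
for t in (0, 1, 5, 10):
    print(f"gen {t}: blending = {blend[t]:.4f}, Mendelian = {mendel[t]:.4f}")
```

After ten generations the blending variance is down to about a thousandth of where it started, while the Mendelian variance has not moved. That is the sense in which selection quickly runs out of material under blending but not under discrete inheritance.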

The utility of Mendelian genetics is why the field exploded in the first two decades of the 20th century. Read A. H. Sturtevant’s 1913 paper on the first genetic map. I think it gives you a flavor of the rate of advancement. Genetics was definitely a field. In The Origins of Theoretical Population Genetics Will Provine outlines how this particular field of genetics developed between 1920 and 1940 to become the core of evolutionary biology. Again, this suggests that even before DNA, genetics was an important field.

But, I do think it is fair to say that before 1950 genetics was very much “basic” science, and remained mostly so to the last decades of the 20th century.* DNA was interesting because it opened up the molecular biology revolution, but that had a very long fuse in terms of applications. PCR made it easier to do DNA testing, while new computing technologies made it much easier to generate and analyze data.

No one needs to be told about how genomics revolutionized genetics. But its major impact has been transforming an often theoretical field into a massively empirical one. Modern genomics is still underpinned by the logic of Mendelian genetics.

* The main exception here I’m going to make is for agricultural genetics, but much of this work doesn’t need “genes” as such.

Sydney Brenner: the passing of a giant

Nobel laureate Sydney Brenner, who helped place Singapore on biotech world stage, dies at 92. If you are in genetics and development you know who Brenner is, and what he meant to these fields. I happened to be in the room with Brenner once, in Berkeley in 2008 I believe. He was already quite an old curmudgeon, and I will say his comments were amusing and awkward!

Long-time readers of this weblog know that about fifteen years ago I dabbled in a little worm-work. At that time I read In the Beginning Was the Worm: Finding the Secrets of Life in a Tiny Hermaphrodite. As Brenner was involved in promoting C. elegans as a model he occupies a lot of this book. I recommend it. It’s short and packed with historical nuggets that make the 21st-century trajectory of science more comprehensible.