Uyghur genetics and Kenneth Kidd – going beneath the surface

The latest episode of NPR’s “Planet Money” was interesting to me and touched upon issues I’ve been thinking on a lot. Stuck In China’s Panopticon has a genetic angle. The Chinese government seems to be identifying and tracking Uyghurs with genetics. Or at least has the capability to do so. That is, in part, thanks to the work of Kenneth Kidd.

If you have read this weblog for a long time, or are a geneticist, you know who Kenneth Kidd is. You may have used his Alfred database. Though Wikipedia states that Kidd has been doing science in China since 1981, the podcast suggested that Kidd’s work under scrutiny dates to 2010.

That’s important. Because the reality is that the Chinese government did not need this late sampling to genetically identify Uyghurs. The HGDP data set has 10 Uyghurs already. People had been publishing on the pop genetics of the Uyghurs for more than 10 years by the time Kidd did his sampling. Alfred has 94 Uyghurs. This is better than 10, but for forensic purposes of ethnic identification, it’s probably superfluous.

In 2008 two Chinese researchers had already published a population genetic analysis with a bigger sample size than the HGDP. Kidd is not on the author list, so I don’t think he was involved.

Basically, Uyghurs are a group that will show admixture between various East and West Eurasian ancestry components many generations ago. This was already known before 2010. Only a few groups within China, such as Kazakhs, are even close to similar in their profile.

There is one area where I think Kidd’s work may have been pushing the frontier a bit: doing genealogical matching on diverse Uyghurs. Though I can’t imagine you could get more close relatives, the greater geographic diversity would probably implicate many more pedigrees.

Ultimately I don’t think the big picture is about Kenneth Kidd. Yes, forensics, genetics, and the Chinese government give many Americans nightmares. But thousands and thousands of scientists in America do work in China, with China, or are themselves of Chinese origin. American researchers develop technology that is later used in China to clamp down on various dissenters from the regime in an authoritarian manner. American consumers purchase goods and services that power the Chinese economy. American researchers collaborate with Chinese researchers and have indirectly furthered Chinese institutions such as the Beijing Genomics Institute.

I think we need to be honest that this implicates all of us in a globalized “just-in-time” world economy. Do the reporters interviewing Kidd use iPhones made in China?

And, it even goes well beyond China. In general, I think the United States is a force for good. But, as the world’s current superpower we have done some nasty things. Our democratically elected presidents, all of the recent ones, have sent people to their deaths for the good of the world (so they thought). We have intervened in nations and caused massive destruction and death, even though we meant well. Many non-Americans have a deep suspicion of our nation because of the dark shadow that it casts in certain circumstances.

There are bigger questions about power, morality, and individual responsibility and culpability that I wish we’d address, rather than focusing on a single researcher. Especially when I don’t think Kidd’s work was nearly as necessary and essential as the media portrays it.

Unleash the data kraken!


The Reich lab has done a mitzvah and released a huge merged dataset of their modern and ancient populations in a big tarball. Actually, there are two files. One of them is a larger number of individuals with 600,000 SNPs (includes “Human Origins Array”) and the other has 1,200,000 SNPs, but fewer individuals. It is in EIGENSTRAT format.

For the convenience of readers who are more comfortable in PLINK/PEDIGREE format, I’ve converted them, and replaced the family ID column with population labels. The links take you to a zip file that has the three files for the binary format.
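For readers who want to do the relabeling themselves, here is a minimal sketch of the idea in Python. It is not necessarily how the files above were produced. It assumes the usual layouts: an EIGENSTRAT .ind file with three whitespace-separated columns (sample ID, sex, population label) and a PLINK .fam file with six (family ID, individual ID, father, mother, sex, phenotype). The function names are hypothetical.

```python
def read_ind_populations(ind_lines):
    """Map sample ID -> population label from the lines of an EIGENSTRAT .ind file.

    Assumes three whitespace-separated columns: sample ID, sex, population label.
    """
    populations = {}
    for line in ind_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sample_id, population = fields[0], fields[2]
        populations[sample_id] = population
    return populations


def relabel_fam(fam_lines, populations):
    """Return .fam lines with the family ID (column 1) replaced by the population label.

    Individuals without a known label keep their original family ID.
    """
    relabeled = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns in .fam line, got: {line!r}")
        individual_id = fields[1]
        fields[0] = populations.get(individual_id, fields[0])
        relabeled.append(" ".join(fields))
    return relabeled
```

With the family ID carrying the population label, PLINK's family-based filters (such as `--keep-fam`) can be used to pull out whole populations.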

The people of the Andaman Islands are not genetic fossils

So this is in the news, Police: American adventurer John Allen Chau killed by isolated Sentinelese tribe on Indian island. There is some talk about whether the guy was a Christian missionary or not, but that’s not really too relevant. Whether he believes in evolution or not (he was a graduate of a very conservative Christian college), he definitely won a Darwin award before he expired.

North Sentinel is totally isolated, and the people who live there, the Sentinelese, are out of contact with the rest of the world. They are hostile to the outside world. And this is probably why the Sentinelese are still around, as the outside world does not have a good track record with hunter-gatherers. The Andamanese as a whole had a reputation for being very hostile to outsiders, as traders knew not to stop too long for water.

Because the Sentinelese are back in the news, lots of stuff is being said about them in terms of their ancestry.

First, they are not that genetically unique. A recent paper on the genetics of Southeast Asia using ancient samples makes their affinities clear. The Onge, an Andamanese tribe, are positioned close to the two ancient samples from Laos and Malaysia. They emerge out of the same milieu as Paleolithic Southeast Asians (whose Hoabinhian culture persisted deep into the Holocene).

The Andamanese themselves are probably from mainland Southeast Asia. The gap between the islands and the mainland was smaller ~20,000 years ago when the sea levels were lower. They could have come up from the south or the north.

Second, they are not the most “ancient” people. That doesn’t make any sense. We are all people who are equally ancient. Outside of Africa, we all descend by and large from a migratory wave that expanded ~60,000 years ago: Andamanese, Chinese, and Europeans alike. What is “ancient” about them is that they are hunter-gatherers who have continued to practice that mode of production down to the present. But that’s a matter of culture and not genetics.

Third, in alignment with the above two points, they are not uniquely and distinctly isolated from all other human populations. They are not descendants of an early wave out of Africa preserved on these islands. They are not distinct from all other non-Africans. Rather, they seem to be closer to the peoples of Oceania, Papuans, and Australian Aboriginals, than to Northeast Asians. And closer to Northeast Asians than they are to West Eurasians. The latest evidence is that the Andamanese were part of a broader diversification of lineages ~40-50,000 years ago to the east of India that gave rise to the peoples of the western Pacific Rim. Within this broader set of groups, some form a distinct clade that is not with Northeast Asians (often these are labeled “Australasian”).

Finally, the census size for the Sentinelese is in the range of 100 individuals. This seems on the edge of viability over the long term.

It’s raining selective sweeps

A week ago a very cool new preprint came out, Identifying loci under positive selection in complex population histories. It’s something that you couldn’t even have imagined just ten years ago. The authors basically figure out ways to identify deviations of markers from the allele frequencies expected under a null neutral evolutionary model. The method is put first, which I really like, before getting to results or discussion. Additionally, they did a lot of simulation ahead of time. The sort of simulation that was really not possible before the sort of computational resources we have now.

Here’s the abstract:

Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method – called Graph-aware Retrieval of Selective Sweeps (GRoSS) – has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in particular bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.

On a related note in regards to selection, On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen. The authors are responding to a recent preprint criticizing their earlier work. The reason that it’s fascinating to me is that these sorts of arguments today are really concrete and not so theoretical. There’s a lot of data for analytic techniques to chew through, and computation has really transformed the possibilities.

A generation ago these sorts of debates would be a sequence of “you’re wrong!” vs. “no, you’re wrong!” Today the disputes involve a lot of data, and so have a reasonable chance of resolution.

The first preprint identifies the usual candidates in humans that you normally see, and expected targets in cattle and cod. Sure, that will give biologists more interested in mechanisms and pathways things to chew upon, but imagine once researchers have large numbers of genomes for thousands and thousands of species. Then they’ll be testing deviations from neutral allele frequencies across many trees, and getting a more general and abstract sense of the parameter space that selection explores, conditional on particularities of evolutionary history.

This is why I’m excited about plans to sequence lots and lots of species.

The post-neutral human genome (the Kern-Hahn era)

If you have any background in evolutionary biology you are probably aware of the controversy around the neutral theory of molecular evolution. Fundamentally a theoretical framework, and instrumentally a null hypothesis, it came to the foreground in the 1970s just as empirical molecular data in evolutionary biology was becoming a thing.

At the same time that Motoo Kimura and colleagues were developing the formal mathematical framework for the neutral theory, empirical evolutionary geneticists were leveraging molecular biology to more directly assay natural allelic variation. In 1966 Richard Lewontin and John Hubby presented results which suggested far more variation than they had been expecting. Lewontin argued in the early 1970s that their data and the neutral model were actually a natural extension of the “classical” model of expected polymorphism as outlined by R. A. Fisher, as opposed to the “balance school” of Sewall Wright. In short, Lewontin proposed that the extent of polymorphism was too great to explain in the context of the dynamics of the balance school (e.g., segregation load and its impact on fitness), where numerous selective forces maintained variation. The classical school emphasized both strong selective sweeps on favored alleles and strong constraint against most new mutations.

And yet one might expect low levels of polymorphism from the classical school. The way in which the neutral framework was a more natural extension of this model is that even if most inter-specific variation, most substitutions across species, are due to selectively neutral variants, most variants could nevertheless be deleterious and so constrained. Alleles which increase in frequency may have done so through positive selection, or, just random drift. Not balancing forces like diversifying selection and overdominance.

The general argument around neutral theory generated much acrimony and spilled out from the borders of population genetics and molecular evolution to evolutionary biology writ large. Stephen Jay Gould, Simon Conway Morris, and Richard Dawkins, were all under the shadow of neutral theory in their meta-scientific spats about adaptation and contingency.

That was then, this is now. I’ve already stated that sometimes people overplay how much genomics has transformed our understanding of evolutionary biology. But in the arguments around neutral theory, I do think it has had a salubrious impact on the tone and quality of the discourse. Neutral theory and the great controversies flowered and flourished in an age where there was some empirical data to support everyone’s position. But there was never enough data to resolve the debates.

From where I stand, I think we’re moving beyond that phase in our intellectual history. To be frank, some of the older researchers who came up in the trenches when Kimura and his bête noire John Gillespie were engaged in a scientific dispute which went beyond conventional collegiality seem to retain the scars of that era. But younger scientists are more sanguine, whatever their current position might be, because they anticipate that the data will ultimately adjudicate. There is simply so much of it.

With that historical context, consider a new paper, Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences:

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.

This is not an entirely surprising result. Some researchers in human genetics have been arguing for the pervasiveness of background selection, selection against deleterious alleles which affects nearby regions, for nearly a decade. In contrast, there are others who argue selective sweeps driven by positive selection are important in determining variation. Unlike in the 1970s and 1980s, these researchers don’t evince much acrimony, in part because the data keeps coming, and ultimately they’ll probably converge on the same position. And, the results may differ by species or taxon.

If you want a less technical overview than the paper, Kelley Harris has an excellent comment accompanying it. If you want to know what I mean by the Kern-Hahn era, it’s a joke due to the publication of The Neutral Theory in Light of Natural Selection.

Finally, some of you might wonder about the implications for demographic inference, which preoccupies me so much on this weblog. In the big picture, it probably won’t change a lot, but it will be important for the details. So this is a step forward. That being said, the possibility of variable mutation rates and recombination rates across time and between lineages is also probably quite important.
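The conditioning step described in the abstract is simple enough to sketch. What follows is only an illustration of the filter as the abstract states it (recombination rate above 1.5 cM/Mb, and C↔G or A↔T mutation types), not the authors’ pipeline. The `Snp` record, its field names, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """Hypothetical minimal SNP record; the field names are assumptions."""
    snp_id: str
    ref: str
    alt: str
    recomb_rate_cm_per_mb: float  # local recombination rate at the site


# C<->G and A<->T changes do not alter GC content, so they are the
# mutation types the abstract keeps to sidestep gBGC.
GC_CONSERVATIVE_PAIRS = {frozenset("CG"), frozenset("AT")}


def is_gc_conservative(ref, alt):
    """True if the two alleles form a C<->G or A<->T pair."""
    return frozenset((ref.upper(), alt.upper())) in GC_CONSERVATIVE_PAIRS


def keep_for_demography(snp, min_rate_cm_per_mb=1.5):
    """Keep SNPs in high-recombination regions with GC-conservative alleles."""
    return (
        snp.recomb_rate_cm_per_mb > min_rate_cm_per_mb
        and is_gc_conservative(snp.ref, snp.alt)
    )


def filter_snps(snps, min_rate_cm_per_mb=1.5):
    """Return the subset of SNPs passing both conditions, in input order."""
    return [s for s in snps if keep_for_demography(s, min_rate_cm_per_mb)]
```

Both conditions are strict, so this throws away most of the genome. That is the price of a set of sites that behaves closer to neutral.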

The population genetic structure of China (through noninvasive prenatal testing)


This week a big whole genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.


The derived SNP that causes dry earwax was not found in all non-Africans

A new paper on Chinese genomics using low-coverage data from hundreds of thousands of NIPT screenings is making some waves. I’ll probably talk about the paper at some point. But I want to highlight the frequency of rs17822931 in Han Chinese. It’s pretty incredible how high it is.

Because the derived variant of this SNP, which is correlated with dry, flaky earwax in homozygous genotypes, is also associated with less body odor, it has been studied extensively by East Asian geneticists. Basically, individuals who are homozygous for the ancestral variant, which is the norm in Europe, the Middle East, and Africa, tend to have more body odor. In societies and contexts where this is offensive, these people are subject to more ostracism, and in East Asia they are a minority (some of the studies in Japan were motivated by conscripts who elicited complaints from their colleagues).

The relatively low frequency in Guangxi is to be expected. This province was Sinicized only recently. As in, the last 500 years. And it still retains a huge ethnic minority population, and many of the Han in the province likely have that ancestry. But the question still arises: why do the Han have such a high frequency of rs17822931?

Here’s a plot of frequencies:


Chinese and Indian American population genetic structure

In Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past David Reich makes the observation that India is a nation of many different ethnicities, while China is dominated by a single ethnicity, the Han. This is obviously true, more or less. Even today the vast majority of Indians seem to be marrying within their own communities, jati.

Over the years I’ve collected many different genotypes of Americans of various origins who have purchased personal genomics kits, and given me their raw results. I decided to go through my collection and strip detailed ethnic labels and simply group together all those individuals from India, and China, who have had their genotypes done from one of the major services.

I suspect that these individuals are representative of “Indian Americans” and “Chinese Americans.” So what’s their genetic structure?


How related should you expect relatives to be?

Like many Americans in the year 2018 I’ve got a whole pedigree plugged into personal genomic services. I’m talking from grandchild to grandparent to great-aunt/uncles. A non-trivial pedigree. So we as a family look closely at these patterns, and we’re not surprised at this point to see really high correlations in some cases compared to what you’d expect (or low).

This means that you can see empirically the variation between relatives of the same nominal degree of separation from a person of interest. For example, each of my children’s grandparents is expected to contribute 25% of their autosomal genome, absent any other information. But I actually know the variation in contribution empirically. For example, my father is enriched in my daughter. My mother is enriched in my sons.
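To get a feel for why a grandparent’s share scatters around 25%, here is a toy simulation. It is a sketch under loud assumptions and not how any genomics service computes relatedness. Crossovers are placed as a Poisson process at one per Morgan with no interference, and the 22 autosome map lengths are made-up illustrative values rather than a measured genetic map.

```python
import random
import statistics

# Illustrative autosome map lengths in Morgans (an assumption, not real data).
CHROM_LENGTHS = [2.8 - 0.1 * i for i in range(22)]


def grandparent_fraction(rng):
    """Fraction of a grandchild's autosomal genome that comes from one grandparent.

    The parent passes on one gamete per chromosome, a mosaic of the two
    haplotypes the parent received from his or her own parents.
    """
    from_grandparent = 0.0
    for length in CHROM_LENGTHS:
        # The gamete starts on a randomly chosen grandparental haplotype.
        on_target = rng.random() < 0.5
        pos = 0.0
        while True:
            # Distance to the next crossover: exponential, rate 1 per Morgan.
            next_crossover = pos + rng.expovariate(1.0)
            segment_end = min(next_crossover, length)
            if on_target:
                from_grandparent += segment_end - pos
            if next_crossover >= length:
                break
            pos = next_crossover
            on_target = not on_target  # switch haplotype at the crossover
    # The gamete makes up half of the grandchild's diploid autosomal genome.
    return from_grandparent / (2 * sum(CHROM_LENGTHS))


def simulate(n=2000, seed=1):
    """Simulate n independent grandparent-to-grandchild transmissions."""
    rng = random.Random(seed)
    return [grandparent_fraction(rng) for _ in range(n)]


if __name__ == "__main__":
    fractions = simulate()
    print("mean:", statistics.mean(fractions))
    print("stdev:", statistics.stdev(fractions))
```

The mean should come out close to 0.25, and the spread around it is the whole point: the number of crossovers per generation is small, so large chunks of a chromosome are inherited from one grandparent or the other.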

The same principle applies to siblings. Though they should be 50% related on their autosomal genome, it turns out there is variation. I’ve seen some papers with large data sets (e.g., 20,000 sibling pairs) which give a standard deviation of 3.7% in relatedness. But what about other degrees of relation?


David Burbridge’s 10 questions for A. W. F. Edwards In 2006

A few years ago I watched a documentary about the rise of American-influenced rock music in Britain in the 1960s. At some point, either one of the Beatles, probably Paul McCartney, or Eric Clapton, was quoted as saying that they wanted to introduce Americans to “their famous people.” Though patronizing and probably wrong, what they were talking about is that there were particular blues musicians who were very influential in some British circles but were lingering in obscurity in the United States of America due to racial prejudice. The bigger picture is that there are brilliant people who for whatever reason are not particularly well known to the general public.

This is why I am now periodically “re-upping” interviews with scientists that we’ve done on this weblog over the past 15 years. These are people who should be more famous. But aren’t necessarily.

In 2006 David Burbridge, a contributor to this weblog and a historian of things Galtonian, interviewed the statistical geneticist A. W. F. Edwards. Edwards was one of R. A. Fisher’s last students, so he has a connection to a period of history that is passing us by.

I do want to say that his book, Foundations of Mathematical Genetics, really gave me a lot of insights when I first read it in 2005 and began to be deeply interested in pop gen. It’s dense. But short. Additionally, I have also noticed that there is now a book out which is a collection of Edwards’ papers, with commentaries, Phylogenetic Inference, Selection Theory, and History of Science. Presumably, it is like W. D. Hamilton’s Narrow Roads of Gene Land series. I wish more eminent researchers would publish these sorts of compilations near the end of their careers.

There have been no edits below (notice the British spelling). But I did add some links!

David’s interview begins after this point:
