The population genetic structure of China (through noninvasive prenatal testing)

Posted on October 6, 2018 by Razib Khan

This week a big whole genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

The lineage of the ancient sage kings

Posted on September 25, 2018 (updated September 30, 2018) by Razib Khan

After recording the “India genetics” podcast for The Insight and reading Early China: A Social and Cultural History, I wonder what surprises we’re going to get from China when ancient DNA comes online. If there is one thing we are learning by looking closely at DNA, modern and ancient, it’s that at least for humans there are very few ‘primal’ populations from the “Out of Africa” event which haven’t been threaded together from pulse admixtures or continuous gene flow across the landscape.

Early China makes it clear that the Erlitou culture, which dates from ~1900 to 1500 BC, was almost certainly the legendary Xia dynasty. This means that the ethnogenesis of the modern Han Chinese probably dates to ~4,000 years ago at the latest. This is centuries before the Indo-Aryans were likely arriving in South Asia, and around the same time that Indo-European groups were pushing into peninsular Southern Europe.

The Y chromosome data does not indicate a Bronze Age ‘star phylogeny’ expansion in East Asia that I know of, so the dynamics were not entirely similar to Western Eurasia. But, it seems quite plausible that the Han themselves are not a chrysalis from the late Pleistocene.

India vs. China, genetically diverse vs. homogeneous

Posted on July 15, 2018 (updated July 16, 2018) by Razib Khan

About 36% of the world’s population are citizens of the People’s Republic of China and the Republic of India. Including the other nations of South Asia (Pakistan, Bangladesh, etc.), 43% of the world’s population lives in China or South Asia.

But, as David Reich mentions in Who We Are and How We Got Here, China is dominated by one ethnicity, the Han, while India is a constellation of ethnicities. And this is reflected in the genetics. The relative diversity of India stands in contrast to the homogeneity of China.

At the current time, the best research on population genetic variation within China is probably the preprint A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese. The authors used low-coverage sequencing of over 10,000 women to get a huge sample of variation all across China. The PCA recapitulated earlier work. Genetic relatedness among the Han of China is geographically structured. The largest component of variance is north-south, but a smaller component is also east-west. The north-south element explains more than 4.5 times as much variance as the east-west.

The great genetic map and history of China

Posted on August 1, 2017 (updated August 2, 2017) by Razib Khan

About 20 percent of the world’s population is Chinese (and since over 90% of Chinese citizens are ethnically Han, by Chinese here I mean Han to a first approximation). In comparison to other non-European groups a fair amount of genetics research has been done with Chinese populations. But in comparison to their overall numbers, not too much has really been done. That will change.

A new preprint, A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, aims to enrich our knowledge set somewhat. The authors used low coverage next generation sequencing to greatly increase their sample size (it is cheaper). By low coverage, I mean that instead of hitting each genetic position on average 30 times or more, as is the norm in medical genomics, they sampled a position closer to twice.
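To get a feel for what "closer to twice" costs you per genome, here is a minimal sketch. It assumes read depth at a site is Poisson-distributed and that each read at a heterozygous site comes from either allele with probability 1/2. Those are textbook simplifications of mine, not the preprint's actual calling model, and the function names are hypothetical.

```python
import math


def prob_site_uncovered(mean_depth: float) -> float:
    """P(zero reads land on a site), assuming Poisson-distributed depth."""
    return math.exp(-mean_depth)


def prob_both_alleles_seen(mean_depth: float) -> float:
    """P(at least one read from each allele) at a heterozygous site.

    Assumption: each read samples either allele with probability 1/2, so
    the two per-allele read counts are independent Poisson(mean_depth / 2).
    """
    p_allele_missed = math.exp(-mean_depth / 2)
    return (1 - p_allele_missed) ** 2


# Compare the low-coverage design (~2x) with typical medical genomics (30x).
for depth in (2, 30):
    uncovered = prob_site_uncovered(depth)
    both_seen = prob_both_alleles_seen(depth)
    print(f"{depth}x: uncovered={uncovered:.4f}, both alleles seen={both_seen:.4f}")
```

Under these assumptions a single 2x genome misses one allele of a heterozygous site more often than not, which is why the design only works when you pool evidence across thousands of individuals.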

But while any given genome was usually not given much close attention, their overall sample size was 11,670 Han Chinese women. Impressive. This means that if they called a position as a variant, they could assess their confidence that it was a variant by looking at how many times it was called as a variant across their data set (as coverage declines, one’s confidence that a variant call is a true call declines, because there is a relatively high base rate of error set against the proportion of true expected polymorphisms; in contrast, if you sample 30 times the error rate gets overwhelmed by repeated sampling). Overall they counted 25,057,223 variants, which sounds about right. They also found 548,401 novel variants with a count of at least 10 in the data set (a ~0.04% allele frequency, so a very low cut-off).
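The ~0.04% figure is just arithmetic: 10 copies out of two chromosomes per woman. A quick check, assuming every individual is diploid at the site and fully counted (the helper name is hypothetical):

```python
def allele_frequency(allele_count: int, n_individuals: int, ploidy: int = 2) -> float:
    """Frequency of an allele seen `allele_count` times among
    `n_individuals` individuals carrying `ploidy` chromosome copies each."""
    return allele_count / (ploidy * n_individuals)


# Cut-off used for the novel variants: a count of at least 10
# among 11,670 women, i.e. 23,340 chromosomes.
freq = allele_frequency(10, 11_670)
print(f"{freq:.4%}")
```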

The most important thing about this preprint is not that the sample size is large enough that they could detect low frequency variants and add to the catalog. No, for me, it is that they sampled so many of the provinces. As you can see in the figure up top, just as in Europe, China’s Han population recapitulates the map of China. That is, populations arrange themselves spatially when projected onto a principal components analysis plot in the same manner that they do geographically. This is a new finding in some ways because previous sampling strategies had not been robust enough to detect the east-west cline (though to be honest if you looked at the Chinese samples in the 1000 Genomes there was a suggestion of this).

All that being said, please note that the PCA is not to scale, insofar as most of the variation is north-south (4 to 5 times more than east-west). Rather like Europe in this regard. Part of this difference is due to the fact that gene flow from non-Han populations, particularly in the South, inflates the genetic variation on the first dimension. Another aspect of interest is that genetic variation between Han populations is rather low to begin with.

One way to visualize this is a matrix like the one to the left. You see pairwise population Fst statistics. The largest is between Guangdong in the south, home to Hong Kong and Guangzhou (Canton), and the northern provinces. The Fst value between Guangdong and Shanxi in the center-north is 0.0029. You may know that the Fst value between Han Chinese and Northern Europeans is ~0.10. That is roughly a 34-fold difference, more than one order of magnitude. As a point of comparison you can find Fst tables which show that values between English and Croatians and between English and Spaniards are about the same as between Guangdong and Shanxi.
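For readers who have not computed Fst themselves, here is a minimal sketch using the textbook definition, Fst = (Ht - Hs) / Ht, for two equally weighted populations at biallelic sites, combined across loci as a ratio of sums. This is my illustration, not necessarily the estimator the preprint used (real estimators also correct for sample size), and the allele frequencies below are made up.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Wright's Fst for two equally weighted populations.

    `freqs_a` and `freqs_b` hold the frequency of the same allele at each
    biallelic locus. No sample-size correction is applied.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        # Expected heterozygosity if the two populations were pooled.
        h_total = 2 * p_bar * (1 - p_bar)
        # Mean expected heterozygosity within each population.
        h_within = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


# Made-up frequencies that differ by only five percentage points per locus.
toy_fst = fst_two_populations([0.50, 0.20], [0.55, 0.25])
print(f"toy Fst: {toy_fst:.4f}")

# The comparison in the text: Han vs. Northern Europeans (~0.10)
# against Guangdong vs. Shanxi (0.0029).
fold_difference = 0.10 / 0.0029
print(f"fold difference: {fold_difference:.1f}")
```

The toy numbers make the point that an Fst of a few thousandths corresponds to allele frequencies that differ by only a few percentage points between populations.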

What is just as interesting is the very low genetic differentiation on the North China plain. Why is this? There are two reasons I can think of. The easy explanation is that across politically unified flat landscapes gene flow occurs so easily that genetic differences disappear over time.

But, this presupposes there were genetic differences in the first place. The reason I say this is that though there was an early period of migration from the north to the south (from the Han dynasty onward), and absorption of non-Chinese peoples, there were also periods when much of China north of the Yangtze river valley was under barbarian domination or politically unstable. Elite northern families fled to the south, and eventually when political stability reemerged migrated back to the north (similarly, persistent north-south migration occurred, as the Hakka people of South China are clearly of northern provenance).

The low genetic differentiation across northern China may then be thought of as the outcome of structural fixtures of the landscape (no mountains to obstruct gene flow), as well as possibly due to historical instances of copious back-migration from various regions of southern China (or perhaps more accurately Central China, as I’m presuming much of the settlement would come from the lower Yangtze river valley). Both of these dynamics may have led to little intra-regional structure. In contrast you notice that genetic distance between Fujian and Guangdong, two regions adjacent to each other in the South, is still higher than between any of the northern regions.

Again, this is not surprising due to both geography and history. The dialect map of China shows that southeast China is more fragmented than the north (or southwest). These differences are long-standing and date to the initial founding of Han communities in the south via migrants from the north. Unlike North China, South China is a topographically diverse landscape, with beautiful escarpments and deep gorges. Fujian literally hugs the ocean, and has long had a relationship to overseas communities for this reason. Geographic barriers mean there are genetic barriers. Combined with admixture with local populations, this means it is not surprising that there are greater genetic differences between southern regions than in the north.

Additionally, China south of the Yangtze has been relatively shielded from foreign conquest and invasion compared to the North China plain. Obviously events like the Taiping rebellion and famine more generally had impacts on South China, but North China has had more periods of domination in a destabilizing manner by non-Chinese invaders over the past 2,000 years.

Perhaps more intriguing than the modern genetic relationships within China are the relations with non-Chinese populations. It is not surprising that the South Chinese populations show evidence of admixture with Dai and Taiwanese aboriginals (the basal group of the Austronesian migration). The genetics and cultural practices in parts of South China have long suggested relationships to indigenous groups, as well as Sinicization. Honestly I suspect many were surprised how similar North and South Chinese were, indicating either continuous gene flow or descent from a large demographic expansion.

More curious is that some North Chinese seem to show evidence of admixture with West Eurasians. In particular, they show affinities with European populations. Again, this is not surprising. Some earlier analyses have shown evidence of European-like admixture in northern China, and among ethnic groups like Mongolians. More precisely there are strong signals of European-like admixture in the northwestern provinces of Gansu, Shaanxi, and Shanxi.

The details here are important though. The authors note that Hellenthal et al. used haplotype based methods to detect admixture from Northern Europeans into North China, and dated it to around 1200 AD. This preprint finds a similar admixture date. But they caution that these admixture dates may only signal the latest of the events.

As for what that event could be, there was clearly turmoil on the Silk Road in the years around 1000 AD. After 750 AD for all practical purposes the Chinese lost control of their portion of the Silk Road, what is now Xinjiang. Turkic groups like Uyghurs and Iranian ones such as Sogdians were prominent in China due to a power vacuum (the Uyghurs were used by the Tang emperors like the Germans were used by the later Roman Emperors, as federates). Later on one saw the emergence of Tanguts, various groups from Manchuria, and finally the Mongols. Since both haplotype based methods and this preprint suggest something around 1000 AD, the most likely candidate was the absorption of Central Asians with some European-like ancestry into the Chinese substrate. The Uyghur conquest of the major cities in the centuries before the rise of the Mongols famously resulted in the assimilation of a European-like population which had earlier spoken Indo-European languages.

But admixture was not a feature of just recent Chinese history. The figure to the right is somewhat difficult to read, but it shows on the y-axis variance in the f3 statistic. In short, how well does the Chinese data set here form a clade with the outgroup, and how much does that statistic vary between groups? The x-axis is for the D statistic, which measures the relationship of four populations, with two clades. On the bottom left you see the Siberian genome from 45,000 years ago. On the y-axis you can see all provinces show very little variation, and that’s because the Siberian genome is old enough that it is basal to all the Chinese and Europeans. The D statistic indicates no gene flow between the Siberian populations and modern groups. Not so with other populations. You see the Pleistocene European populations are shifted to the right, and that’s because they all contribute to later Europeans. The Chinese-European clade is not a good fit. This is true across the Chinese populations (so the variance of the f3 statistic is very low).
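If the two statistics are unfamiliar, here is a minimal sketch of their standard allele-frequency definitions. These are the generic textbook forms, without the sample-size corrections and block-jackknife standard errors a real analysis needs, and they are not taken from the preprint's pipeline. All function names and toy frequencies are mine.

```python
def d_statistic(p1, p2, p3, outgroup):
    """ABBA-BABA D statistic for the tree ((P1, P2), P3, Outgroup).

    Each argument is a list of frequencies of the same allele per site.
    D near 0 is consistent with P1 and P2 forming a clade relative to P3;
    D > 0 means an excess of allele sharing between P2 and P3.
    """
    abba = 0.0
    baba = 0.0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Both polarizations are counted, so the tracked allele does not
        # have to be the derived one.
        abba += (1 - a) * b * c * (1 - o) + a * (1 - b) * (1 - c) * o
        baba += a * (1 - b) * c * (1 - o) + (1 - a) * b * (1 - c) * o
    total = abba + baba
    return (abba - baba) / total if total > 0 else 0.0


def f3_statistic(target, source_a, source_b):
    """f3(target; A, B): mean of (t - a) * (t - b) over sites.

    With an outgroup as the target it measures drift shared by A and B
    (larger means more shared history); a clearly negative value with a
    non-outgroup target is a signal that the target is admixed.
    """
    terms = [(t - a) * (t - b) for t, a, b in zip(target, source_a, source_b)]
    return sum(terms) / len(terms)


# Toy frequencies: P2 is shifted toward P3 at every site.
toy_d = d_statistic(
    p1=[0.10, 0.20, 0.30],
    p2=[0.30, 0.40, 0.50],
    p3=[0.80, 0.90, 0.70],
    outgroup=[0.00, 0.00, 0.00],
)
print(f"toy D: {toy_d:.3f}")

# Outgroup f3: A and B both drifted away from the outgroup together.
toy_f3 = f3_statistic(target=[0.0, 0.0], source_a=[0.5, 0.4], source_b=[0.5, 0.6])
print(f"toy outgroup f3: {toy_f3:.3f}")
```

A province-by-province plot like the one described would compute these for each Chinese province against each ancient sample, then look at how much the values differ from province to province.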

Also in the text they note that there is high shared drift with the three “Ancient North Eurasian” (ANE) samples from Siberia. This is discussed extensively in the supplements to Lazaridis et al. 2016. Another replicated finding is that the Chinese share drift with ancient European hunter-gatherers. The drift declines later on, likely because the Chinese do not share as much drift with the early farmers. This is due in part to the “Basal Eurasian” (BEu) element. But in Fu et al. 2016 they observe that drift between East Eurasians and European hunter-gatherers increases after 15,000 years BP, when there was a genetic turnover, and the Villabruna cluster (in their terminology) came to dominate the landscape.

The most probable, though not certain, explanation for this pattern is that ANE populations contributed ancestry to both antipodes of Eurasia: to European hunter-gatherers, and to the ancestors of the Chinese in Pleistocene East Asia (remember that there was a fusion between a proto-East Asian population and ANE to give rise to the ancestors of Amerindians 15-20,000 years ago). Another explanation could be East Asian gene flow rather early on into Europe, some time after the Last Glacial Maximum ~20,000 years ago. We don’t have the sample density outside of Europe to really say with certainty.

Finally, I have to mention that at SMBE Melinda Yang of Qiaomei Fu’s lab gave a talk about the Tianyuan genome. Their group has found that the Tianyuan individual, who dates to 40,000 years ago, is the likely ancestor of modern East Asians. That is, Tianyuan shares more drift with modern East Asians than Europeans. No huge surprise. What was surprising though is that Tianyuan also shared appreciable drift with GoyetQ116, a 35,000 year old sample from Belgium, whose descendants seem to have played a role in the emergence of the Magdalenian culture. But not later European hunter-gatherer populations. The Tianyuan sample also seemed to share some drift with Australasian samples (a possible resolution for why some Amerindians share drift with Oceanians presents itself here obviously). Overall, the group’s conclusion was that this might be evidence of ancient population structure rather early on in the “Out of Africa” populations, which eventually carried over as the groups dispersed (rather than each geographic region being direct descendants from a single panmictic “Out of Africa” group). The implications here are beyond the purview of Chinese genetics so I’ll address it in a later post.

I have to mention there is a fair amount within this paper on selection as well as medical genetics. I didn’t tackle that in this post since there’s so much phylogenomics one could talk about.

Citation: A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, Charleston W. K. Chiang, Serghei Mangul, Christopher R. Robles, Warren W. Kretzschmar, Na Cai, Kenneth S. Kendler, Sriram Sankararaman, Jonathan Flint

bioRxiv 162982; doi: https://doi.org/10.1101/162982