The population genetic structure of China (through noninvasive prenatal testing)

This week a big whole-genome analysis of China was published in Cell: “Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History.” The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

The PCA at the top of the post is a bit busy, so I want to highlight the salient aspect. These results confirm that 5-10% of the ancestry of the Hui, Chinese-speaking Muslims, is West Eurasian. The Uygur and Kazakh, on the left of the plot, are about 40% West Eurasian. The authors note that the Manchus overlapped almost perfectly with individuals sampled from Northern China. This is expected because by the end of the Qing dynasty most of the Manchus had been fully Sinicized, and in the 20th century they were fully assimilated. Recently, due to an emphasis on “national minorities” and some privileges granted therein, many people have identified as Manchu on the basis of some ancestry who in all other ways are simply northern Han (the Manchu language is moribund).

The sections on particular adaptations which vary by region are not surprising. In books like The Retreat of the Elephants the slow, gradual, and inexorable expansion of the Chinese beyond the Yangzi basin is described in a way that makes it clear that southern diseases and climate were a major impediment. But through a process of acclimation, assimilation of local peoples, and adaptation, by 1000 AD the center of demographic gravity had shifted to the south.

There is a section of the text which I think will be falsified though:

After removing participants with 49bp read length and with sequencing error rate >0.00325, a principal component analysis of 45,387 self-reported Han Chinese from the 31 administrative divisions showed that the greatest differentiation of Han Chinese is along a latitudinal gradient (Figures S3E and S3F), consistent with previous studies (Chen et al., 2009, Xu et al., 2009). In contrast, there is, perhaps surprisingly, very little differentiation from East to West. This observation may be explained by the fact that a large proportion of the western Han populations in China are recent immigrants organized by the central government starting from 1949 when the People’s Republic of China was founded (Liang and White, 1996).

I don’t think there’s any need to make recourse to migration from 1949 and after. The argument in Guns, Germs, and Steel suffices: it’s just easier to move east-west along a line of latitude than north-south across latitudes. The people of the north eat noodles made from wheat, and the people of the south eat rice. This is a big cultural transition for peasants to make, and so it didn’t happen as often as moving to the coast, or inland. We have documented instances of mass migrations from adjacent provinces due to famine and political instability. In the 17th century conflicts resulted in the depopulation of Sichuan and the arrival of large numbers of people from Hunan and Hubei to the east.

The plot below is one of the more interesting ones from the paper. From left to right: alleles private to the HapMap Utah whites (CEU) that are also found among all individuals in a given province, then among just the Han; then alleles private to ethnic Telugu Indians (from South India) that are found among all individuals in a given province, and then among just the Han.
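
To make the notion of sharing private alleles concrete, here is a minimal Python sketch. It assumes “private” means present in one reference panel and absent from the others, which may not match the paper’s exact definition; the variant IDs, panel contents, and province names are all made up for illustration.

```python
# Toy illustration only: variant IDs, panels, and provinces are hypothetical.
reference_panels = {
    "CEU": {"v1", "v2", "v3", "v7"},
    "Telugu": {"v3", "v4", "v5"},
    "other_panel": {"v3", "v6", "v7"},
}


def private_alleles(pop, panels):
    """Alleles seen in `pop` and in no other reference panel (assumed definition)."""
    others = set().union(
        *(alleles for name, alleles in panels.items() if name != pop)
    )
    return panels[pop] - others


def shared_private_fraction(pop, panels, province_alleles):
    """Fraction of `pop`'s private alleles that also turn up in a province sample."""
    private = private_alleles(pop, panels)
    if not private:
        return 0.0
    return len(private & province_alleles) / len(private)


# Two made-up provinces, each described by the set of alleles observed there.
province_a = {"v1", "v3", "v6"}
province_b = {"v4", "v6", "v7"}

for name, alleles in [("province_a", province_a), ("province_b", province_b)]:
    for pop in ("CEU", "Telugu"):
        frac = shared_private_fraction(pop, reference_panels, alleles)
        print(f"{name}: shares {frac:.0%} of {pop}-private alleles")
```

The real analysis works with allele frequencies across millions of sites and low-coverage reads, so this only shows the set logic behind the comparison, not the paper’s pipeline.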


The first thing to notice is that there is a correlation between the Han and non-Han. This shouldn’t be surprising. Plenty of ethnic groups have become Han through acculturation and become demographically absorbed. This is probably truer in parts of the south than in the north, but southern Chinese ethnic minorities are genetically and culturally much more like the Han in the first place.

The sharing of private alleles with Northern Europeans (CEU) almost certainly has to do with the interaction sphere of the steppe pastoralists, which extends from the Carpathians to Mongolia. The relatively high frequency of R1a, and to a lesser extent R1b, among many Turkic/Central Asian peoples is a pretty good sign of where this West Eurasian ancestry comes from.

The Indian affinity is perhaps more interesting. To be honest I was surprised at the high affinity in Yunnan and Hainan. Tibet has strong cultural connections to India through its form of Buddhism. But it’s interesting that Qinghai, where many Tibetans also live, does not have the affinity with India. What’s going on in the other provinces? I suspect that the aboriginal peoples assimilated by the Han and other groups in this region probably had some distant connections to the non-West Eurasian ancestry in South Asia.


6 thoughts on “The population genetic structure of China (through noninvasive prenatal testing)”

  1. First time I went to Hainan, it was interesting to me that there are a couple of prominent, and very old looking, mosques in Sanya; possibly more that might not be so noticeable.

    The other thing I found interesting is that the local Han, as opposed to Han from other parts of China who have moved there in recent time to work, are very small in stature and pale brownish in skin tone, and very friendly and talkative. That’s as opposed to the aboriginal ethnic minorities, who are definitely not friendly or talkative. So, I assumed that was due to mixing – question is, mixing with whom?

  2. If I understand correctly, the Xibe still (to some extent) identify as a separate ethnic group and speak a Manchu-related language. Is anything known about their genetics?

  3. In Outer Manchuria (Primorsky Krai, Amur Oblast, Jewish Autonomous Oblast, Khabarovsk Krai) in Russia, I believe that there are still some Manchu/Jurchen-related people (Udege, Nanai, Ulchi).

    Have there been studies done on them?

  4. Xibe Y-DNA (from the HGDP sample set; cf. Shi et al. 2010, Lippold et al. 2014, YFull, etc.)
    1/8 C-L1373(xM48, M401) [probably C-F1756]
    1/8 C-M77/M86(xB469)
    1/8 C-B469*(xB89)
    1/8 J2a1h-M530
    1/8 K-M9(xL-M20, M1-M106, NO-M214, P-M45) [most likely haplogroup T; less likely, it may be a rare instance of K*, most of which have been found in South Asia, Southeast Asia, and Oceania, or a geographically outlying member of Near Oceanian haplogroup S]
    1/8 N-Tat [probably N-F4205; TMRCA with an ethnic Mongol is approximately 1,500 ybp; that pair’s TMRCA with a pair of Russians is approximately 4,000 ybp]
    1/8 O-M122(xO2a1-KL1/L465, O2a2-IMS-JST021354/P201)
    1/8 O-CTS335*(xCTS3856) [a branch of O-F444; O-CTS3856 has been found in at least Hunan Han, Beijing Han, and Thailand, and another instance of O-CTS335*(xCTS3856) in at least one Japanese in Tokyo; TMRCA estimated to be 4,200 [95% CI 3,100–5,300] ybp]

    Xibe mtDNA (from the HGDP sample set)
    1/9 U4a2a1 [U4a is found in Spain, Italy, Czech, Slovakia, Poland, Belarus, Russia, Finland. U4a2 is found in England, Slovakia, Poland, Bulgarian Turks, Belarus, Estonia, Finland, Russia, Pamir. U4a2a is found in Spain, Italy, Czech, Slovakia, Serbia, Poland, Lithuania, Sweden, Russia, Armenian. U4a2a1 is found in Serbia, Swedish.]
    1/9 F1a1c [Russia, Mongolia, Japan, China, Tibet, Thailand, Moken]
    1/9 R11b [Altai-Kizhi, China, Tibet, Thailand-Laos]
    1/9 C4a1’5 [all over Central Asia and southern Siberia]
    1/9 C5a1 [Ulchi, Khanty, Yakut, Altai-Kizhi, Buryat, Bargut, Severo-Evensk, Khamnigan, Mongolia, Uyghur]
    2/9 C5d1 [Yukaghir, Tuvan, Altai-Kizhi, Stony Tunguska Evenk, Tompo Even, Kamchatka Even]
    1/9 Z3a [China, Tibet; Z3a1 is found in Yakut, China, Thailand-Laos]
    1/9 D4b2 [Japan, Russia, Buryat, Bargut, China, Uyghur, Tibet, Pamir, Thailand, Thailand-Laos, Armenian, Saudi Arabia]

    Xibe Y-DNA (Xue et al. 2006)
    1/41 = 2.4% BT-SRY10831.1(xC-M130, DE-YAP, J-12f2, K-M9) [probably Western Eurasian G-M201, European I-M170, or mainly South Asian H-L901/M2939]
    9/41 = 22.0% C-M217(xM48) [typically “Altaic,” East Asian, or aboriginal North American]
    2/41 = 4.9% C-M48 [typically Tungusic, Nivkh, Yukaghir, Chukotko-Kamchatkan, Mongolic, or Turkic]
    1/41 = 2.4% DE-YAP(xE-M40) [typically Andamanese, Tibetan, or Japanese]
    3/41 = 7.3% J-12f2 [typically Southwest Asian or Mediterranean]
    2/41 = 4.9% K-M9(xNO-M214, P-92R7)
    4/41 = 9.8% N-LLY22g(xM128, P43, Tat) [typically found in populations of western/southwestern China, such as Yi]
    1/41 = 2.4% N-M128
    2/41 = 4.9% N-M178
    3/41 = 7.3% O-M119 [typically found in populations of southeastern China, Daic peoples, and Austronesian peoples]
    1/41 = 2.4% O-M176(x47z) [typically Korean]
    2/41 = 4.9% O-M122(xM159, M7, M134) [LINE1-]
    2/41 = 4.9% O-M122(xM159, M7, M134) [LINE1+]
    5/41 = 12.2% O-M134(xM117) [possibly from assimilation of local Kazakhs]
    2/41 = 4.9% O-M117
    1/41 = 2.4% P-92R7(xR1a-SRY10831.2)

    Xibo Y-DNA (Shou et al. 2010)
    4/32 = 12.5% C-M130
    1/32 = 3.1% J-M304(xJ2-M172)
    3/32 = 9.4% K-M9(xN-M231, O-M175, P-M45)
    5/32 = 15.6% N-M231
    7/32 = 21.9% O-M175(xM119, M95, M122) [Note this major difference from the Xibe samples of Xue et al. 2006 and the HGDP. O-M175(xM119, M95, M122) most likely should belong to O1b1-F2320(xM95), which is most often found among Han Chinese (approx. 5%), or to O1b2-M176, which is most often found among Japanese and Koreans (approx. 30%).]
    4/32 = 12.5% O1a-M119
    2/32 = 6.3% O-M122(xM134)
    5/32 = 15.6% O-M134
    1/32 = 3.1% R1a1-M17

    Xibe Y-DNA (Zhong et al. 2010, 2011)
    2/61 = 3.3% D-M174
    2/61 = 3.28% C-M130(xM8, M38, M217, M347, M356, P55)
    12/61 = 19.67% C-M217(xM93, P39, M48, M407, P53.1, P62)
    6/61 = 9.84% C-P53.1
    2/61 = 3.28% C-M356 [Usually found among South Asians.]
    1/61 = 1.6% J2b2-M241
    11/61 = 18.0% N-M231
    24/61 = 39.3% O-M175
    1/61 = 1.6% R-M17

    The Xibe Y-DNA pool seems to be overall quite similar to the Manchu Y-DNA pool as one might expect according to history and linguistic phylogeny. Both populations appear to be genetically intermediate between Koreans/Northern Chinese on one side and indigenous Siberians (including their linguistic relatives) on the other. Both populations have historically recent origins in the region that is now Northeast China, so their Y-DNA seems to agree with the generally observed high correlation between geography and genetics. However, different studies of these populations have found greatly differing frequencies of certain haplogroups, most notably C-M48 and O-M176, so either the sampling in at least some of the studies must be inadequate or the real populations themselves must be inhomogeneous.
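
    One way to gauge how much of this disagreement could be sampling noise is to put a confidence interval on a frequency. The sketch below uses the Wilson score interval, a standard choice for small samples; it is an illustration added here, not a calculation from any of the cited studies. The counts passed in (2/41 for C-M48 in Xue et al. 2006, and 1/8 for a single haplogroup in the HGDP set) are taken from the lists above.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts transcribed from the Xibe lists above.
examples = {
    "C-M48, Xue et al. 2006": (2, 41),
    "single haplogroup, HGDP": (1, 8),
}

for label, (k, n) in examples.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, interval {lo:.1%} to {hi:.1%}")
```

    With samples this small the intervals are wide (the one for 2/41 spans well over ten percentage points), so moderately different point estimates between studies are not by themselves strong evidence that the populations are inhomogeneous.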

    Data regarding Xibe mtDNA are insufficient. There is not much information available regarding Manchu mtDNA, either, but what little is available suggests that they cannot be easily distinguished from other northern Chinese on the basis of mtDNA.

    Like other populations of northern China and Mongolia, both populations exhibit small amounts of Western Eurasian admixture in both male (e.g. J-12f2(xJ2-M172), J2a1h-M530, J2b2-M241, R1a1-SRY10831.2) and female lineages (e.g. T, U4a2a1). They may also contain small fractions of male-mediated Iranian- or South Asian-like admixture (e.g. C-M356 in some Xibe males, probable R2 and L in some Manchu males).

  5. HGDP01245 from the HGDP sample of Xibe does indeed belong to T1a-M70 according to Chuan-Chao Wang, Lei Shang, Hui-Yuan Yeh, and Lan-Hai Wei, “The Consistencies of Y-Chromosomal and Autosomal Continental Ancestry Varying among Haplogroups,” J Forensic Sci Med 2016;2:229-32. Therefore, T-M70 should be counted among the Western Eurasian Y-DNA haplogroups that have been observed among the Xibe; others are J2a1h-M530, J2b2-M241, J-M304(xJ2-M172), and R1a1a-M17.

    I have collected the following data regarding the Y-DNA of Northern Tungusic/Ewenic speakers in Siberia:

    Even from Eveno-Bytantaysky National district and Momsky district of Sakha Republic Y-DNA (Fedorova et al. 2013)
    10/24 = 41.7% C-M48
    10/24 = 41.7% N-Tat
    2/24 = 8.3% N-P43
    1/24 = 4.2% J-12f2
    1/24 = 4.2% R-M269

    Even from Sakkyryyr, Eveno-Bytantay region, Sakha Republic Y-DNA (Pakendorf et al. 2007, Duggan et al. 2013)
    1/25 = 4.0% C3c-M48(xM86)
    4/25 = 16.0% C3c1-M86
    1/25 = 4.0% N1b-P43
    19/25 = 76.0% N1c-Tat

    Even from Sebjan-Küöl [i.e. Sebyan-Kyuyol], Kobyaysky District, Sakha Republic Y-DNA (Pakendorf et al. 2007, Duggan et al. 2013)
    1/14 = 7.1% C-M130(xM217) [Although this individual’s Y-chromosome has been classified by the authors as C(xC3), it shares an identical 11-marker Y-STR haplotype with the Y-DNA of the C3c*-M48(xM86) Even from Sakkyryyr. Most likely, these two individuals both belong to C-M48(xM86).]
    4/14 = 28.6% C3c1-M86
    9/14 = 64.3% N1c-Tat

    Even from Tompo District, Sakha Republic Y-DNA (Pakendorf et al. 2007, Duggan et al. 2013)
    1/28 = 3.6% C3-M217(xM48)
    13/28 = 46.4% C3c1-M86
    12/28 = 42.9% N1b-P43 [“all the Tompo Evens carrying haplogroup N1b share the same STR haplotype”]
    2/28 = 7.1% N1c-Tat

    Even from Berezovka, Srednekolymsky District, Sakha Republic Y-DNA (Duggan et al. 2013)
    7/7 = 100% C3c1-M86

    Even from Magadan Oblast Y-DNA (Karafet et al. 2002, Tambets et al. 2004, Hammer et al. 2006)
    4/31 = 12.9% C-M217(xM86)
    19/31 = 61.3% C-M86
    1/31 = 3.2% I-P19
    4/31 = 12.9% N-M178
    1/31 = 3.2% Q-P36
    2/31 = 6.5% R1a-SRY10831.2

    Even from Kamchatka Y-DNA (Duggan et al. 2013)
    15/15 = 100% C3c1-M86

    Even Y-DNA total (Karafet et al. 2002/Tambets et al. 2004/Hammer et al. 2006 + Fedorova et al. 2013 + Duggan et al. 2013)
    79/144 = 54.9% C-M130 [mostly C-M86, though there are also some cases of C-M48(xM86) and C-M217(xM48)]
    44/144 = 30.6% N-Tat [mostly of the Yakut type; especially frequent among Evens in Yakutia]
    15/144 = 10.4% N-P43 [especially frequent among Evens in Tompo District of eastern Yakutia, though all cases sampled there share an identical Y-STR haplotype]
    1/144 = 0.7% I-P19
    1/144 = 0.7% J-12f2
    1/144 = 0.7% Q-P36
    2/144 = 1.4% R1a-SRY10831.2
    1/144 = 0.7% R1b-M269
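
    The pooled figures above are simple sums of the per-sample counts. A minimal Python sketch of that arithmetic follows; the counts are transcribed from the lists above, and the collapsing of subclades into broad labels (for example, counting Magadan’s N-M178 under N-Tat) is an assumption chosen to match the pooled total.

```python
from collections import Counter

# Per-sample counts, collapsed to the labels used in the pooled total.
# Grouping N-M178 with N-Tat is an assumption that matches the figures above.
even_samples = {
    "Eveno-Bytantaysky/Momsky (Fedorova 2013)": {
        "C-M130": 10, "N-Tat": 10, "N-P43": 2, "J-12f2": 1, "R1b-M269": 1},
    "Sakkyryyr": {"C-M130": 5, "N-P43": 1, "N-Tat": 19},
    "Sebjan-Kuol": {"C-M130": 5, "N-Tat": 9},
    "Tompo": {"C-M130": 14, "N-P43": 12, "N-Tat": 2},
    "Berezovka": {"C-M130": 7},
    "Magadan Oblast": {
        "C-M130": 23, "I-P19": 1, "N-Tat": 4, "Q-P36": 1, "R1a-SRY10831.2": 2},
    "Kamchatka": {"C-M130": 15},
}


def pool(samples):
    """Sum haplogroup counts across samples."""
    total = Counter()
    for counts in samples.values():
        total.update(counts)
    return total


def frequencies(counts):
    """Percentage of each haplogroup, rounded to one decimal place."""
    n = sum(counts.values())
    return {hg: round(100 * k / n, 1) for hg, k in counts.items()}


pooled = pool(even_samples)
n_total = sum(pooled.values())
for hg, k in pooled.most_common():
    print(f"{k}/{n_total} = {100 * k / n_total:.1f}% {hg}")
```

    Pooling this way weights each sampled male equally, so large local samples dominate the total; it does not correct for how representative each locality is.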

    Negidal Y-DNA (Lell et al. 2002)
    2/17 = 11.8% C-M130(xM48)
    9/17 = 52.9% C-M48
    6/17 = 35.3% N-Tat

    Evenk from Taimyr Y-DNA (Duggan et al. 2013)
    8/18 = 44.4% C3c1-M86
    7/18 = 38.9% N1b-P43
    3/18 = 16.7% R1a

    Evenk/Siberia, middle reaches of the Nizhnyaya Tunguska River according to map (Derenko et al. 2006)
    20/50 = 0.400 C-RPS4Y
    9/50 = 0.180 N1-LLY22g(xN1c1-Tat)
    8/50 = 0.160 N1c1-Tat
    7/50 = 0.140 R1a1a-M17
    3/50 = 0.060 R1-M173(xR1a1a-M17)
    2/50 = 0.040 F-M89(xG-M201, H1-M52, I-M170, J-12f2, K-M9)
    1/50 = 0.020 I-M170

    Evenk from Stony Tunguska River basin Y-DNA (Pakendorf et al. 2006, Duggan et al. 2013)
    28/40 = 70.0% C3c1-M86
    1/40 = 2.5% I-M170
    11/40 = 27.5% N1b-P43

    Yenisey Evenk Y-DNA (Lell et al. 2002)
    18/31 = 58.1% C-M48
    1/31 = 3.2% O-M119
    6/31 = 19.4% K-M9(xM119, Tat, M45) [likely N-P43]
    3/31 = 9.7% N-Tat
    3/31 = 9.7% R-M17

    Evenk from Iengra River basin Y-DNA (Pakendorf et al. 2007, Duggan et al. 2013)
    2/9 = 22.2% C3c*-M48(xM86)
    4/9 = 44.4% C3c1-M86
    2/9 = 22.2% N1c-Tat
    1/9 = 11.1% O-M175

    Evenk/Ust-Maysky, Oleneksky, and Zhigansky districts of Sakha Republic (Fedorova et al. 2013)
    15/57 = 26.3% C3b2-M48
    3/57 = 5.3% C3-M217(xC3b2-M48, C3e1a-M407)
    29/57 = 50.9% N1c1-Tat
    5/57 = 8.8% N1c2b-P43
    3/57 = 5.3% R1a1a-M198(xR1a1a1b1a1-M458)
    1/57 = 1.8% R1a1a1b1a1-M458
    1/57 = 1.8% I2a1-P37

    Siberian Evenk Y-DNA (Karafet et al. 2001, Hammer et al. 2006) [Apparently, most of these Evenk DNA samples were collected in the Nyukzha River basin.]
    13/95 = 13.7% C3-M217(xC3b2a-M86)
    52/95 = 54.7% C3b2a-M86
    16/95 = 16.8% N1c1a-M178
    2/95 = 2.1% N1c2b-P43
    5/95 = 5.3% I-P19
    2/95 = 2.1% J-12f2 [Hammer et al. 2006] or 1/95 = 1.1% F-P14(xI-P19, G2a-P15, J-p12f2, K-M9) and 1/95 = 1.1% J2-M172 [Karafet et al. 2001]
    4/95 = 4.2% Q-P36(xM3)
    1/95 = 1.1% R1a1-SRY10831.2

    Okhotsk Evenk Y-DNA (Lell et al. 2002) [“Samples from a geographically isolated group of Evenks were collected in several small settlements on the mainland Okhotsk Sea shore in the Tugur-Chumikan District of the Khabarovsk Region.”]
    10/16 = 62.5% C-M48
    6/16 = 37.5% N-Tat

    Siberian Evenk Y-DNA total (Lell et al. 2002 + Hammer et al. 2006 + Derenko et al. 2006 + Fedorova et al. 2013 + Duggan et al. 2013)
    173/316 = 54.7% C-M130 [mostly C-M86, but there are also some cases of C-M48(xM86) and C-M217(xM48) like the Evens]
    64/316 = 20.3% N-M46 [tends to be more frequent toward the east, in Evenks from Yakutia and the shores of the Sea of Okhotsk]
    40/316 = 12.7% K-M9(xM119, Tat, M45) [appears to be all N-P43; tends to be more frequent toward the west, in Evenks from Taimyr Peninsula and the basin of the Yenisei River]
    18/316 = 5.7% R1a
    8/316 = 2.5% I-P19/M170
    4/316 = 1.3% Q-P36(xM3)
    3/316 = 0.95% R1-M173(xM17)
    3/316 = 0.95% F(xG2a-P15, I-P19/M170, J-12f2, K-M9) [one of these may actually belong to J-12f2(xJ2-M172)]
    1/316 = 0.32% J2-M172
    1/316 = 0.32% O-M175 with subclade undetermined
    1/316 = 0.32% O-M119

    It appears that most Ewenic-speaking males belong to haplogroup C-M86. The TMRCA of C-M86 is currently estimated by YFull to be 3,800 [95% CI 3,100–4,600] ybp. The rest of the population seems to consist of assimilated descendants of other indigenous Siberians (perhaps Yakuts, Samoyeds, Yukaghirs, and Koryaks), non-indigenous Russians/Soviets, and perhaps a few Mongols.

