India vs. China, genetically diverse vs. homogeneous

About 36% of the world’s population are citizens of the People’s Republic of China and the Republic of India. Including the other nations of South Asia (Pakistan, Bangladesh, etc.), 43% of the world’s population lives in China or South Asia.

But, as David Reich mentions in Who We Are and How We Got Here, China is dominated by one ethnicity, the Han, while India is a constellation of ethnicities. And this is reflected in the genetics. The relative diversity of India stands in contrast to the homogeneity of China.

At the current time, the best research on population genetic variation within China is probably the preprint A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese. The authors used low-coverage sequencing of over 10,000 women to capture variation all across China with a very large sample. The PCA recapitulated earlier work. Genetic relatedness among the Han of China is geographically structured. The largest component of variance is north-south, but a smaller component is also east-west. The north-south element explains more than 4.5 times as much variance as the east-west.

Another dimension of the variation is that different parts of China are characterized by different levels of admixture between the Han and other groups. In Northwest China, there is gene flow from West Eurasian sources. In all likelihood, this is through proxy populations, such as Mongols, who are about 10% West Eurasian. Also, during the period between the fall of the Han Dynasty and the rise of the Sui and Tang dynasties, much of northern China was dominated by barbarian groups from the steppe, and these groups settled down and were absorbed. In Northeast China, the source of admixture is Siberian and Tungusic groups. Again, this makes geographical sense.

In contrast, in South China the gene flow is from indigenous Chinese national groups, such as the Dai. This is in keeping with the historical record, whereby South China became Han in the period between 0 and 1000 AD through migration, intermarriage, and acculturation.

I have my own small private dataset of Chinese individuals. Some with provenance. Some without. But using known populations I was able to divide China along the north to south cline: individuals from Guangdong represent the south, those from Shaanxi the north, and those from Zhejiang to Sichuan the center.

Using Punjabis as a West Eurasian outgroup I was able to plot these individuals on a PCA. On the plot, a substantial minority of the Han_N sample is shifted to the left, toward the Punjabis. This is not because they have Punjabi ancestry, but because Punjabis are reasonable proxies for West Eurasians.

More importantly, I want to compare South Asia to China. To do that I created a small dataset that merged the Han with representative South Asian groups. The first two PCs, 1 and 2, illustrate the contrast. All three Chinese groups, sampled from the north to the south, occupy a very tight cluster, while the South Asians span PC 2. The Bengalis are shifted a bit toward the Chinese, but most of the variance is due to within-South Asian genetic differences.

I ran PCA to 10 dimensions. Only at PC 10 did the Han Chinese separate along the north-south axis. Most of the earlier PCs separated out specific castes (e.g., the Patels, because of their large number in the Gujarati sample, were PC 3). Here are the eigenvalues: 53.0682, 2.5641, 2.31876, 1.97058, 1.90652, 1.88879, 1.7935, 1.69375, 1.61516, and 1.54207. The large value for PC 1 is what you’d expect: it’s a continental-scale difference. PC 2 differentiates South Asia from north to south. It’s much more modest. The other PCs get progressively smaller, but within the data, it’s clear that the continental-scale difference is the big one. The variance between north and south China is small on a South Asian scale.
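To make the scale of those eigenvalues concrete, here is a minimal Python sketch that only does arithmetic on the ten values listed above. The variable names are mine. Note the caveat in the comments: the shares are relative to these ten eigenvalues, not to total genetic variance, because the remaining eigenvalues are not reported here.

```python
# Eigenvalues exactly as listed in the post (PC 1 through PC 10).
eigenvalues = [53.0682, 2.5641, 2.31876, 1.97058, 1.90652,
               1.88879, 1.7935, 1.69375, 1.61516, 1.54207]

top10_total = sum(eigenvalues)

# Share of each PC among these ten eigenvalues only. This is NOT the
# fraction of total genetic variance, because the eigenvalues beyond
# PC 10 are not reported in the post.
shares = [ev / top10_total for ev in eigenvalues]

# How much larger the continental axis (PC 1) is than later axes.
ratio_pc1_to_pc2 = eigenvalues[0] / eigenvalues[1]
ratio_pc1_to_pc10 = eigenvalues[0] / eigenvalues[9]

for i, (ev, share) in enumerate(zip(eigenvalues, shares), start=1):
    print(f"PC{i:>2}: eigenvalue={ev:8.4f}  share_of_top10={share:.3f}")
print(f"PC1 / PC2  = {ratio_pc1_to_pc2:.1f}")
print(f"PC1 / PC10 = {ratio_pc1_to_pc10:.1f}")
```

PC 1 comes out roughly 20 times larger than PC 2 and roughly 34 times larger than PC 10, the axis on which the Han finally separate north to south.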

Pairwise Fst is more ambiguous. That’s probably because most of the South Asian samples have structure within them. Merging them into one pooled population just confuses the issue.

Using a South Asian dataset where groups are disaggregated makes a lot more sense, and you see the structure between the different groups.
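Why pooling muddies pairwise Fst can be shown with a toy calculation. The sketch below uses Hudson's Fst estimator (computed as a ratio of sums across SNPs), which is an assumption on my part for illustration and not necessarily the estimator behind the figures in this post. All allele frequencies and sample sizes are invented and deliberately exaggerated, and the case is contrived so that the pooled frequencies land exactly on the Han frequencies.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's Fst estimator, as a ratio of sums across SNPs.

    p1, p2: per-SNP allele frequencies in the two populations.
    n1, n2: number of sampled chromosomes in each population.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling noise.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        den += a * (1 - b) + b * (1 - a)
    return num / den


# Hypothetical allele frequencies at five SNPs (invented, exaggerated).
group_a = [0.10, 0.80, 0.30, 0.90, 0.20]  # one endogamous South Asian group
group_b = [0.70, 0.20, 0.90, 0.30, 0.80]  # a second, strongly drifted group
han = [0.40, 0.50, 0.60, 0.60, 0.50]      # a single Han sample

n = 200  # chromosomes sampled per group (assumed equal for simplicity)

# Pooling two equal-sized groups averages their allele frequencies.
pooled = [(a + b) / 2 for a, b in zip(group_a, group_b)]

fst_a_b = hudson_fst(group_a, group_b, n, n)
fst_pooled_han = hudson_fst(pooled, han, 2 * n, n)

print(f"Fst(group A, group B)      = {fst_a_b:.3f}")
print(f"Fst(pooled South Asia, Han) = {fst_pooled_han:.3f}")
```

In this contrived case the two subgroups are strongly differentiated from each other, yet the pooled sample looks essentially identical to the Han (the estimate is slightly negative, which the estimator allows). The structure only becomes visible once the groups are disaggregated.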

Running Treemix gives similar results. The South Asian groups exhibit a fan-shaped topology, whereas the Han cluster tightly together. Since I removed Bengalis from Treemix, adding migration edges doesn’t do anything between the two clusters, so I omitted those results.

Finally, of course I ran some admixture analysis. Using South Asians + Han Chinese, I thought K = 4 would be reasonable. The results are straightforward: the Han Chinese have very little diversity in unsupervised mode. A small South Asian-like component, which has affinities with Punjabis, is found in northern Han. This confirms other results with other methods that the northern Han have some West Eurasian gene flow. Some of the southern and central Han have an affinity with one of the South Indian clusters. I think this is artifactual, due to deep structure within Eastern Eurasian populations and affinities between those groups that the Han absorbed as they moved south.

This post doesn’t really tell us anything we didn’t already know. Rather, it’s just a review of what jumps out at anyone who works with genotype data: there is not very much genetic diversity in China and there is a great deal of genetic diversity in India. Why? This is not a question genetics can really answer directly, though it can give us clues and support certain models over others.

Anyone who has read much about Chinese history knows that the cultural ideal of meritocracy is deeply ingrained, even if it is honored in the breach quite often. Chinese civilization has been characterized by the domination of extended pedigrees (e.g., the Xianbei-Han ruling faction among the Tang), but those pedigrees never became ethno-religious castes. The exception occurred during the Yuan (Mongol) period, when Kublai Khan pursued a divide-and-rule policy. But that was a short period which had no long-term cultural consequences.

In contrast, South Asia is characterized by long-term endogamy. This is not surprising to anyone who knows anything about South Asian history. The genetic evidence suggests that modern jati barriers emerged around 2,000 years ago. Not only do South Asian groups differ a great deal in biogeographic ancestry (deep ancestry), but historical endogamy has resulted in further drift between these groups.

Addendum: It has been brought to my attention that my reference to “genetic variation” was not entirely clear. If you look at the 1000 Genomes paper you will see that the South Asian populations tend to have modestly more SNP diversity than the East Asian populations. But I am focusing on between-population genetic variation. This can be a bit confusing. For example, the Bantu-speaking populations in Africa exhibit the expected high level of variation on the individual level in their genomes…but between the groups, the differences are quite modest. This is in contrast to Khoisan groups in South Africa. Why the difference? Because the Bantu expansion is relatively recent and from a particular founder group (with some later assimilation of local populations). The Khoisan have a lot of old structure in addition to their very high genome-wide diversity (some of which is due to admixture though).
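The distinction between within-population diversity and between-population differentiation can be put in numbers with a toy example. This is only a sketch: the allele frequencies are invented, the function names are mine, and it uses the textbook decomposition of expected heterozygosity (mean within-group H_S versus combined H_T) at a single biallelic SNP with two equally weighted groups.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def within_and_between(p1, p2):
    """Return (H_S, H_T, G_ST) for two equally weighted populations.

    H_S:  mean within-population expected heterozygosity.
    H_T:  expected heterozygosity of the combined population.
    G_ST: (H_T - H_S) / H_T, the share of diversity lying between groups.
    """
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    h_t = expected_heterozygosity((p1 + p2) / 2)
    return h_s, h_t, (h_t - h_s) / h_t


# Hypothetical frequencies, invented for illustration only.
similar = within_and_between(0.45, 0.55)   # diverse genomes, little structure
diverged = within_and_between(0.10, 0.90)  # less diverse genomes, strong structure

for label, (h_s, h_t, g_st) in [("similar", similar), ("diverged", diverged)]:
    print(f"{label:>8}: H_S={h_s:.3f}  H_T={h_t:.3f}  G_ST={g_st:.3f}")
```

The first pair has high diversity inside each group but almost none between them, which is the Bantu-like pattern described above. The second pair has lower within-group diversity but most of the total diversity lies between the groups.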

In the context of India and China, there are three things to consider:

  1. Indian populations exhibit recent ancestry (within the last 2,000-10,000 years) from very different Eurasian lineages. This inflates their within-genome diversity even if you take into account an Out-of-Africa serial founder effect.
  2. Indians exhibit incomplete/variable mixing across these lineages due to geography and strong endogamy based on social status. This inflates their geographic and jati structure. To be concrete, Tamil Brahmins are genetically somewhat closer to North Indian Brahmins than to other Tamils, despite speaking the same language as other Tamils and living in the same areas (probably for around 1,000 years or more).
  3. In contrast, despite numbering over 1 billion people spread across the expanse of eastern China, the Han are only modestly differentiated. This is probably due to the lack of caste-like structure and repeated migrations between north and south across dynastic periods.

5 thoughts on “India vs. China, genetically diverse vs. homogeneous”

  1. Genetically, I am almost the same as people from Inner Mongolia (the so-called golden family). Culturally, I am identified as northern Han Chinese. My hometown is less than 200 miles from the Inner Mongolian grassland. My village is known as a Sinicized Mongolian village from the early Ming Dynasty. Incidentally, the Ming emperor Chengzu had a Mongolian mother himself.

    A lot of Mongolians actually joined the Ming army during the Ming Dynasty. Mongolians were allowed to stay in China on condition that they be Sinicized. Basically, non-Han people were not allowed to marry each other. Violators would be enslaved by the Ming government.

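    See the Great Ming Code, section 6, on marriage: “All Mongols and Semu people are permitted to marry Chinese, provided that both parties are willing; they are not permitted to marry among their own kind. Violators shall receive eighty strokes of the heavy bamboo, and both the man and the woman shall be taken by the state as slaves. Where Chinese are unwilling to marry Hui or Kipchak people, the latter may marry among their own kind and are not subject to this prohibition.”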

    Voluntary or forced genetic integration has been a tradition in northern China. Northern people (Han, Mongolian, Manchurian) all display a humble, modest personality which fits Confucian principles naturally.

  2. Thank you for the work.

    Now that we have India vs China, how does Western Eurasia vs India vs China fit in terms of genetic diversity?

  3. Not to split hairs here, but even though Han comprise 91.5% of the Chinese population, it isn’t really an apples-to-apples comparison to look at the Han next to the full range of South Asian ethnic diversity. Han may = China to a first approximation, but not past that.

    Of course, as you intimate, the major reason why China is so distinctive from South Asia is because the Han became so omnipresent. That said, given there are many different local “flavors” of genetic diversity which had their ultimate origin in Southern China (Austro-Asiatic, Austronesian, Tai, etc) it may ultimately be true that the process of ethnogenesis in East Asia was more complicated than the relatively simple three-way admixture of most South Asian groups.

  4. Interesting how the Central Han samples exhibit north-south structure on PC10. Could you provide a list of provinces the Chinese samples represent?

  5. Absolutely much more differentiation between South Asian populations (and bulk of South Asia) than between Chinese populations (and bulk of China), and lack of endogamy accelerating differentiation.

    That said, on finer points, re: precise order of PCs, how affected might this be by sampling count of total n Han Chinese vs total n South Asian? Eigenvector size should be affected by sampling. Could argue that equal numbers of individuals from each *group* is more a criterion than equal numbers of South Asian and Chinese individuals, but that is arguably circular. Hard to straightforwardly interpret eigenvector size in distance terms – e.g. PC2 = 2.56 approximates PC3 = 2.318… but surely Gujarati Patels (PC3) not as differentiated from other populations as whole Indian cline (PC2)?
