About 36% of the world’s population are citizens of the People’s Republic of China and the Republic of India. Including the other nations of South Asia (Pakistan, Bangladesh, etc.), 43% of the world’s population lives in China or South Asia.
But, as David Reich mentions in Who We Are and How We Got Here, China is dominated by one ethnicity, the Han, while India is a constellation of ethnicities. And this is reflected in the genetics. The relative diversity of India stands in contrast to the homogeneity of China.
At the current time, the best research on population genetic variation within China is probably the preprint A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese. The authors used low-coverage sequencing of over 10,000 women to get a huge sample of variation all across China. The PCA recapitulated earlier work. Genetic relatedness among the Han of China is geographically structured. The largest component of variance is north-south, but a smaller component is also east-west. The north-south element explains more than 4.5 times as much variance as the east-west element.
Another dimension of the variation is that different parts of China are characterized by different levels of admixture between the Han and other groups. In Northwest China, there is gene flow from West Eurasian sources. In all likelihood, this is through proxy populations, such as Mongols, who are about 10% West Eurasian. Also, during the period between the fall of the Han Dynasty and the rise of the Sui-Tang Dynasty, much of northern China was dominated by barbarian groups from the steppe, and these groups settled down and were absorbed. In Northeast China, the source of admixture is Siberian and Tungusic groups. Again, this makes geographical sense.
In contrast, in South China the gene flow is from indigenous Chinese national groups, such as the Dai. This is in keeping with the historical record, whereby South China became Han in the period between 0 and 1000 AD through migration, intermarriage, and acculturation.
I have my own small private dataset of Chinese individuals. Some with provenance. Some without. But using known populations I was able to divide China along the north to south cline. Individuals from Guangdong in the south, those from Shaanxi in the north, and from Zhejiang to Sichuan in the center.
Using Punjabis as a West Eurasian outgroup I was able to plot these individuals on a PCA. If you click to enlarge you will see that a substantial minority of the Han_N sample is shifted to the left of the plot. This is toward the Punjabis. This is not because they have Punjabi ancestry, but because Punjabis are reasonable proxies for West Eurasians.
More importantly, I want to compare South Asia to China. To do that I created a small dataset that merged the Han with representative South Asian groups. The first two PCs, 1 and 2, illustrate the contrast. All three Chinese groups, sampled from the north to the south, occupy a very tight cluster, while the South Asians span PC 2. The Bengalis are shifted a bit toward the Chinese, but most of the variance is due to within-South Asian genetic differences.
I ran PCA to 10 dimensions. Only at PC 10 did the Han Chinese separate along the north-south axis. Most of the earlier PCs separated out specific castes (e.g., Patels, because of their large number in the Gujarati sample, were PC 3). Here are the eigenvalues: 53.0682, 2.5641, 2.31876, 1.97058, 1.90652, 1.88879, 1.7935, 1.69375, 1.61516, and 1.54207. The large value for PC 1 is what you’d expect; it’s a continental-scale difference. PC 2 differentiates South Asia from north to south. It’s much more modest. The other PCs get progressively smaller, but within the data, it’s clear that the continental-scale difference is the big one. The variance between north and south China is small on a South Asian scale.
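To put those ten eigenvalues in perspective, here is a minimal Python sketch that compares them to one another. One caveat: the shares below are relative to the sum of these ten values only. The full eigenvalue spectrum is not given in the post, so this is not the proportion of total variance explained. The variable names are just for illustration.

```python
# Eigenvalues for PCs 1-10, as reported in the post.
eigenvalues = [53.0682, 2.5641, 2.31876, 1.97058, 1.90652,
               1.88879, 1.7935, 1.69375, 1.61516, 1.54207]

# Sum over the top 10 PCs only (NOT the total variance in the data).
top10_total = sum(eigenvalues)

# Share of each PC relative to the top-10 sum.
shares = [ev / top10_total for ev in eigenvalues]

# How much larger the continental-scale axis is than the next one.
pc1_over_pc2 = eigenvalues[0] / eigenvalues[1]

for i, (ev, share) in enumerate(zip(eigenvalues, shares), start=1):
    print(f"PC{i:>2}: eigenvalue {ev:8.4f}  share of top 10 {share:6.1%}")
print(f"PC1 / PC2 ratio: {pc1_over_pc2:.1f}")
```

PC 1 comes out roughly twenty times larger than PC 2, while PCs 2 through 10 differ from one another by much less, which is the pattern described above.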
Pairwise Fst is more ambiguous. That’s probably because most of the South Asian samples have structure within them. Merging them into one pooled population just confuses the issue.
Using a South Asian dataset where groups are disaggregated makes a lot more sense, and you see the structure between the different groups.
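For readers who want to see why pooling matters, here is a rough Python sketch of pairwise Fst using Hudson's estimator (ratio of sums across SNPs, in the form popularized by Bhatia et al.). This is an assumption on my part for illustration, not necessarily the estimator behind the figures in this post, and the allele frequencies and sample sizes are invented toy values, not real data.

```python
def hudson_fst(p1, n1, p2, n2):
    """Hudson's Fst as a ratio of sums across SNPs.

    p1, p2: per-SNP sample allele frequencies in the two populations.
    n1, n2: number of sampled chromosomes in each population.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles, one from each population, differ.
        den += a * (1 - b) + b * (1 - a)
    return num / den


# Toy allele frequencies at four SNPs (hypothetical, for illustration only).
han = [0.10, 0.80, 0.50, 0.30]
group_a = [0.60, 0.20, 0.50, 0.90]   # a drifted, endogamous group
group_b = [0.20, 0.70, 0.10, 0.40]   # a second, quite different group
n = 200                              # chromosomes sampled per group

# Pooling the two groups (equal sizes) simply averages their frequencies.
pooled = [(a + b) / 2 for a, b in zip(group_a, group_b)]

fst_a_b = hudson_fst(group_a, n, group_b, n)
fst_han_a = hudson_fst(han, n, group_a, n)
fst_han_b = hudson_fst(han, n, group_b, n)
fst_han_pooled = hudson_fst(han, n, pooled, 2 * n)

print(f"A vs B:        {fst_a_b:.3f}")
print(f"Han vs A:      {fst_han_a:.3f}")
print(f"Han vs B:      {fst_han_b:.3f}")
print(f"Han vs pooled: {fst_han_pooled:.3f}")
```

In this toy setup the pooled comparison lands between the two disaggregated ones, so a single pooled number hides the fact that the two groups sit at very different distances from the Han and from each other.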
Running Treemix gives similar results. The South Asian groups exhibit a fan-shaped topology, whereas the Han cluster tightly together. Since I removed Bengalis from the Treemix run, adding migration edges doesn’t do anything between the two clusters, so I omitted those results.
Finally, of course I ran some admixture analysis. Using South Asians + Han Chinese, I thought K = 4 would be reasonable. Even if you don’t enlarge, the results are straightforward: the Han Chinese have very little diversity in unsupervised mode. A small South Asian-like component, which has affinities with Punjabis, is found in northern Han. This confirms results from other methods showing that the northern Han have some West Eurasian gene flow. Some of the southern and central Han have an affinity with one of the South Indian clusters. I think this is artifactual, due to deep structure within Eastern Eurasian populations and affinities between those groups that the Han absorbed as they moved south.
This post doesn’t really tell us anything we didn’t already know. Rather, it’s just a review of what jumps out at anyone who works with genotype data: there is not very much genetic diversity in China and there is a great deal of genetic diversity in India. Why? These are not questions genetics can really answer directly, though it can give us clues and support certain models over others.
Anyone who has read much about Chinese history knows that the cultural ideal of meritocracy is deeply ingrained, even if it is honored in the breach quite often. Chinese civilization has been characterized by the domination of extended pedigrees (e.g., the Xianbei-Han ruling faction among the Tang), but those pedigrees never became ethno-religious castes. The exception occurred during the Yuan (Mongol) period, when Kublai Khan adopted a divide-and-rule policy. But that was a short period which had no longer-term cultural consequences.
In contrast, South Asia is characterized by long-term endogamy. This is not surprising to anyone who knows anything about South Asian history. The genetic evidence suggests that modern jati barriers emerged around 2,000 years ago. Not only do South Asian groups differ a great deal in biogeographic ancestry (deep ancestry), but historical endogamy has resulted in further drift between these groups.
Addendum: It has been brought to my attention that my reference to “genetic variation” was not entirely clear. If you look at the 1000 Genomes paper you will see that the South Asian populations tend to have modestly more SNP diversity than the East Asian populations. But I am focusing on between-population genetic variation. This can be a bit confusing. For example, the Bantu-speaking populations in Africa exhibit the requisite high level of variation on the individual level in their genomes…but between the groups, the differences are quite modest. This is in contrast to Khoisan groups in South Africa. Why the difference? Because the Bantu expansion is relatively recent and from a particular founder group (with some later assimilation of local populations). The Khoisan have a lot of old structure in addition to their very high genome-wide diversity (some of which is due to admixture though).
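The distinction between within-population diversity and between-population differentiation can be made concrete with a small sketch. The Python below uses expected heterozygosity and Nei's G_ST = (H_T - H_S) / H_T with equal population weights, which is one common way to partition diversity and is an assumption of this example. The two scenarios use invented allele frequencies and are not estimates for any real group.

```python
def expected_het(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def within_and_between(pop_freqs):
    """Return (H_S, H_T, G_ST) averaged over SNPs, weighting populations equally.

    pop_freqs: list of populations, each a list of per-SNP allele frequencies.
    """
    n_pops = len(pop_freqs)
    n_snps = len(pop_freqs[0])
    h_s = 0.0  # mean within-population heterozygosity
    h_t = 0.0  # heterozygosity of the combined (averaged) population
    for j in range(n_snps):
        freqs = [pop[j] for pop in pop_freqs]
        h_s += sum(expected_het(p) for p in freqs) / n_pops
        h_t += expected_het(sum(freqs) / n_pops)
    h_s /= n_snps
    h_t /= n_snps
    return h_s, h_t, (h_t - h_s) / h_t


# Scenario 1 (hypothetical): diverse populations that split recently,
# so their allele frequencies are still very similar.
recent_split = [[0.50, 0.40, 0.60],
                [0.50, 0.45, 0.55]]

# Scenario 2 (hypothetical): populations with old structure and drift,
# so their allele frequencies have diverged.
old_structure = [[0.90, 0.10, 0.80],
                 [0.20, 0.70, 0.30]]

hs_recent, ht_recent, gst_recent = within_and_between(recent_split)
hs_old, ht_old, gst_old = within_and_between(old_structure)

print(f"recent split:  H_S={hs_recent:.3f}  G_ST={gst_recent:.4f}")
print(f"old structure: H_S={hs_old:.3f}  G_ST={gst_old:.4f}")
```

The first scenario has high within-population diversity but almost no differentiation, while the second has lower within-population diversity and far more differentiation. The two quantities can move independently, which is the point of the Bantu and Khoisan comparison above.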
In the context of India and China, there are three things to consider:
- Indian populations exhibit recent ancestry (within the last 2,000 to 10,000 years) from very different Eurasian lineages. This inflates their within-genome diversity even if you take into account an Out-of-Africa serial founder effect.
- Indians exhibit incomplete and variable mixing across these lineages due to geography and strong endogamy based on social status. This inflates their geographic and jati structure. To be concrete, Tamil Brahmins are genetically somewhat closer to North Indian Brahmins than to other Tamils, despite speaking the same language as other Tamils and living in the same areas (probably for around 1,000 years or more).
- In contrast, despite numbering over 1 billion people spread across the expanse of eastern China, the Han are only modestly differentiated. This is probably due to the lack of caste-like structure and to repeated migrations between north and south across dynastic periods.