Indian ancestry in Southeast Asia is older than statistical genetic tests suggest

The panels above are from a new preprint, Reconstructing the human genetic history of mainland Southeast Asia: insights from genome-wide data from Thailand and Laos. It’s an OK preprint, notable mostly for the inclusion of a lot of samples from Thailand. The “southern Thai” samples are from peninsular Thailand, and there are Malays in there. The “central Thai” samples are from in and around Bangkok. The Mon seem to have been sampled from Thailand as well.

Most of the papers on mainland Southeast Asian genetics are hard to follow because in many cases there isn’t a clear relationship between language and genetics, and linguistic classification can be dodgy. E.g., is Vietnamese Austro-Asiatic? The main genetic contrast is between the old “Australo-Melanesian” substrate and the ancestry brought by the farmers from the north. But these farmers themselves come out of a southern Chinese milieu where there isn’t a distinction. The biggest difference between a lot of the “Austro-Asiatic” and “Tai-Kadai” groups is how much Australo-Melanesian (Hoabinhian) ancestry they carry (the former carry more since they arrived earlier).

But the question of “Indian ancestry” is more interesting and a bit clearer. It seems obvious that a lot of Southeast Asian groups have South Asian ancestry. For twenty years it’s been clear that the HGDP Cambodian has a West Eurasian affinity, and many of us assumed it reflected a simple shared “Ancestral South Indian” (ASI) lineage. Basically, the idea was that the people from India to the South China Sea were part of a genetic continuum before the intrusion of West Asians into South Asia and Northeast Asians into Southeast Asia. But this is wrong. The Indian ancestry clearly exhibits “Ancestral North Indian” heritage. In Cambodia itself, on the order of 5% of the men seem to carry Y haplogroup R1a1a. This is steppe-associated.

So the question is: when did this come into the region? The preprint’s figure is a little misleading, though in the text it’s clearer: the statistics indicate a major admixture ~750 years ago. The Mon in particular have lots of Indian ancestry. 20% is probably a lower-bound figure for this group. When I ran ALDER I got about 750 years for Cambodia. There is zero chance that there was a large-scale migration of Indians into Cambodia at that date. Unlike proto-Burma, Cambodia is also pretty far from mainland India.

The most plausible explanation is that these admixture dates are picking up the mixing between a Southeast Asian set of populations without much Indian ancestry, and a group of Austro-Asiatic people who had a lot of Indian ancestry from an earlier admixture.
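As a rough illustration of why a single fitted date can track the most recent mixing rather than the original arrival of Indian ancestry, here is a toy sketch. It is not ALDER's actual procedure: it only assumes that admixture LD decays as exp(-n * d), where n is generations since mixing and d is genetic distance in Morgans, and it fits one exponential to a curve built from two pulses. The 29 years per generation, the amplitudes, and the 70-generation older pulse are all illustrative assumptions, not values from the preprint.

```python
import math

YEARS_PER_GENERATION = 29  # assumed; not a figure from the preprint


def ld_curve(d, pulses):
    """Admixture LD at genetic distance d (Morgans) as a sum of decays.

    Each pulse is (amplitude, generations_ago) and contributes
    amplitude * exp(-generations_ago * d).
    """
    return sum(amp * math.exp(-gens * d) for amp, gens in pulses)


def fit_single_date(distances, values):
    """Fit log(value) = log(A) - n * d by least squares; return n in generations."""
    logs = [math.log(v) for v in values]
    mean_d = sum(distances) / len(distances)
    mean_y = sum(logs) / len(logs)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(distances, logs))
    sxx = sum((d - mean_d) ** 2 for d in distances)
    return -sxy / sxx


# Genetic distances from 0.5 cM to 20 cM, expressed in Morgans.
distances = [0.005 * i for i in range(1, 41)]

# Hypothetical scenario: a strong recent pulse plus a weaker, older one.
recent = (1.0, 26)   # ~750 years at 29 years per generation
older = (0.3, 70)    # ~2,000 years; an illustrative guess only

one_pulse = fit_single_date(distances, [ld_curve(d, [recent]) for d in distances])
two_pulse = fit_single_date(distances, [ld_curve(d, [recent, older]) for d in distances])

print(f"one pulse:  {one_pulse:.1f} generations, {one_pulse * YEARS_PER_GENERATION:.0f} years")
print(f"two pulses: {two_pulse:.1f} generations, {two_pulse * YEARS_PER_GENERATION:.0f} years")
```

With a single pulse the fit recovers the input date. With two pulses the single-date estimate falls between the two true dates, and in this setup it sits closer to the recent one, so the older event is easy to miss.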

  1. “E.g., is Vietnamese Austro-Asiatic?”

    Literally no historical linguist who works with Southeast Asian languages and isn’t a complete crackpot thinks Vietnamese is anything but Austroasiatic.

  2. The main point of contention in SE Asian language classification is whether or not Austroasiatic, Austronesian, and/or Tai-Kadai languages are related. It’s gradually growing more accepted that Tai-Kadai is either a sister clade or sub-branch of Austronesian. From what I’ve seen of the evidence, it’s rather convincing. There are some newly studied, archaic Tai-Kadai languages in China that look a lot more like Austronesian than the other languages in the family.

    Someone on Wikipedia made an attempt to associate language phyla with subclades of Y haplogroup O. Is this complete BS, or does it look plausible?

  3. What are some of the estimated ages for proto-Austroasiatic, proto-Austronesian, proto-Sino-Tibetan, proto-Tai-Kadai? I was reading up on Afro-Asiatic not too long ago and there are some proposals by apparently serious linguists suggesting it’s up to 13,000 years old, which seems crazy to me, so I’ve been wondering if there are other language families out there of comparable time depth. Proto-Indo European is only postulated to be about 5,000ish years I think.

  4. @Mick, re odd early dates, and why those are offered, here’s the situation as I see it as to why that happens (answering at length). There are basically two methods to try and date languages. 1) Lexicostatistics, which is, more or less, the decay of cognate terms over time, analysed by statistical methods to infer the most probable (maximum parsimony or likelihood) real language tree from the observed data. Cognates are taken from “human universals” lists to avoid attrition due to crossing geographic or cultural barriers. 2) “Linguistic paleontology”, which is identifying terms that are cognate across languages, eliminating the possibility that these are later loanwords or secondary spread, and then using a bundle of these to date a divergence time (all words X date from later than time Y, therefore expansion happened later than time Y).

    (You can’t do it by correlating the number of sound and grammar changes in languages, because these are interdependent and display very variable rates of change, and in theory could exhibit very low change over long timescales, or high change over short timescales. And also because this has more assumptions about the ancestral state baked in, I think.)

    Lots of people don’t like the former lexicostatistic methods (usually with some ranting that conflates them with “glottochronology”), even though they seem to me to often give intuitive results for many datasets. (This is bound up with historical linguistics’ tendency not to trust “black box” computational methods from outside linguistics, where the actual data and the methods are relatively hidden.) The latter is better received in historical linguistics, but has been hard to establish for many language families like AA, ST, AN (in part because of limited study and also due to disagreement around the real branching order, which can be crucial to assess the claim).

    The absence of either dating method opens up the possibility of these postulates of very early divergence times. Even if we find ideas of early divergence times strange, with no dating methods and no clearly agreed tree structure it can become a bit of an open season. (Despite ancient DNA, early dates can still be argued, given that the historical record shows a clear degree of independence between autosomal and uniparental genetics and languages. Even so, yes, this can seem a bit like “pushing our credibility too far” when it gets to scenarios that combine very old, widespread linguistic divergence with young genetic divergence.)

    Re datings of those language families you’ve mentioned, there are a fair number of recent lexicostatistic papers on Sino-Tibetan (the most recent, from November this year, is here) and also on Austronesian. There is some Austroasiatic work out there too.

  5. The Austronesian expansion from Taiwan began about 3000 BCE, and this expansion of one of the indigenous languages of Formosa is arguably really what defines this language family. The indigenous Formosan languages are probably all derived from the Neolithic Dapenkeng culture that abruptly appeared and quickly spread around the coast of the island around 4000 BCE to 3000 BCE (displacing prior Negrito hunter-gatherer populations), preceding the migration of one of those cultures to other islands by an interval that is within the margin of error of available dating methods. The particular archaeological culture of mainland South China from which this culture was derived is unresolved, in part because archaeological data from that time period is sparse and undeveloped.

    “the main ancestry of high-altitude Tibeto-Burman speakers originated from the ancestors of Houli/Yangshao/Longshan ancients in the middle and lower Yellow River basin, consistent with the common North-China origin of Sino-Tibetan language and dispersal pattern of millet farmers.” This conclusion is contrary to many 20th century and early 21st century proposals (putting a homeland in Northeast India or Southern China) but now probably represents conventional wisdom in the field.

    So, the Sino-Tibetan languages date to the North Chinese Neolithic Revolution, which was based on millet farming and was independent in origin from South Chinese and Southeast Asian rice farming. The earliest archaeological culture in this region is the Nanzhuangtou culture, which started around 8500 BCE, but it isn’t clear that that culture was in linguistic continuity with the cultures that gave rise to the Sino-Tibetan language family. Two major studies in 2019 favor this model but assign a more recent origin to the language family than the first Neolithic culture of the region:

    “Zhang et al. (2019) performed a computational phylogenetic analysis of 109 Sino-Tibetan languages to suggest a Sino-Tibetan homeland in northern China near the Yellow River basin. The study further suggests that there was an initial major split between the Sinitic languages and the Tibeto-Burman languages approximately 4,200 to 7,800 years ago (with an average of 5,900 years ago), associating this expansion with the Yangshao culture and/or the later Majiayao culture. Sagart et al. (2019) also performed another phylogenetic analysis based on different data and methods to arrive at the same conclusions with respect to the homeland and divergence model, but proposed an earlier root age of approximately 7,200 years ago, associating its origin with the late Cishan and early Yangshao culture.”

    The northern millet farmers and southern rice farmers started to integrate into a common culture around 3500 years ago.

    The Austroasiatic language family is probably derived from Southern China’s Neolithic rice farmers, who trace their cultural origins to the domestication of Chinese rice in an independent domestication event. The expansion from South China to Southeast Asia took place around 4,000 years ago, close in time to the spread of the Austronesians in Southeast Asia. They displaced Hoabinhian hunter-gatherer populations in Southeast Asia. Ancient DNA shows populations genetically similar to modern Austro-Asiatic populations in Vietnam, Laos, and mainland Malaysia by 2200 BCE. But the archaeological record is thin in the relevant time period from the South Chinese Neolithic revolution to 2200 BCE, so dating it is tricky. But the oldest ancient DNA may have been from close to the time that the language family arrived there, since: “The spread of japonica rice cultivation to Southeast Asia started with the migrations of the Austronesian Dapenkeng culture into Taiwan between 3500 and 2000 BC (5,500 BP to 4,000 BP). The Nanguanli site in Taiwan, dated to ca. 2800 BC, has yielded numerous carbonized remains of both rice and millet in waterlogged conditions, indicating intensive wetland rice cultivation and dryland millet cultivation. A multidisciplinary study using rice genome sequences indicate that tropical japonica rice was pushed southwards from China after a global cooling event (the 4.2k event) that occurred approximately 4,200 years ago.”

    Rice was domesticated in Southern China around 7400 BCE. But that doesn’t mean that the initial rice-domesticating culture was in linguistic continuity with the first Austro-Asiatic populations. And the time depth of the various languages of the region is muddy.

    “There are two most likely centers of domestication for rice as well as the development of the wetland agriculture technology. The first, and most likely, is in the lower Yangtze River, believed to be the homelands of the pre-Austronesians and possibly also the Kra-Dai, and associated with the Kuahuqiao, Hemudu, Majiabang, Songze, Liangzhu, and Maqiao cultures. It is characterized by pre-Austronesian features, including stilt houses, jade carving, and boat technologies. Their diet were also supplemented by acorns, water chestnuts, foxnuts, and pig domestication.

    The second is in the middle Yangtze River, believed to be the homelands of the early Hmong-Mien-speakers and associated with the Pengtoushan, Nanmuyuan, Liulinxi, Daxi, Qujialing, and Shijiahe cultures. Both of these regions were heavily populated and had regular trade contacts with each other, as well as with early Austroasiatic speakers to the west, and early Kra-Dai speakers to the south, facilitating the spread of rice cultivation throughout southern China.

    By the late Neolithic (3500 to 2500 BC), population in the rice cultivating centers had increased rapidly, centered around the Qujialing-Shijiahe culture and the Liangzhu culture. Liangzhu and Shijiahe declined abruptly in the terminal Neolithic (2500 to 2000 BC). With Shijiahe shrinking in size, and Liangzhu disappearing altogether. This is largely believed to be the result of the southward expansion of the early Sino-Tibetan Longshan culture. … This period also coincides with the southward movement of rice-farming cultures to the Lingnan and Fujian regions, as well as the southward migrations of the Austronesian, Kra-Dai, and Austroasiatic-speaking peoples to Mainland Southeast Asia and Island Southeast Asia. A genomic study also indicates that at around this time, a global cooling event (the 4.2 k event) led to tropical japonica rice being pushed southwards, as well as the evolution of temperate japonica rice that could grow in more northern latitudes.”

    Y-DNA evidence suggests that Austronesians are an outgroup to the other major Southeast Asian and East Asian language families (Sino-Tibetan, Austroasiatic, Hmong-Mien, Mon-Khmer, Thai-Kra-Dai), with people speaking those languages linked to Austronesians only at a greater time depth.

    A genetically based conclusion that the Hmong-Mien peoples are a comparatively recent offshoot of the Mon-Khmer peoples, something on which linguistic analysis has not reached consensus, is a finding of considerable importance in parsing out the prehistory of Southeast Asia, even though the hypothesis that Mon-Khmer and Hmong-Mien both belong to a macrolinguistic family (sometimes called Yangtzean) has existed for some time as one of several efforts to link the languages of South China and Southeast Asia into linguistic macrofamilies. An expanded study of Y-DNA population genetics in Southeast Asia by Chinese researchers confirms the close genetic ties of the Mon-Khmer (a.k.a. Austro-Asiatic when the Munda of India are also included) and Hmong-Mien peoples, at least in the patriline, focusing on the O3a3b-M7 Y-DNA haplogroup, where the Mon-Khmer appear at a basal position while Hmong-Mien and Tibeto-Burman individuals with this haplogroup have subhaplogroups more on the fringes of this patriline tree. O3a3c1-M117, the dominant East Asian haplogroup, shows a similar pattern. Cai X, et al. “Human Migration through Bottlenecks from Southeast Asia into East Asia during Last Glacial Maximum Revealed by Y Chromosomes.” PLoS ONE 6(8): e24282 (2011). doi:10.1371/journal.pone.0024282

  6. @ Matt,

    The paper you cited dates Sino-Tibetan to 8,000 years ago, which seems pretty old but is still well short of the typical estimates for proto-Afro-Asiatic. It would be interesting to see another group of researchers take the method from that paper, apply it to AA, and see what kind of dates they come up with. The coherence of AA as a genetic unity, from what I gather, rests on some very basic/elemental morphological features shared between its various sub-families, which are unlikely to be randomly evolved chance correspondences or due to areal contact. Any kind of actual common lexicon seems to be very poorly preserved, because the language family is supposed to be so old. The comparative method in historical linguistics is only supposed to work up to 10,000 years, or so I’ve heard. I guess I’m wondering if there’s another language family out there analogous to AA in that it’s only really able to be defined by basic morphological characteristics, but has been dated (via whatever other means) to a younger age than, say, 10,000 years.

