The Indo-European Controversy: Facts and Fallacies in Historical Linguistics, by Asya Pereltsvaig and Martin Lewis, is a pretty one-sided monograph. The reason, as admitted by the authors, is that they believe a certain sector of academia and the middle-brow reading public are not exhibiting enough skepticism about the application of Bayesian phylogenetics in linguistics. To a great extent The Indo-European Controversy is a book-length rejoinder to a paper published in 2003 to great acclaim, Language-tree divergence times support the Anatolian theory of Indo-European origin. It’s basically a short letter to Nature. When this paper came out I did not have academic access to such things, and there weren’t online resources (like Twitter) to allow one to make an end run around academic paywalls. So I remember actually going down to the local college library, getting a paper copy of that issue of Nature, and reading it just like that. In fact that may very well be the last scientific paper I read on paper. And I went in search of that paper because of an article I saw in The New York Times by Nick Wade, A Biological Dig for the Roots of Language. Pereltsvaig and Lewis correctly peg Nick Wade’s influence, in my opinion. My own passing interest in the topic was triggered by coverage in the media. That’s probably true for many people. There are other papers of note which follow in the tradition of the 2003 letter to Nature. In particular, I recommend Mapping the Origins and Expansion of the Indo-European Language Family, in Science. If you don’t know much about Bayesian phylogenetics, read the supplements.

The heart of the argument Pereltsvaig and Lewis present seems to be that some key assumptions in the model that Bayesian phylogeneticists are using to make inferences about the emergence and spread of Indo-European languages are wrong. And those incorrect assumptions lead to empirical results which are also wrong. Though it was difficult for me to follow much of the deep dive into technical linguistics (thanks for that, Asya!), some of the problems with the inferences are pretty easy to see. They note that in the supplements of the 2012 paper (the second one above) the Romani language is placed as an outgroup to the other members of the Indo-Aryan family. This seems wrong to Pereltsvaig and Lewis, and from what I know it is wrong. Linguistic consensus is that Romani dialects are related to those of Northwest India. It turns out that the genetics favors this too, as the South Asian ancestry of the Roma does seem to derive from Northwest Indian populations. We can go on with details in this vein, and the authors do, assembling a list of fallacious inferences, but what’s the root of the problem?

One of the major weaknesses brought up in The Indo-European Controversy: Facts and Fallacies in Historical Linguistics is that these Bayesian phylogenetic models utilize lexical information as data inputs. In particular, a set of a few hundred cognates. There are two elements to the objection. First, the choice of cognates might be biased, or at least bias the output. Second, vocabulary may not be the best foundation on which to generate a phylogeny of language. Rather, something like grammar may be more phylogenetically informative. The authors of the works under criticism actually state they’re trying to use grammar as an input too. But in any case, the tendency for vocabulary to be exchanged between nearby groups, irrespective of their phylogenetic origin, is presumably the reason that the Romani languages appear to have drifted far enough away from the other Indo-Aryan languages to seem like an outgroup. No matter how ingenious your method, if your input data is biased or not informative, your output is not likely to be useful. Pereltsvaig and Lewis allude to the fact that linguistics has not found its “atoms” yet. I’d state it differently: linguistics lacks its DNA sequence. Using a biological analogy, these linguistic applications of Bayesian phylogenetics are attempting to discern evolutionary history from phenotype.
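To make the borrowing problem concrete, here is a minimal sketch in Python. It is not the method of the papers under criticism (those fit Bayesian models to cognate data); it only computes a crude distance from shared cognate classes. The word lists, the class labels, and the function names are all invented for illustration.

```python
def lexical_distance(a, b):
    """1 minus the fraction of meaning slots where two lists share a cognate class."""
    if len(a) != len(b):
        raise ValueError("word lists must cover the same meanings")
    shared = sum(x == y for x, y in zip(a, b))
    return 1.0 - shared / len(a)


def borrow(target, donor, slots):
    """Return a copy of `target` with the given meaning slots replaced by the donor's words."""
    out = list(target)
    for i in slots:
        out[i] = donor[i]
    return out


# Hypothetical toy data: one cognate-class label per meaning, ten meanings.
relative = ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1", "J1"]
neighbor = ["A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2", "I2", "J2"]

# The migrant language starts out sharing 8 of 10 cognate classes with its true relative.
migrant_before = relative[:8] + ["I3", "J3"]

# After long contact it replaces half of its list with loans from the unrelated neighbor.
migrant_after = borrow(migrant_before, neighbor, slots=range(5))

for label, words in (("before contact", migrant_before), ("after contact", migrant_after)):
    print(label,
          "to relative:", round(lexical_distance(words, relative), 2),
          "to neighbor:", round(lexical_distance(words, neighbor), 2))
```

In this toy the migrant language sits at distance 0.2 from its true relative before contact, and at 0.7 afterward, while its distance to the unrelated neighbor falls from 1.0 to 0.5. Any method that takes such a list at face value would pull the migrant away from its actual family, which is the flavor of what seems to have happened with Romani.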

The second major problem with the papers coming out of the Bayesian phylogenetic tradition in linguistic history is an incorrect model assumption: that populations expand purely through diffusion-like processes. If you read the detailed methods it’s pretty clear that they’re converging on the joint posterior probability of the tree given the data as well as the geographic distribution, assuming a demic diffusion framework. The Indo-European Controversy extensively tackles the historiography of migrations, or lack thereof. Before World War II archaeologists naively traced migrations through the change in cultural forms, while after World War II the backlash became so strong that the null was always that pots, rather than people, were on the move. And when people were on the move in pre-state societies, it was envisaged in an almost mechanical fashion: individuals on the farming frontier had higher fertility, and so endogenous growth simply swamped other groups like European hunter-gatherers. The appeal of this model isn’t just ideological; it’s elegant. Historical detail and contingency aren’t relevant, and inter-group conflict can be sidestepped. It’s all about endogenous growth of a population given particular resources, until it hits a Malthusian limit in the locality.
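For readers who want to see what that mechanical picture looks like, here is a toy “wave of advance” in Python: local logistic growth up to a carrying capacity, plus a small exchange of people between neighboring cells each generation. This is only a sketch of the general demic diffusion idea, not the model used in the papers, and every name and parameter value in it is an assumption chosen for illustration.

```python
def step(density, growth_rate, capacity, migration):
    """Advance one generation: local logistic growth, then exchange with neighbors."""
    grown = [d + growth_rate * d * (1.0 - d / capacity) for d in density]
    last = len(grown) - 1
    moved = []
    for i, here in enumerate(grown):
        # Reflecting edges: a cell at the boundary treats itself as its missing neighbor.
        left = grown[i - 1] if i > 0 else here
        right = grown[i + 1] if i < last else here
        moved.append(here + migration * (left - 2.0 * here + right))
    return moved


def front_position(density, capacity):
    """Index of the farthest cell that is at least half full (-1 if there is none)."""
    front = -1
    for i, d in enumerate(density):
        if d >= capacity / 2.0:
            front = i
    return front


def simulate(cells=60, generations=40, growth_rate=0.5, capacity=1.0, migration=0.2):
    """Seed farmers in cell 0 and record where the front is after each generation."""
    density = [0.0] * cells
    density[0] = capacity
    fronts = [front_position(density, capacity)]
    for _ in range(generations):
        density = step(density, growth_rate, capacity, migration)
        fronts.append(front_position(density, capacity))
    return density, fronts


density, fronts = simulate()
print("front after 0, 20, 40 generations:", fronts[0], fronts[20], fronts[40])
```

Nothing in this toy knows about history, conflict, or contingency. The front just creeps outward at a pace set by the growth and migration parameters, and each locality fills up to its Malthusian ceiling behind it.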

Unfortunately this model is almost certainly wrong for human history. Ancient DNA has revolutionized everything, because it has shown just how punctuated demographic shifts can be. The paper Ancient DNA reveals key stages in the formation of central European mitochondrial genetic diversity highlighted this dynamic a few years back. More recently, Population genomics of Bronze Age Eurasia and Massive migration from the steppe is a source for Indo-European languages in Europe both indicate discontinuity. I want to emphasize the term discontinuity, as this is very different from gradual diffusion. Rather than a methodologically individualistic model, where higher fertility in farmsteads or at least villages gradually resulted in the transition from one group to another, a more likely scenario in my opinion is inter-group tension, conflict, and amalgamation. In some cases, near total replacement. It may not always have been violent. Rather, agriculturalists on the Malthusian margins may not have been able to withstand the shock of a new culture arriving and sequestering critical resources (an analogy I’m thinking of is the massive collapse of Roman culture in the Balkans whenever the imperial limes withdrew toward the coasts; without state support and scaffolding the way of life of the Latin peasantry just wasn’t feasible, so they quickly migrated or died off).

For example, it looks as if the Uygurs are not descended in large part from the first Indo-Europeans on the fringes of western China. I took the data the Reich lab posted, reduced the number of populations, and ran TreeMix on it. Below are 10 plots. The West Eurasian ancestry of the Uygurs is not overwhelmingly Northern European-like. Weirdly, the graphs below suggest it is somewhat less Northern European than the West Eurasian ancestry contributing to the Hazara! Though that may be an artifact of some sort. The point is that, as suggested by many scholars, it seems highly likely that the Indo-European population of the Tarim basin was a composite, and that Tocharians and Indo-Iranians were both present. And they probably did not appear at the same time.

So a second question that came to mind has to do with the origin of the Indo-Aryans, and the genetic history of the Indian subcontinent. About five years ago I told John Hawks that I was skeptical of too much European-like contribution to the Indian population because not enough European pigmentation alleles were segregating in the population. My inference was based on a wrong assumption. It turns out that the earliest steppe dwellers were not particularly pale of mien, going by their genetic architecture at pigmentation loci. My objection has no basis, because the modern European phenotype is very new, and likely post-dates the arrival of Indo-Europeans in India. Additionally, there is suggestive evidence of a steppe connection, such as the widespread presence of the “European” allele for lactase persistence in Northwest India. This allele is new, and swept up in frequency very recently. Its presence in Northwest India almost certainly indicates non-trivial demographic connections.

The blogger at Eurogenes has illustrated the dynamic, and it’s pretty obvious that Northwest Indian populations have some affinity to the Yamnaya population in particular. Below are the results from TreeMix using a narrower set of populations than above. Notice how the Pathan tend to move toward the Yamnaya...

But why the affinity to the Pathan, and not the Iranian samples? Who knows. I’ll pull down the data set from the Willerslev lab soon, but I think ancient DNA from India is going to have to answer the question. I’m curious how the “Out of India” people will spin this, because they will have a ridiculous rationale...

