Proto-Indo-European and haplogroups

One school of thought regarding the Indo-European languages holds that they exhibit a “rake-like” phylogeny. That is, they expanded rapidly and simultaneously in all directions. Aside from the connection between the Iranian and Indic branches, there’s no obvious connection among the others (satem-centum distinction aside).

In “The Indo-European homeland: introducing the problem,” Thomas Olander produces the above chart. It is not rake-like. What jumped out at me are the correspondences to Y chromosomes.

The two main branches of R1a1a are found in the Indo-Iranian and Slavic branches of the Indo-European family. The coalescence is ~5800 years ago.

There are some suggestions that Italic and Celtic form a branch. As it happens, Italy and the modern and ancient Celtic regions of Europe have very high frequencies of R1b.

What about the Tocharians? The Afanasievo people were basically Yamnaya-east. They had a lot of R1b. Today, a small minority of Uyghurs have R1b. Far more have R1a. But I think it is important to note that the Tarim basin was mostly Iranian in the southern and western regions, and Tocharian in the north and east. The prevalence of R1a may simply be a function of the nature of the sampling.


  1. Probably a stupid question, but why didn’t the Indo-Iranian branch split into more branches?
    Europe has Albanian, Balto-Slavic, Celtic, Germanic, Greek and Romance branches while the languages of people from Kurdistan to Sri Lanka fall under just one Indo-Iranian branch. Is it a feature of how we classify languages? Or are Kurdish and Sinhala really more similar compared to Welsh and Latvian?

  2. @Harry, it does seem that way (all under one branch), so the hypothesis would be either that only one subgrouping entered South Asia, or that there were levelling effects that spread more shared innovations and prevented splits, or something.

    That said, some points trying to blend your question about the total similarity of languages with the general topic (trying to kill two birds with one stone), all with the caveat that I’m not a linguist and these are my impressions of the material:

    1) On IE trees in general: There really is no consensus order of “One tree to rule them all” on higher order subgroupings.

    On either phonological and morphological data (sounds and grammar) or lexical data (words), it is often the case that in the simplest trees, when the data are forced into a tree by algorithm, the Indo-Iranian languages form a relatively basal split from most European languages (leaving aside Armenian and Greek), before those split from each other.

    Example: . This has a tree built on lexical data from a study by Chang et al. 2015 (not necessarily the best, but the least controversial because its dates are comparable with consensus trees), and one from a paper last year by David Goldstein, a primer examining different phylogenetic methods on phonological and morphological data. The branch-and-bound phylogenetic tree on phonological and morphological data and the lexical tree both produce a pretty similar outcome and structure…

    On the other hand, in the same analysis the author finds that a “strict consensus” or “majority rule consensus” on the phonological and morphological data just produces “starburst” trees, noting: “The multifurcation after the departure of the Anatolian languages reflects the fact that there is no consensus among the bootstrap trees concerning the early topology of Proto-Nuclear-Indo-European.”

    (The author does also find a tree with an Indo-Balto-Slavic clade, as the optimal tree under the maximum likelihood estimation method, but notes: “One general criticism that has been leveled at maximum likelihood methods is that they do not answer the question that historical linguists are most interested in. Likelihood assesses the probability of the data given a phylogeny and its parameters, but what historical linguists want to know is the probability of a particular phylogeny and a set of parameters given the observed data.” A tree with an Indo-Slavic clade, for instance, is definitely possible; it’s more that, to me, it doesn’t seem to emerge strongly from the data as a consensus, and the evidence of large numbers of innovations uniting them is not very strong (so *probably* not for a very long time after the IE split, if it did exist and the existing similarities are not primarily due to contact and shared archaisms).)

    2) On dating and diversity within Indo-Iranian: Where these morphological and phonological trees are cited as giving a higher-order clade structure of Indo-European (despite the lack of strong consensus), actually putting them onto a timescale still tends to place all the initial differentiation between families early in the sequence. A lot of presentations of the tree structure just emphasise the structure, because linguists don’t really like to estimate dates: linguistic morphological and phonological innovations are interdependent and happen at unpredictable rates, although obviously more innovations make it more likely that more time has elapsed.

    E.g. the tree in Razib’s OP is noted in the text as derived from Ringe’s tree (which, by the way, uses the same dataset for which Goldstein found that all the other tree structures above were possible under different tree methodologies), with modifications to incorporate some other trees and apparently to avoid Ringe’s Albanian-Germanic subclade. Now, as far as I can tell, the reprint doesn’t try to put the tree onto any kind of timescale (it just assigns a standard unit to each split). So it’s hard to evaluate the degree to which it’s actually rake-like (i.e. all the splits happen very quickly, then the branches are long and independent!).

    When we actually look at the tree as originally presented by Ringe (from which this one is derived), where he did try to put the splits into time, it still shows a relatively rapid set of splits, which is still pretty rake-like!: . The splits all still happen between 3000 and 2500 BCE, in very quick succession!

    (The Reich lab seem very fond of this Ringe tree as well for some reason. I’m not sure why they are as attached to it as they seem to be, using no other in their work as far as I know, since they must understand the limitations of phylogenetic methods, how varied these trees can be, and the lack of consensus?)

    People who use lexical methods are happier estimating dates, because they believe that change in a basic lexicon of about 200 items is more reliably clocklike than morphological and phonological change. Those studies often find pretty high lexical divergences between Indo-Iranian languages.

    The Chang tree above sort of constrains dates a bit by compressing them under directly dated ancient varieties, but there are others I’ve seen which, because they don’t do this, are allowed to show much more lexical diversity under II than under other subgroupings (the much criticized Gray and Atkinson tree, for instance). Paul Heggarty notes that there is often very little shared basic lexicon between Indo-Iranian languages, so they are quite different, although from the same subgrouping. So whether that is really because of a highly basal split or something different, there is some evidence of relatively high lexical diversity within II.

    As a final thing, there are also sprachbund effects, which make languages similar despite a lack of family resemblance, and there are quite a few of these between extant Italic and Germanic languages, so the actually existing European languages are in some ways not quite as diverse relative to II as the tree structure might imply.

    So yeah, tl;dr: overall there do seem to be more higher-order subgrouping splits in Europe than in South Asia, but existing language diversity in South Asia is probably quite high in comparison in some ways too, which can be underemphasized by some tree presentations.

    Or basically: don’t read these trees as showing a “tiny clade of un-diverse Indo-Aryan languages which are a mere tiny subsampling of the diversity of IE, compared to the huge stonking diversity of European IE”. Firstly, there’s no strong consensus on the tree and the place of European IE on it (whether they’re basal or a subclade, or whether there’s a star-like lack of structure). Secondly, even in models with a detailed phylogeny of subgroupings, it’s still pretty rake-like, with that phylogeny happening high upstream, in a short period of time. Then finally, sprachbund effects are bringing European IE closer together too, moderating the effect of some deep splits today.
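
The likelihood-versus-posterior point in the Goldstein quotation in the comment above can be made concrete with a toy calculation. This is only an illustrative sketch: the three tree labels, the likelihoods, and the priors are all made-up numbers, not values from any paper. It just shows that the tree with the highest likelihood need not be the tree with the highest posterior probability once priors enter.

```python
# Toy illustration: likelihood P(data | tree) versus posterior P(tree | data).
# Every number and tree label here is hypothetical, chosen only for illustration.

def posterior(likelihoods, priors):
    """Apply Bayes' theorem over a finite set of candidate trees."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # P(data), summed over the candidates
    return {tree: joint[tree] / evidence for tree in joint}

# Hypothetical P(data | tree) for three candidate topologies.
likelihoods = {"indo_slavic": 0.012, "rake": 0.010, "italo_celtic_first": 0.008}
# Hypothetical prior beliefs P(tree); they must sum to 1.
priors = {"indo_slavic": 0.2, "rake": 0.5, "italo_celtic_first": 0.3}

post = posterior(likelihoods, priors)
best_by_likelihood = max(likelihoods, key=likelihoods.get)
best_by_posterior = max(post, key=post.get)

print("highest likelihood:", best_by_likelihood)
print("highest posterior: ", best_by_posterior)
```

With these invented inputs the maximum likelihood tree and the maximum posterior tree differ, which is the gap the quoted criticism is pointing at. Real Bayesian phylogenetics works over a vastly larger tree space and does not enumerate candidates like this.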

  3. It has to do with date of settlement. The evidence shows Indo Europeans settled Europe earlier, which means more time to diversify into distinct subgroups. Since the ancestors of Indo Iranians didn’t move to Iran and India until later they had less time to diversify.

  4. A few things that people can agree with:
    -There was a common Indic-Iranic-Baltic-Slavic clade
    -There is an Italo-Celtic clade
    -Anatolian and Tocharian are the most divergent

  5. Harry Jecs,

    There is an increasing scholarly consensus that the Nuristani languages are neither Indo-Aryan nor Iranian, instead comprising a third branch of the Indo-Iranian language family. Thus it’s not fair to say there are just two branches of Indo-European in South Asia.

  6. @DaThang

    *Most* people agree about Anatolian and Tocharian, but definitely not the other two.

  7. Harry Jecs: “Is it a feature of how we classify languages? Or are Kurdish and Sinhala really more similar compared to Welsh and Latvian?”

    I’m basically echoing part of Matt’s tl;dr here but I suppose you can have a more recent common ancestor but end up with very different descendants depending on other factors. So maybe the assumed linguistic “common ancestor” of European IE is a good deal earlier (Steppe_EBA) compared to the assumed “common ancestor” of Indo-Iranian (Corded Ware-derived Steppe_MLBA), which would be reflected in how it might be represented in trees, but this also might underrate how diverse later and currently extant Indo-Iranian is even if it starts “splitting” later on.

    Though I’m guessing there are also plenty of other factors at play, like geography or the expansion of specific branches making certain aspects of diversity and genealogy unclear.

    Strictly speaking, I suppose we also can’t really know for sure whether certain areas were linguistically Indo-Europeanized within Europe until relatively later on, even further north where you have some areas of apparently relatively low steppe and high EEF-HG ancestry and a mix of Y-DNA. Thinking out loud, maybe we could even consider possibilities like the extant European branches (or larger areas of more related varieties at least) being sometimes relative islands of IE, where closely related varieties coalesced together, within a more non-IE landscape even later on, and only then starting to expand towards each other, which made them diverge more in some ways. Obviously we lack attestation of languages until relatively late to tell whether there were even more northern “Rhaetics” for sure either way.

    The Chang tree actually would correspond quite well to the genetic data we currently have but looking at the Goldstein paper, you could also come up with different, plausible ways to fit the current genetic data as well depending on what you think more likely. Still kinda unclear as to these intermediate specifics then.

  8. “Anatolian and Tocharian are the most divergent”

    The divergences are IMHO not primarily due to time depth. The naive model of language divergence as mutational variation accumulating over time greatly overestimates the importance of that component of language change, which is actually much slower, and ignores the central role played by language contact.

    One example of that is Icelandic, which was, until very recent times when telecommunications and air travel became available, the closest of the Germanic languages to Old Norse (which is basically proto-Germanic), mostly because it had less contact with other languages due to its isolation at the frontier.

    Another example is that phonetically, the Appalachian accent is the closest modern dialect of English to the Elizabethan English of Shakespeare, again, due to low levels of contact with other dialects of English.

    Low population sizes also reduce mutational change in both cases.

    Also, language divergence actually tends to be punctuated.

    The divergence between Old English and Middle English, for example, is largely due to the singular impact of French Norman influence on the language after the Norman Conquest of England, in the common case of language change due to emulation of elite dialects (one of the most common sources of homogenization of language in a region). Language replacement scenarios also usually involve strong substrate influences (e.g. the quirks of the South Asian dialects of English) especially for words with no superstrate language counterpart like local botany words, and also often simplifications of language structure due to mass language learner effects.

    The differences in American English from British English, in contrast, reflect another common punctuated influence, where a community of people deliberately exaggerate local dialect differences in order to create shibboleths that expose outsiders and to distinguish themselves culturally from a community that they are alienated from.

    Language contact usually has mostly lexical impact (i.e. loan words), but can also give rise to other areal and contact language features (like the sentence-closing term “lah” in Malaysian and Singaporean dialects, derived from Arabic traders), and place names (e.g. Punic place names in Britain and Ireland).

    The other key point is that in almost all of the Indo-European language family’s European ranges, hunter-gatherer languages were extinct or all but extinct, and the substrate first farmer languages shared a descent from the language family of Western Anatolian farmers (probably in two main subfamilies, one for Linear Pottery Farmers in the Danubian basin and points north, and the other for the Cardial Pottery Farmers of the Mediterranean coast). As societies lacking metal and horses, these Neolithic first farmers of Europe also had fairly low population density (even though it was 100x that of terrestrial hunter-gatherers), so due to low population density and frontier status, the amount of divergence between the first European Neolithic farmers and the struggling farmer societies a couple of thousand years later, when Indo-Europeans filled a vacuum, was probably modest.

    This shared substrate over so many Indo-European subfamilies no doubt hides the extent of Anatolian Neolithic language family substrate influence in them. But not all Indo-European language families shared this substrate.

    Tocharian is divergent because it is the purest of the descendants of Indo-European, because they had virtually no substrate or language contact, were on a frontier, and weren’t a particularly large language community despite fairly high population density, because it was geographically constrained to a handful of towns, rather than due to its great antiquity. Notably, J.P. Mallory, one of the leading Tocharian scholars, came around in about 2012 to the view that the Tarim basin civilization isn’t all that old, based upon archaeological evidence stating:

    [T]here is really no serious evidence for arable agriculture (domestic cereals) east of the Dnieper until after c 2000 BCE (see also Ryabogina & Ivanov 2011; Mallory, in press:a). This means that there is also no evidence for domestic cereals in the Asiatic steppe until the Late Bronze Age (Andronovo etc). From the perspective of the Pontic-Caspian model, the ancestors of the Indo-Iranians and Tokharians should not cross the Ural before c 2000 BCE at the very earliest. Hypotheses linking the Tokharians to earlier eastward steppe expansions associated with the Afanasievo or Okunevo cultures of the Yenisei or Altai (Mallory and Mair 2000) become very difficult if not impossible to sustain (as long as there is no evidence of arable agriculture in these cultures) as Tokharian retains elements of the Indo-European agricultural vocabulary.

    – J. P. Mallory, “Twenty-first century clouds over Indo-European homelands” (Conference Presentation in Moscow, September 12, 2012).

    Mallory made the case in a 2011 talk that R1b was a Tocharian genetic signature based upon West Eurasian Y-DNA haplogroups found in Uyghur populations that were direct successors to and brought about the fall of the Tocharians during a period of Turkic expansion. There is also R1b in Iron Age ancient DNA east of the Tarim basin from what appears to be a related West Eurasian culture. But, ancient Tarim mummy DNA from ca. 1800 BCE, analyzed in 2009 showed uniformly R1a1a Y-DNA haplogroups (citing Li, Chunxiang, et al., “Evidence that a West-East admixed population lived in the Tarim Basin as early as the early Bronze Age” BMC Biology (February 17, 2010)).

    In the case of the Anatolian languages, in contrast, strong contact with a substrate highly divergent from the Anatolian Farmer substrate of Europe is what explains their divergence.

    In the early metal ages, the Hattic language and civilization, probably derived from metal-using civilizations of the Caucasus mountains and Zagros mountains, in a language family that may have also included Hurrian and is probably most strongly related to one or more of the modern Caucasian languages, spread across Anatolia, replacing the Anatolian farmer language, and there is suggestive evidence that the Minoan language was also from the same language family (e.g. the phonetic structure of the two languages, recorded in the Minoan case by Eteocretan inscriptions and Egyptian phonetic records of Minoan incantations).

    Documentary and archaeological evidence, however, suggest that the Hittites occupied only a few towns in a sea of Hattic people ca. 1800 BCE, before their dramatic expansion, roughly contemporaneous with the appearance and expansion of the Mycenaean Greeks (the first Aegean people to speak Indo-European languages), and while there are many Anatolian languages attested, all but a couple stand in relation to Hittite as the Romance languages do to Latin, and the couple of earlier ones are not attested significantly earlier than the Hittite language. Iron use and cremation were important litmus tests of Anatolian Indo-Europeans, corroborated by documentary and archaeological evidence in the post-1800 BCE time period. I review some of that evidence here. It also isn’t clear how much of the spread of the Anatolian languages was driven by elite imitation (compare Hungarian ca. 1000 CE, which resulted in language shift without much demic impact), and how much was due to population replacement/introgression. The frequency of R1b in modern Anatolian samples suggests a significant demic component, but is complicated by periods of Hellenic-sourced migration into Anatolia.

    So, how do the Anatolian languages grow so divergent?

    Because the Hattic substrate in which they were immersed (to the extent that it influenced their choice of proper names and that Hattic remained a liturgical language in the Hittite empire centuries after it ceased to be used in daily life, like liturgical relicts of ancient Hebrew, ancient Latin, ancient Sumerian, and Coptic) was profoundly different from the Anatolian Neolithic farmer substrate in Europe or the Harappan substrate in Sanskrit (which may be shared by all Indo-Iranian languages, as BMAC was in the Harappan sphere of influence and both Harappan and BMAC languages may be derived at great time depth from the Caucasian Neolithic first farmer substrate). Copper Age/early Bronze Age Hattic culture and language (like Copper Age/early Bronze Age Harappan culture) also had more staying power and influence than stone age European Neolithic culture, because more advanced civilizations had higher population density and couldn’t just be trampled into oblivion by Indo-European successors. My suspicion is that pre-conquest acquisition of metallurgy was also important in allowing Basque and related Vasconic languages to survive Indo-European obliteration.

    In the same vein, Armenian is hard to classify because it has mixed influences from different neighboring Indo-European language families, with Greek influence competing, for example, with Indo-Iranian language influence.

    We don’t have enough Hittite, other Anatolian language speaker, and Hattic ancient DNA to confirm an appearance of steppe ancestry around 1800 BCE and its absence before then in Anatolia, but we also have no ancient or modern DNA evidence that isn’t a good fit to that hypothesis.

  9. why didn’t Indo-Iranian branch split into more branches?

    I think the Indian part is easy to answer. IA languages were concentrated in the northwestern part of the subcontinent for a while, and then moved and diverged quite gradually, but then there were periodic political and religious factors that led to convergence.

    The “Iranian” part is more remarkable. Apparently this language branch once covered a massive chunk of Asia, from the steppes to the north of the Caspian sea, east to what’s now northwestern China, south to the borders of the Indus Valley, and then west again almost to Mesopotamia. That’s a massive range to hold on to for 2-3 millennia. So the more remarkable thing is not that Kurdish and Sinhala have something in common but that Kurdish and Ossetian do (and deeply).

    Probably the only thing that can explain it is that the population densities of these places were very low right from the beginning, so there wasn’t much chance of dilution of the language (except in the near east, where the branch did split into Farsi, Median, etc.). That held until the Turkic and Mongol invasions, which basically wiped out the language everywhere other than in pockets of the Caucasus and in Greater Iran proper.

  10. R1b isn’t very frequent in Italy. It is only frequent in Northern Italy. But before Eastern Mediterranean gene flow in Roman times it probably was frequent.

    5 of 6 ancient Latin Y DNA samples so far are R1b.

    Almost all ancient Celtic Y DNA samples so far are R1b.

    So it confirms the connection. But, for Celtic, the spread of the language didn’t cause a turnover of Y DNA. Most R1b in Western Europe is pre-Celtic, but still Indo European Bell Beaker.

  11. Slight change from above: got to admit a mistake, minor to the overall point but I should note it. I’d said the paper by Goldstein in 2020 came to similar trees as the lexical ones (like the Chang, Gray+Atkinson etc. trees) based only on the phonological and morphological data used for Ringe’s tree (which is also the basis for the tree Razib references in the post, as mentioned). I was not actually correct: the trees in that paper all use the lexical data from Ringe’s group as well, which probably explains why the results are *so* similar to lexicon-only trees, with a basal position of II relative to European clades (particularly since, without any heavy weighting towards phonology and morphology, which I believe Ringe’s tree did have, lexicon dominates). A bit of a goof on my part.

    *But* I think Goldstein’s figures showing patterns in the main phonological and morphological characters, sitting alongside the trees computed on the entire set of characters (although they confused me), are useful for showing that the high-level structure depends on only a few characters (e.g. only 2 phonological characters, P2 and P3, out of a possible set of 22, link B+S and II: “P2 encodes full “satǝm” development, according to which PIE labiovelars merge with velars and “palatals” become affricates or fricatives. P3 refers to the “ruki”-retraction of *s”. Most phonological or morphological innovations on pIE in this set simply link to one subgrouping and don’t group subgroupings together.)

    As I understand it there is a further question of why B+S (or at least B) preserves some more archaisms (or “retentions”) on average than other IE, and why some of those link to II. Those don’t tell us about language shared descent structure (as unrelated could simply happen to retain features due to low rates of change). But it’s an interesting question as to why this happened. Maybe difference in patterns of contact over time.

    But one possible caution on that question (why different levels of retention): people (me in the past for sure) probably overanalogize genetic admixture to language change, and tend to think of fast language change as usually motivated by imperfect learning from drawing new speakers into the community, or by bilingualism following contact and new features. This is not necessarily the best model though; it’s also plausible that languages change in populations that do not especially admix or contact with other language speakers but where individuals want to demonstrate their difference from each other (as groups or individuals), and loss of words and such in small isolated languages, or greater innovation in rising populations. (Lots of papers by Simon Greenhill on these questions.) Languages are spoken by people who might have motivations to prefer linguistic innovation or flexibility at different times, and differences in innovation rates are not just (or necessarily even mainly) replication errors, and are perhaps hard to predict from features like contact and population size.

  12. I read (twice) Thomas Olander’s article. I guess he is Razib’s next podcast guest, and because of this I am putting on my gloves while writing this comment. I find the main quality of the article is tabling two questions which I’ve also been asking for a while: the timeline of the so-called Proto-Indo-European and of the IE languages. He did not answer his own questions, which is not strange. The whole academia for 200 years was chasing its own tail trying to make sense of this ‘Indo-Germanische’ term, renamed to ‘Indo-European’ after the balance of power changed out of Germany’s favour. And of course, from smelly ingredients it is not possible to make a decent aromatic pie. Even Wiki states ‘(P)IE’ as a fictional construct but the author took this for real. So, the output of TO’s research was predictable.

    There are quite a few humorous details. Off the top of my head: Indo-Albanian? (what is the connection btw ‘Indo’ and ‘Albanian’ which came without this name in 1043AC to Balkan), Romanian as Latin-derived language (artificial language, ‘frankensteined’ 160 years ago?), Romance languages are Latin derived languages, ‘ancient’ Greek (the term used widely after Aristotle and Roman conquering), Indo-Iranian came from Anatolia, Lithuanian (>15cAC?), Bulgarian (what is this?), etc. It seems that the author was lost in time and space; we don’t know where the beginning is, or what the causal connections and movements are. There is an overview of the ‘Anatolian’ and ‘steppe’ theories. Although ‘showstoppers’ were found for both, they are still referred to until the truthful version is found.

    What is a “rake-like” phylogeny? Is there a chronology or are there causal connections (Indo-Albanian)? Where is the logic in stating that Yamnaya (R1b) people brought the ‘Indo-European’ language to Europe if R1a Aryans brought the same language to SA? Where did this language switch between R1b and R1a although, theoretically, they spoke the same language? How likely is it that (P)IE was developed in Russian steppes covered with ice, with a sparse population and without an economy for several thousands of years?

    What about answering the following questions? What and where is the cradle of the oldest European civilisation? Who were indigenous European people? Where was the first metallurgy in the world? Where was the first network of hundreds of urban settlements? Where was the first art developed, first wheel, first cheese, first mini-skirts, etc? Were these city slickers – hunter gatherers? Where was the oldest alphabet in the world found in 80 different places? And all these findings in hundreds of places researched 2-3% so far and several thousands of years before Yamnaya with their ‘Indo-European’ language came to Europe and well-developed agriculture much before ‘Anatolian’ (i.e. Caucasian) peasants came to Europe. Can anyone compare the civilizational and technological levels of indigenous Euro people and Yamnaya incomers? Neither author nor anyone else says that these Yamnaya people conducted a genocide against indigenous people (we have Anthony’s euphemism – population replacement or TO’s ‘interaction btw migrating pastoralists and indigenous Neolithic group’).

    What is the Sanskrit, what was its predecessor? Which modern language is the most similar to Sanskrit and why? Where the similarities of family relationships (e.g. a single word for ‘husband’s brother’s wife’) in Sanskrit came from? How it is possible to link Sanskrit and languages 3-4000 years younger (e.g. English and German)? Instead of complicated, unconvincing comparisons, let’s find out where English words ‘land’, ‘cat’, ‘heart’ came from. Previously mentioned European civilisation was accompanied with (definitely) sophisticated language. The continuity theory says that this language did not change since Ice Age up to today. If someone states different, he/she must prove the discontinuity and low steppe culture (and language!) replaced much higher. When, how, timeframe? With TV and internet would be very difficult, not mention wooded, swampy Europe without roads and bridges.

    There are so many things and questions for discussion. Just another one which can be seen between the lines. There are so-called ‘extinct’ languages – Luwian, Lydian, Lycian, Old Prussian, Gothic, etc. I could add some others, e.g. Illyrian, Thracian (mother tongues of dozens of Roman Emperors, including fairly recent Justinian, Constantine, Diocletian, etc) and others. What is common for all these ‘extinct’ languages? I will stop right here, otherwise the thing will go out of control. Thanks for your attention and for reading this long comment.

  13. “it’s also plausible that languages change in populations that do not especially admix or contact with other language speakers but where individuals want to demonstrate their difference from each other (as groups or individuals), and loss of words and such in small isolated languages, or greater innovation in rising populations. (Lots of papers by Simon Greenhill on these questions.)”

    Fair qualifications.

    Certainly, “random” mutation-like change in language over time happens. The greater innovation with rising populations and word loss with falling populations are also consistent with this basic mutational model, as long as it is expressed in mutations per person-year, rather than merely in years, much as you would use mutations per person-generation in a genetic model.

    For word loss of words that are not “exotic” (e.g. technical religious terms from religions that are no longer practiced, or names for things that are no longer encountered like flora and fauna from a former homeland), however, I think you need quite small populations of speakers, probably on the order of hundreds to single digit thousands at most, for it to be significant, even on the scale of many centuries, and even less in a literate culture.

    The magnitude of the “mutation rate” in language is routinely overestimated, however, because language contact impacts are frequently ignored in estimating it, even though language contact is a very big part, maybe the main part, of the story of language change. It is a bit like the counterintuitive observation from population genetics that it takes shockingly little gene exchange between two genetic populations that have split (the stylized fact is one instance of gene exchange per generation, without regard to the size of the segregated populations) to keep them from diverging much.

    Intentional population segregation into subcultures certainly happens too (American English being one example). But I think it is also fair to say that this is generally much more rare than language contact based change, and that it is likely to affect some features (e.g. phonology) more than others (e.g. grammar).

    Also, intentional differentiation is only going to happen when the segregated populations face different environmental circumstances and pressures that drive cultural differentiation, which in turn drives the desire to do so; this usually involves some kind of migration event.

    For example, Australians who had mutual animosity with the British who exiled them for crimes and were in conditions utterly alien to those of Britain (Australia is a mix of vast hot deserts and small patches of rain forest with marsupials and lots of deadly critters), differentiated their dialect from standard British English much more rapidly than New Zealanders, who left on good terms and were trying to recreate British society in their colony in a place with much more similar conditions (like England, the areas in NZ first settled were wet, have lush vegetation and mild temperatures on a constrained island with almost no dangerous native fauna).

  14. Numinous said that Indo-Aryan languages had been concentrated in the NW corner of India while Iranian languages spread over a vast area.

    I think it is erroneous to equate the modern distribution of Indo-Aryans with the ancient one.

    If they came to India from the Steppes, they should have left a trail for sure.

    Mittani are an example of that.

    In addition, Kuzmina supported a Fedorovo origin for Indo-Aryans, and Fedorovo sites are found in a large area of Siberia, Central Asia and Xinjiang.

    And it is also probable that Iranians may have superimposed themselves on the previous Indo-Aryans, as they were the second wave out of the Andronovo culture.

  15. “For word loss of words that are not “exotic” (e.g. technical religious terms from religions that are no longer practiced, or names for things that are no longer encountered like flora and fauna from a former homeland), however, I think you need quite small populations of speakers, probably on the order of hundreds to single digit thousands at most, for it to be significant, even on the scale of many centuries, and even less in a literate culture.”

    For losing words by forgetting terms, that may be so, but I will add that it’s worth considering that sound changes may give rise to ambiguity in perception, and then you can have a shift to a secondary, minor term which lacks such ambiguity; e.g. if hund merges with hand, then perhaps you would shift to the term dog. This is not necessarily why it happened (it almost certainly is not, as I made this example up and it is probably inconsistent with history in various ways!), but the English shift from “hound” to “dog” is an example of a common term being replaced by a term of unknown origin, almost certainly not through contact, for unknown reasons, within a single language variety at a relatively recent time (by the standards of the timescales we’re talking about).

    Of course, how significant these effects are overall is another thing, but I just wanted to get back to the idea that languages are about encoding information, and so are heavily adaptive, and that changes are heavily interrelated. One sound change, driven by some social prestige reason, might then affect language as a meaning system in ways that lead to many other changes. In a way we could think of it as a mutational system where, when one mutation (which may arise through contact or from some other process, often motivated) is pushed to high frequency, this can then generate selective pressure for other mutations?

    (One of the reasons the “core lexicon tree” phylogenetics people like those trees is that, although words are affected by this, they can use a great mass of these core words which are relatively independent of each other and in theory fairly stable, which means that rates of change should in theory be easier to estimate, because they are less influenced by complicated interconnections and should even out over time…. But that is of course the same reason that more conventional historical linguists – who focus on a smaller number of definite sound and grammar changes that can in theory be put into definite sequence – are skeptical of them; the very independence means that they can be borrowed or change).

    I think balancing these concerns against the idea of language contact as heavily influential can be done, but testing the degree of it (rather than assuming or guessing it) might require more information than we often have to reconstruct linguistic histories? Though of course there have been many attempts.

    (One other example, though with the caveat that I know little about it: why did the very profound innovation of extensive tonogenesis occur in the Sino-Tibetan languages that remained “in the homeland”, by the Yellow River theory, rather than in those which departed it? Middle Chinese develops tones that Old Chinese and most Sino-Tibetan varieties lack. Why more tones in Middle Chinese than in Modern Chinese (a reversal of direction)? Why are there now more conservative tone systems, more like Middle Chinese, in languages on the southern frontier of contact than in Mandarin? Is this all contact or some other forces? Although tone is a sprachbund feature that’s a powerful example of contact *within* SE Asia, it doesn’t seem to explain much for the origin in Chinese itself.)

  16. Just a small addition…

    Linguists regularly use inappropriate and backdated terms – Balto, Slavic, Indo, Iranian – which did not exist at the time discussed in the paper, so we don’t know what they are about. Take the so-called ‘Slavic’: the term is from the 7th century AD, so what does ‘the coalescence (Indo-Iranian and Slavic) is ~5800 years ago’ mean?

    Also, they avoid using the timeline because this, better than any archaeology or genetics, uncovers the lack of logic in their constructions. Let’s see the logic about Yamnaya R1b ‘Indo-Europeans’ who allegedly brought so-called ‘IE’ language to Europe.

    They (R1b) came to Europe in 2800BC. R1a people came with the ‘Indo-European’ language (whose localized SA version would become Sanskrit) to SAsia less than 1000 years later. It was established that R1a ‘travelled’ about 800 years to SA. It means that in this time period Yamnaya nomads conducted a genocide in wooded, swampy Europe and spread their language to every corner of Europe, to people of much higher culture, who totally forgot their previous languages. They somehow and somewhere forced R1a people (who were starting to move toward SA) to speak ‘Indo-European’, and these people also forgot their previous language (which one?). R1a also made the first draft of the Rg Veda in this new IE language (and probably translated their previous mythology into the new language). And all this while constantly moving to the East?

    How likely is this scenario?
    Can someone estimate how many ‘Indo-Europeans’ came to Europe and how many indigenous people lived there at that time?

  17. Btw, if anyone is interested in a “state of the field” on computational linguistic trees, Paul Heggarty from Max Planck (a somewhat divisive figure in these discussions, but anyway) dropped a new “state of the field” and history review last month –

    It’s paywalled, but the Supplement is open and includes the whole review – , with an annex.

    Lots of discussion of where overlapping data are used, where models are based on limited amounts of data, and problems with existing cognacy databases. (Heggarty tends to be highly critical of the Chang paper upthread, because he sees its main act, compressing the differentiation of present-day languages under ancient varieties (which were somewhat artificial, in his view), as inconsistent with history).

    Major criticisms of the dataset (Dyen List) used as backbone for all these. Max Planck working on a new dataset to confront the problems. (Why these problems don’t bleed into other datasets using similar methods, e.g. Austronesian, Sino-Tibetan, etc which are very much consistent with historical assumptions is a bit mysterious.)

  18. One final IE lexicostatistical paper is this one from 2019 (as a preprint at least) – (reminded of this paper by Anthrogenica again).

    I was a bit suspicious of it, as it takes the approach of using reconstructed proto-language lexicon, then drawing its trees from there, as opposed to just coding actual attested languages and then letting everything sort itself out. *But* since they draw their lexical dataset independently of the lists that have evolved from the Dyen List that Heggarty criticises, it may be worth a look.

    It finds a relatively rake-like topology, but with a star/rake-like split between only 4 higher-order IE subgroups (not the full set of totally independent families) at 2800 BCE (which seems consistent), which then fully split at around 2400-2000 BCE depending on subfamily. The topology is not totally consistent with one of those Ringe-type topologies (where Greek-Armenian and Germanic are put with II-BS for a bit) but it seems roughly OK too. They did code the trees to merge splits that occur within short ranges of time though (<300 years), which does sort of affect results, making things more rake-ish (but their interpretation is that this makes more sense).
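    To see why merging splits that fall within a short window makes a tree more rake-like, here is a rough sketch. It is not the paper's actual procedure: the tree layout (nested dicts with an `age` in years before present), the function name, and the toy ages are all assumptions made up for illustration.

```python
def collapse_short_splits(node, min_gap=300):
    """Merge any internal node that splits less than min_gap years after
    its parent into that parent, producing a multi-way (rake-like) split."""
    new_children = []
    for child in node.get("children", []):
        child = collapse_short_splits(child, min_gap)  # work bottom-up
        if child["children"] and node["age"] - child["age"] < min_gap:
            # Too close in time to the parent split: lift the grandchildren up.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    return {"name": node["name"], "age": node["age"], "children": new_children}


# Toy tree with made-up ages; leaves are given age 0.
leaf = lambda name: {"name": name, "age": 0}
tree = {
    "name": "root", "age": 4800,
    "children": [
        {"name": "X", "age": 4650, "children": [leaf("A"), leaf("B")]},
        {"name": "Y", "age": 4300, "children": [leaf("C"), leaf("D")]},
    ],
}

collapsed = collapse_short_splits(tree)
print([c["name"] for c in collapsed["children"]])
```

    In the toy tree, node X splits only 150 years after the root, so its two branches are attached directly to the root, while Y (500 years later) survives as an intermediate node. The more short internal branches a tree has, the more this kind of rule flattens it.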

  19. @ohwilleke

    Why isn’t Greek, which seems to have a similar substratum (perhaps in addition to a more northern, common European one but some of these hypotheses seem a bit shakier anyway), that might very well be Hattic-related as you said, seen to be as diverged as Anatolian to most linguists who consider Anatolian an early splitter instead though? You even have very prominent mythological names that have often seemed pre-Greek to linguists, like Achilleus and Odysseus, since you mentioned anthroponymy too.

    I appreciate that you note the existence of the non-Hittite branches that seem to occupy a rather large area in the more southern parts of Anatolia by that time period. An alternative would be early entrance and relatively slow expansion if we suppose lower turnover compared to Europe.

    Tocharian moving very far from the homeland and so appearing less innovative makes sense (also considered for Italic and Celtic from what I understand, as “far westerners”) but the general idea doesn’t seem to put the dissolution of post-Tocharian branches that far into the future compared to the departure of Tocharian. In general I haven’t seen either linguists or archaeologists try to potentially associate Tocharian with events as early as Anatolian, which seems to be in a class of its own in that regard.


    Well, I again can’t comment on some important parts of their methods like you can, but they also produce trees and dates (check their mean dates and compare them to known archaeological horizons like Afanasievo, the dissolution of Corded Ware and Trzciniec, Middle Helladic Greece; incidentally Tocharian also splits from the remainder not long before the rest split) that make lots of sense to a layman. You can even argue for a separate eastern origin within Corded Ware and Beaker respectively for the northern groups and correlate it with the very separate Y-DNA we have so far, based on this kind of model. Post Tocharian, they present a trifurcation that would correspond to the three main plausibly IE-related horizons very well, leaving aside the problematic (as they note) Albanian.

  20. @Forgetful, yeah, it does seem reasonably plausible and may be the best bet, at least until there are further publications in the field from other datasets to compare to, ones that aim to solve the problems with the previous dataset that Heggarty describes.

    It could be that properly constructed lexical analyses can pick up some relatively subtle shared linguistic community post-nuclear pIE divergence which is not detectable in sound and grammar innovations.

    I’ve previously said that there’s a good parallelism between there not being extensive innovations which unite post-pIE subgroups (and those that do being probably contact rather than actually representing an undiverged community) and a simple star-like expansion of all the well-recognised subgroups without intermediate stages (protoX+Y)…

    But maybe some of the early sound+grammar linguistic evolution was just very slow and did not really happen until relatively late? (Garrett says some things in this direction when talking about Ancient Greek in his 1999 paper.)

  21. So-called ‘ancient’ Greek language is a joke. The term ‘Greek’ was first mentioned by Aristotle and widely used from the 2nd century BC, after the Roman conquest. So, the terms such as ‘Minoan Greece’, etc, are oxymorons. Odysseus, Achilles, Homer, Orpheus, Alexander, Trojans and many others were NOT Greeks. So-called ‘ancient’ Greek heavily borrowed from the language of indigenous people who lived in today’s Greece and Asia Minor.

    Regarding one linguistic article (re ‘Indo-European’ language) cited above, what I already mentioned applies: “from smelly ingredients it is not possible to make a decent aromatic pie”. When you put fuzzy logic into the article to make it look more scientific, the situation does not change; only the original ingredients become more prominent.

  22. @jm8, R1b:R1a in Near East today seems pretty areal. See good regional sample in Iran (938 males so quite a good sample size for coarse grained variation) :

    Armenians got more R1b-M269 (in this sample no R1a). R1b in Near East mostly M269 (not V88). Overall frequency in Iran of R1a:R1b-M269 about 14:9.5. Clinal change in ratios West:East, even within Persian speakers. AFAIK Turkey has more R1b-M269 than R1a.

    Maybe the future aDNA record will help explain more of these things…

  23. @Forgetful, you’re probably already aware of this, but a new presentation by David Anthony was uploaded onto YouTube yesterday –

    (This is from: “Power, Gender and Mobility – Features of Indo-European Society – Online Conference, March 26–27, 2021”:

    I believe it is titled: “Proto-Indo-European Kinship According to Ancient DNA from Steppe Cemeteries”. The abstract is: “Proto-Indo-European kinship according to ancient DNA from steppe cemeteries: preliminary results. While kinship and biological descent are not the same thing, they are related. New data from ancient DNA on family relationships within Eneolithic and Yamnaya cemeteries, arguably linked to archaic PIE and late PIE, suggests that family relationships within and between cemeteries changed significantly between the Eneolithic and Yamnaya periods. Closely related males were buried together in Eneolithic cemeteries, but not in Yamnaya cemeteries. Females were unrelated to males within 3 degrees in both contexts, eliminating cousin marriage as a possibility and suggesting a virilocal system with required female exogamy. Genetic diversity in maternal descent was high across both periods, but genetic diversity in paternal descent collapsed in Yamnaya males, producing a surprisingly homogeneous set of Yamnaya men who nevertheless rarely were related to each other within 1st, 2nd or 3rd degrees, but instead shared a small group of male ancestors 4-7 generations before.”)

    I’ve excerpted the PCA from it and added some annotation:

    You previously wrote: “‘Romania PreYamnaya’, which I assume refers to individuals like the ROU_BA set in the Eurogenes Global25, who are shifted towards Steppe Maykop and Lola-like/WSHG-rich populations compared to a Steppe_EBA-Balkan-like mix”, while I thought they were just regular CA Romanian EEF.

    It seems from this PCA that neither of us was exactly correct – the Glavesnesti ROU_BA samples that David put on G25 are present on the PCA and do not appear to be “Romania PreYamnaya”… *But* “Romania PreYamnaya” also appears to be a group with some steppe ancestry of some form – I’ve labelled the group I think is “Romania PreYamnaya”, but unfortunately the graphic uses these grey stars for lots of groups, so they are difficult to distinguish and I can’t be 100% sure I’m right.

    Other interesting facets:

    1) Yamnaya Russia Don seems to separate slightly from other Yamnaya (more HG shifted), and overlap with one sample from CWC_Latvia, which is presumably the very early CWC from Latvia (the comparable very early Lithuanian and Polish CWC samples are not on the plot as far as I can tell), together with a Sredni_Stog sample. That’s interesting.

    2) The Yamnaya Hungary cline actually connects to the CWC->Unetice European cline, which challenges a bit the idea that Yamnaya Hungary will be “off cline” to explain any subsequent European culture (although they may still not be the real ancestors). They don’t seem to be like the Yamnaya Bulgaria outlier at all.

    3) There’s a pretty robust set of samples for the Khvalynsk cline, and the talk discusses a very large number of samples. Models testing whether Khvalynsk+Maykop is plausible at all for Yamnaya will be very testable. DA still seems pretty confident that doesn’t work! The centre of the Khvalynsk cemetery they have samples from (Khvalynsk II) looks more HG than Yamnaya, but possibly less than the 3 samples that we’ve been using to date.

    (They’ve had to be selective in samples to avoid over populating the plot I think; no Steppe_Maykop for example. Confusingly the plot has “CHG” and “Bell-Beaker” annotations by DA, but these samples don’t seem actually projected on).

    Lots of talk in the presentation about Yamnaya kurgans on the steppe potentially being selective for a particular patrilineal group, and maybe selective within the population (so height in samples, etc. may be linked to nutrition?), but with the buried individuals not closely linked as a “nuclear/extended family”, rather linked a number of generations back. He talks about achieved status; I wonder if they were chosen representatives from a wider kinship group in some sense (just as we find individuals who tend to have a male kin relationship selected for burial in megalithic monuments, but they’re not nuclear/extended families).

    DA also references a preceding talk by Alissa Mittnik where he indicates she talked a lot about the IBD and RoH reconstructions, but I can’t find this online. (This talk is presumably the one titled: “An Archaeogenetic Perspective on Mobility and Social Dynamics among Indo-European Societies – Alissa Mittnik (Harvard University)” Abstract: An archaeogenetic perspective on mobility and social dynamics among Indo-European societies – In the last three decades, the emerging field of ancient DNA research has had an impact on the study of the human past akin to the radiocarbon ‘revolution’ of the 1950s, and similarly has challenged scientists and archaeologists to work interdisciplinarily and find a common language. Continental scale studies have given us insights into the timing of population movements, their correlation with cultural transitions and possible resulting population admixtures. With an ever-growing dataset of ancient genomes and novel analytical methods, researchers increasingly focus on higher resolution inquiries into local and regional time transects, interaction networks and social dynamics, and incorporate dense archaeological and isotope data. I will outline (archaeo-)genetic methodology and the insights it has given into the spread of ancestry related to pastoralists of the Pontic-Caspian steppe into Europe and Central and South Asia in the Eneolithic and Bronze Age, associated with the dispersal of Indo-European cultures and languages. In the second half of my talk, I will focus on a microregional study of the Lech valley in southern Germany, as well as some other case studies, to detail how the integration of DNA, stable isotopes and archaeology can inform us about the role of status, gender, familial and community structure and rules of inheritance and residence and how these might have influenced the development of Indo-European societies. ).

