Genomic ancestry tests are not cons, part 2: the problem of ethnicity

The results to the left are from 23andMe for someone whose paternal grandparents were immigrants from southern Germany. Their mother had a father who was of English American background (his father was a Yankee American with an English surname and his mother was an immigrant from England), and grandparents who were German (Rhinelander) and French Canadian respectively on their maternal side.

Looking at the results from 23andMe, one has to wonder: why is this individual only a bit under 25% French & German, when genealogical records show places of birth that indicate they should be 75% French & German (more precisely, 62.5% German and 12.5% French)? And though their ancestry is 25% English, only 13% of their ancestry is listed as such.
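
For anyone who wants to check the genealogical arithmetic, here is a minimal sketch. It assumes the pedigree exactly as described above and that each ancestor g generations back contributes (1/2)^g of the genome in expectation. The `ancestors` list and the `expected_fractions` helper are names made up for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# (documented origin, generations back from the tested person).
# This encodes the pedigree as described in the post; it is an assumption
# that each listed ancestor is entirely of the stated origin.
ancestors = [
    ("German", 2),   # paternal grandfather
    ("German", 2),   # paternal grandmother
    ("English", 2),  # maternal grandfather (English American)
    ("German", 3),   # maternal grandmother's parent (Rhinelander)
    ("French", 3),   # maternal grandmother's parent (French Canadian)
]


def expected_fractions(ancestors):
    """Sum the expected genome share, (1/2)**generations, per origin."""
    totals = defaultdict(Fraction)
    for origin, generations in ancestors:
        totals[origin] += Fraction(1, 2 ** generations)
    return dict(totals)


expected = expected_fractions(ancestors)
for origin, share in sorted(expected.items()):
    print(f"{origin}: {float(share):.1%}")
```

These are expectations only. Because of recombination, the share actually inherited from any one grandparent or great-grandparent varies around these values.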

First, notice that nearly half of their ancestry is “Broadly Northwestern European.” Last I checked, 23andMe uses phased haplotypes to detect segments of ancestry. This is a very powerful method and is often quite good at zeroing in on people of European ancestry. But with Americans of predominant, but mixed, Northern European background, rather than giving back precise proportions you often obtain results of the form “Broadly…”, presumably because recombination has generated novel haplotypes in white Americans.

But this isn’t the whole story. Why, for example, are many of the Finnish people I know on 23andMe assigned as >90% Finnish, while a Danish friend is 40% Scandinavian?

The issue here is that to be “Finnish” and “Scandinavian” are not equivalent units in terms of population genetics. Finns are a relatively homogeneous ethnic group who seem to have undergone a recent population bottleneck. In contrast, Scandinavia encompasses several different, albeit related, ethnicities which are geographically widely distributed.

Ethnic identities are socially and historically constructed. Additionally, they are often clear and distinct. This is not always the case for population genetic classifications. On a continental scale, racial classification is trivial, and feasible with only a modest number of genetic markers. Why? Because the demographic and evolutionary history of Melanesians and West Africans, to give two concrete examples, are distinct over tens of thousands of years. Population genetic analyses which attempt to identify or differentiate these groups have a lot of raw material to work with.

To the left, you see a PCA plot of Papuans, Yoruba, and Swedes. They are clear and distinct populations. I pruned the marker set down to 750 SNPs. Now, since these were SNPs selected to be variable in human populations, they aren’t just random markers. They are biased toward being informative of population history. That being said, notice how distinct the groups are.
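
To make the idea concrete, here is a minimal sketch of a PCA like the one described above, run on simulated rather than real genotypes. Everything in it is an assumption for illustration. The three populations are generic stand-ins (`pop_A`, `pop_B`, `pop_C`), not Papuans, Yoruba and Swedes. Allele frequencies drift away from a shared ancestral frequency with an arbitrary `drift` value of 0.1, and the helper names are hypothetical. It uses only the Python standard library, with power iteration standing in for the eigendecomposition that a dedicated tool would perform.

```python
import random


def simulate_population(rng, ancestral_freqs, drift, n_individuals):
    """Draw diploid genotypes (0, 1 or 2 copies of an allele) for one population."""
    freqs = []
    for p in ancestral_freqs:
        # Gaussian approximation to drift: variance = drift * p * (1 - p).
        shifted = rng.gauss(p, (drift * p * (1 - p)) ** 0.5)
        freqs.append(min(0.99, max(0.01, shifted)))
    return [
        [(rng.random() < f) + (rng.random() < f) for f in freqs]
        for _ in range(n_individuals)
    ]


def pca_scores(genotypes, n_components=2, n_iterations=200):
    """Return the top principal components as unit-length score vectors."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    # Center every SNP on its mean across individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(n_snps)]
    centered = [[g - m for g, m in zip(row, means)] for row in genotypes]
    # Individual-by-individual Gram matrix; its eigenvectors are the
    # PC scores up to a scale factor.
    gram = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            dot = sum(x * y for x, y in zip(centered[a], centered[b]))
            gram[a][b] = gram[b][a] = dot
    start_rng = random.Random(0)
    components = []
    for component in range(n_components):
        vec = [start_rng.random() - 0.5 for _ in range(n)]
        for step in range(n_iterations):  # power iteration
            vec = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = sum(v * v for v in vec) ** 0.5
            vec = [v / norm for v in vec]
        gram_vec = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        eigenvalue = sum(v * gv for v, gv in zip(vec, gram_vec))
        components.append(vec)
        # Deflate so the next pass finds the following component.
        for a in range(n):
            for b in range(n):
                gram[a][b] -= eigenvalue * vec[a] * vec[b]
    return components


rng = random.Random(42)
n_snps = 750  # same marker count as in the post
ancestral = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
names = ("pop_A", "pop_B", "pop_C")
labels, genotypes = [], []
for name in names:
    genotypes += simulate_population(rng, ancestral, drift=0.1, n_individuals=15)
    labels += [name] * 15

pc1, pc2 = pca_scores(genotypes)
centroids = {}
for name in names:
    members = [i for i, label in enumerate(labels) if label == name]
    centroids[name] = (
        sum(pc1[i] for i in members) / len(members),
        sum(pc2[i] for i in members) / len(members),
    )
    print(f"{name}: mean PC1 = {centroids[name][0]:+.3f}, "
          f"mean PC2 = {centroids[name][1]:+.3f}")
```

With drift this large, each simulated individual should sit closest to the centroid of its own population in the PC1/PC2 plane, which is the kind of clean separation the plot shows. A real analysis would use measured genotypes and a dedicated PCA tool rather than this toy.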

The Yoruba and Swedes and Papuans are separated by 50,000 to 100,000 years of history. That history is reflected in the genetic variation. And the social construct of an ethnocultural identity is nested within that demographic history. The Yoruba people are a coherent cultural unit. Similarly, the Swedes emerged in the last 1,000 years through a fusion of tribes such as the Geats and Svear. The Papuans are a different case, as “Papuan” brackets a whole range of groups. To a great extent, one can argue that a self-conscious Papuan identity is a product of the 20th century, because of political forces (the independence of Papua New Guinea), and large-scale contact with Europeans and Austronesians. Nevertheless, when comparing extremely different groups, an artificial catchall ethnic identity such as “Papuan” is quite informative.

Using the same marker set I plotted individuals from the Yoruba and Esan ethnic groups from the southwest and south of Nigeria, respectively. It is immediately clear that you can barely differentiate the Esan from the Yoruba genetically. At least with 750 SNPs.

The Esan and Yoruba have distinct identities, but culturally they are not too distinct from each other. They even share some traditional deities. Being close neighbors there has likely been a great deal of gene flow, as the shared common ancestors are much closer in time to the present than in the cases I illustrated above.

But when I increased the marker set to ~250,000 SNPs the Yoruba and Esan were clearly distinct populations. This is not surprising. Often today we are wont to assert that ethnic identities are recent historically contingent creations. The reality is many ethnic identities were assembled out of clear and distinct preexistent elements, which had their own history, and so could be reflected in genetics.
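
Why do more markers help? Each SNP carries only a sliver of signal when two populations are closely related, but the slivers add up. The sketch below illustrates this with simulated data. It is not the PCA used for the plots. It swaps in a simpler reference-panel projection: weight each SNP by the allele-frequency difference seen in a set of labeled reference individuals, then score separate test individuals. All names and numbers are assumptions for illustration, including the tiny `drift` of 0.003 and the use of 15,000 rather than ~250,000 markers to keep pure Python fast.

```python
import random
from statistics import mean, stdev


def drifted_freqs(rng, ancestral, drift):
    """Allele frequencies after a small amount of drift from the ancestral ones."""
    return [
        min(0.99, max(0.01, rng.gauss(p, (drift * p * (1 - p)) ** 0.5)))
        for p in ancestral
    ]


def sample_genotypes(rng, freqs, n_individuals):
    """Diploid genotypes: two independent allele draws per SNP."""
    return [
        [(rng.random() < f) + (rng.random() < f) for f in freqs]
        for _ in range(n_individuals)
    ]


def separation(ref_a, ref_b, test_a, test_b, n_snps):
    """Gap between the two test groups, in within-group standard deviations,
    using only the first n_snps markers."""
    weights = []
    for j in range(n_snps):
        freq_a = sum(row[j] for row in ref_a) / (2 * len(ref_a))
        freq_b = sum(row[j] for row in ref_b) / (2 * len(ref_b))
        weights.append(freq_a - freq_b)

    def score(row):
        # zip stops at len(weights), so only the first n_snps markers count.
        return sum(w * g for w, g in zip(weights, row))

    scores_a = [score(row) for row in test_a]
    scores_b = [score(row) for row in test_b]
    pooled_sd = ((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2) ** 0.5
    return (mean(scores_a) - mean(scores_b)) / pooled_sd


rng = random.Random(7)
max_snps = 15000
ancestral = [rng.uniform(0.1, 0.9) for _ in range(max_snps)]
freqs_a = drifted_freqs(rng, ancestral, drift=0.003)
freqs_b = drifted_freqs(rng, ancestral, drift=0.003)

ref_a = sample_genotypes(rng, freqs_a, 20)
ref_b = sample_genotypes(rng, freqs_b, 20)
test_a = sample_genotypes(rng, freqs_a, 20)
test_b = sample_genotypes(rng, freqs_b, 20)

sep_few = separation(ref_a, ref_b, test_a, test_b, n_snps=750)
sep_many = separation(ref_a, ref_b, test_a, test_b, n_snps=max_snps)
print(f"separation with 750 SNPs: {sep_few:.2f}")
print(f"separation with {max_snps} SNPs: {sep_many:.2f}")
```

Under this toy model the gap should grow roughly with the square root of the number of markers, so two groups that blur together at 750 SNPs pull apart once enough markers are used.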

That being said, the closer two ethnic groups are geographically and socioculturally, the more likely the two groups are to overlap genetically (more precisely, they can be much harder to differentiate). Sometimes though genetics and culture are very different. The Basque people of northern Spain and southwest France are only mildly genetically distinct from their Romance-speaking neighbors, but they are an ethnolinguistic isolate. The cultural chasm in language is huge. But the genetic chasm is much smaller.

Scandinavia is a coherent ethnolinguistic category which encompasses various northern Germanic people who were relatively untouched by Roman cultural influences. This is in contrast to many Germanic tribes to the south, such as the Franks, who emerged in dynamic tension with the rise of the Roman Empire. The final Scandinavian conversion to Christianity, and so admission into the post-Roman European world, began about two centuries after the conversion of the pagan Saxons by Charlemagne.

Later, the two centuries of the Kalmar Union brought all the modern nations of Scandinavia under one ruler. Today, the concept of Norden, which includes non-Scandinavian Finland, expresses the cultural and social connections of the northern peoples.

And yet genetically the reality is more muddled. Looking at samples of Germans, Danes, Swedes and Norwegians, the geographic patterning is clear. Danes occupy a position between Germans on the one hand, and Norwegians and Swedes on the other. Because of Sami ancestry in many Norwegians and Sami and Finnish ancestry in many Swedes they are genetically distinct from continental Germanic peoples to the south, including Danes.

So what is a Scandinavian? A Scandinavian is a Swede, Dane, or Norwegian (or an Icelander). Scandinavians share 1,000 years of history since their integration into the European system. As a cultural category Scandinavians are clear and distinct.

But as a genetic cluster things are not so clear. First, there is the Danish connection to Germany. This is due to both history and geography. People from northern Germany are clearly genetically close to the Danes. While the Angles and Jutes were from modern Denmark, the Saxons were from northern Germany. Yet in Britain, they fused seamlessly into one people. Before the mass conversion of the continental Saxons under the Carolingians, the cultural barriers between the peoples of Jutland and Saxony must have been marginal at best.

Second, an enormous number of Swedes in particular seem to be highly admixed with Finnic peoples. Many Swedes are highly “Finn-shifted”, both due to Sami assimilation in the past few hundred years, and the long history of Finnish migration into Sweden (which dominated Finland either politically or culturally for nearly 1,000 years). But culturally, and in their ethnolinguistic identity, these people are nothing but Scandinavian at this point.

Going back to the results of the 23andMe user above, who genealogically is more than 60% German but comes back as 25% German, how to make sense of it? Anyone who has looked at German data realizes that it is very difficult to identify a ‘prototypical’ German. Germans are people who speak Germanic languages, whose ancestors emerged out of the European Bronze Age, when much of Northern European population structure was established. But being at the center of Europe means that Germans have been subject to gene flow from peoples in all directions. Also, some ethnic Germans in the eastern regions clearly descend from Slavic tribes, and more recently there were migrations of peoples such as French Huguenots.

A PCA of Danes, English, French, and Germans shows differences across the groups. But Germans overlap a great deal with the English, and a substantial minority overlap with Danes. Also, many more of the Germans are “French-shifted” than the English.

The point is that to be German is to be many things. At least in the context of Northern European peoples.

There are powerful methods of ancestry inference using more information than just genotypes, such as fineSTRUCTURE. And there are methods relying on rare variants, which allow for much more fine-grained distinctions. But all these methods suffer from the fact that one has to define populations with labels in the first place. Genetically, Germany has several closely related clusters, and all of them are arguably authentically German.

Because ethnolinguistic categories are constructions of human history and social preferences, they do not always map onto genetic differences at a fine grain. But because ethnolinguistic categories were created by humans to give intelligibility to national and cultural variation, they are incredibly powerful ways in which to communicate classification to the general public.

Some people believe that personal genomics tests are wrong and false because of discrepancies such as the one I highlight in this post. Actually, the issue is that the language we use shapes our preconceptions, and these companies are attempting to leverage categories and classes which are highly informative to give us a general sense of the patterns they are detecting. Language does not shape reality, but it shapes our perception of reality. To say someone is 25% French-German is more informative to the end-user than to say someone is 25% Generic Continental North European, even though really they are basically the same thing. And yet, if you told someone they were 25% Generic Continental North European they might be less likely to cross-reference that result with their genealogy, because the term is so expansive and vague that one does not assume ethnolinguistic precision.

Ultimately I don’t think there is a right answer on this sort of issue. My own preference is clearly to avoid national and ethnic terms to which people bring their own preconceptions. At least when possible.

17 thoughts on “Genomic ancestry tests are not cons, part 2: the problem of ethnicity”

  1. I largely agree with you obviously, but one issue you didn’t mention is simply the reference samples used. Fact is, German testers which are close to the used reference sample usually score very high German, while clearly German people from underrepresented regions score low. So companies like 23andme could greatly increase their precision by including more reference samples.
    That’s not just true for Germans, but for other people and regions too. Especially “Eastern European” and “Balkan” could be broken up into more details and I suspect there were no Czech and Slovak samples included at all or they were categorised in a strange way.

    For South Asian and other world populations more detailed results could be obtained as well. I think the categories and samples used are really most closely to what the typical American customer might find useful. That being done with minimal efforts. Probably I’m wrong, and they still do a great job in comparison to what was before or what most other companies do, but they could be better in so many ways. The clear limitations of the test are not just the result of the problems you described, but a problem of investments into getting better.
    If most customers are satisfied and quite often quote wrong or badly interpreted results with pride, the job was done sufficiently for the company, even if they could do better.
    But I’m pretty sure, sooner or later, if the business model survives and expands, much more precise analysis and results will be available on the market. And one of the key issues will be large scale sampling throughout the continents along clear genetic and ethnic boundaries.

  2. German testers which are close to the used reference sample usually score very high German, while clearly German people from underrepresented regions score low. So companies like 23andme could greatly increase their precision by including more reference samples.

    if you expand references like this lots of northeast french danes and central europeans will become german. german can not be specified only by genetics. that is the problem.

  3. The Denmark/Germany/Norway/Sweden PC1 x PC2 plot shows that all four populations seem to stratify into the same three clusters basically purely along the dimension of PC2.

    The Denmark/England/France/Germany plot seems to show the same thing — all four countries similarly stratify into the same three mostly-PC2-based levels.

    What is that PC2 dimension identifying? What are those three different levels of PC2?

  4. Are there any studies that have clustered Europeans (or other populations, e.g. South Asians) for analysis purposes, not into linguistic or national or current political map regional categories, but into the most natural clusters in the genetic data, a la Ancestry with K= a certain number of populations, and then juxtaposes those clusters against modern political and ethnic boundaries?

    For example, I seem to recall you stating that there are clear genetic clusters in Germany-Poland-France that are obscured by current political boundaries (forgive me if I am not remembering it quite right).

    It would be interesting to see what the natural genetic clusters of these regions look like compared to the artificially forged political ones.

    Also, do you really think that “Broadly . . . .” clusters at 23andMe have much to do with recombinations in North American whites as opposed to something that was already predominant in Europe? Is there any data to support that?

  5. Also, do you really think that “Broadly . . . .” clusters at 23andMe have much to do with recombinations in North American whites as opposed to something that was already predominant in Europe? Is there any data to support that?

    they have internal data! some of it is in europeans already. but i’ve seen ‘old stock’ white americans who are overwhelmingly ‘broad.’ remember these people are often mixed of german english irish etc. these are not in the reference pops.

    Are there any studies that have clustered Europeans (or other populations, e.g. South Asians) for analysis purposes, not into linguistic or national or current political map regional categories, but into the most natural clusters in the genetic data, a la Ancestry with K= a certain number of populations, and then juxtaposes those clusters against modern political and ethnic boundaries?

    look at some of the fineSTRUCTURE papers.

    https://www.biorxiv.org/content/early/2017/12/08/230797.figures-only

    What is that PC2 dimension identifying? What are those three different levels of PC2?

    don’t know what that is. might be artifactual?

  6. I would much rather be told that I am 25% Generic Continental North European than to be told I’m 25% French-German. All British people get something like 10% French-German at 23andMe. The national labels confuse people because they take the results literally and think this means they actually have a recent ancestor from France or Germany. One of the problems is that the tests are really aimed at the US market and US customers have different expectations of the tests from their counterparts in Europe.

  7. if you expand references like this lots of northeast french danes and central europeans will become german. german can not be specified only by genetics. that is the problem.

    I don’t agree, for different reasons. First you get a clearly German genetic profile in comparisons with most neighbouring populations and even more so if comparing to populations used for the other 23andme designations like Eastern European, Balkan etc. Germans with ancient Slavic ancestry are just shifted towards Poles, but still quite different.

    Secondly the reference sample is much too small, quite obviously, and what you said would suggest the team used the optimal small reference sample, which would represent the ideal German genetic profile most distinguished from the neighbouring clusters, which is rather not the case.

    What you might say about the German population is that it has identifiable subpopulations, especially North vs. South, West vs. East. But even those have just an overlap with related and German influenced populations to a significant degree, like Dutch, Danish, French, Czech and Hungarian. But I think the results could be greatly improved by expanded and improved reference samples.

    Another problem with 23andme, at times, was that they used geographic but no ethnic reference samples. Like Germans from Russia were put into Eastern European, some Ashkenazi into Polish and so on, because they had 4 grandparents from the country. They improved since then, but it’s still a problem and distorts the results, because most of the time ethnicity is more important for the genetic profile of people than geography for obvious reasons.

  8. What you might say about the German population is that it has identifiable subpopulations, especially North vs. South, West vs. East. But even those have just an overlap with related and German influenced populations to a significant degree, like Dutch, Danish, French, Czech and Hungarian. But I think the results could be greatly improved by expanded and improved reference samples.

    having a difficult time parsing this.

    i have a german sample with large N. i will post the structure at some point to show you what i’m getting at on PCA at last.

  9. Razib has tweeted a link to “Insular Celtic population structure and genomic footprints of migration.” This looks like an analysis of subpopulations for Ireland like Obs is discussing for Germans. Is the difference that one is an island?

  10. i think the british isles are best case scenario for this. also, publishing a paper which makes population-wide assessments is different than providing everyone with an individual prediction.

    that being said, i know for a fact that LivingDNA, which uses this powerful haplotype based method, has problems with germans as well. it may simply be a sampling issue, but ireland and britain being on demarcated islands really helps in curtailing continuous gene flow.

  11. Hungary, Austria, Switzerland and France would share the issues a German sample would have in this type of analysis. It probably would work in a similar or improved fashion compared to British Isles in cases where fine-scale overlap with neighbours is small or regional (Iberia, Finland, Scandinavia, Italy, Greece).

  12. It really is a sampling issue too which could be resolved. IIRC at some point 23andme had more Ashkenazi samples than from all of Central Europe and I don’t know how many times the reference from Britain was larger. I think that tells you something.

    Typically, Germans which are Western shifted, more Celto-Germanic or French-like, have much higher scores than Central, Southern or Eastern ones on 23andme imo.
    But of course, they had to make a decision on whats their priority (French-German vs. Germanic or Central European).
    Anyway, that French-German category really shifts West whats being considered German on 23andme and the German sample seems rather insufficient.

    PCAs: It depends on how you work it out, I saw some which had a good plot for Germans. And the difference between French and German was not generally weaker than the one between e.g. German and Czech/Hungarian, let alone Danish-English. But Scandinavian and British are categories on their own. You also have to consider that Czechs and Hungarians have real and significant Germanic ancestry.

    Germans from underrepresented regions, even if they are more Eastern shifted on a European scale, get quite often more British and Scandinavian as a rule. Obviously thats the common Germanic root. If its not in the German reference, the algorithm tries to put it somewhere. F.e. I have zero British ancestry, but get relatively high scores for it, while some Germans closer to Britain, even on the PCA, but well represented in the reference, get zero or low.
    Thats not just an issue with German diversity but undersampling too, even if there is a real overlap between British and Dutch-German obviously.

  13. It really is a sampling issue too which could be resolved. IIRC at some point 23andme had more Ashkenazi samples than from all of Central Europe and I don’t know how many times the reference from Britain was larger. I think that tells you something.

    there are political reasons why germany is undersampled. that’s pretty obv. but re: ashkenazi, they are a clear and distinct group. it’s really trivial to pick out ashkenazi individuals in the data to bulk up your sample set (some diff btwn litvaks and galicians).

    Obviously thats the common Germanic root.

    northern europeans have only been diversifying from the predominant post-corded ware and bell beaker populations for ~4,000 years. even including likelihood of some local substrate absorption these groups are really really really close together.

    obv some rare variant/ibd method will be able to differentiate populations on the map well assuming you have enough samples. but you still confront the fact that national boundaries/ethnicities start to break-apart at the fine grain.

  14. My grandmother and both her paternal and maternal second and third cousins have verifiable Rhineland German (Hess/Pfalz) ancestry from multiple different ancestors (the rest is Polish), but this shows up strictly as British & Irish with both 23&Me and Nat. Geo.
    None of them actually have British & Irish ancestry.
    Why is this? Bell beakers? Anglo-Saxons?

  15. @Jason: Anglo-Saxons closest modern proxies should be Danish and North-Western German.

    @Razib: I’m well aware of the Ashkenazi case, but that makes the deficit for other populations even more obvious, because for Ashkenazi you don’t need such a large sample from different places for getting appropriate results, while for more diverse populations of larger size, especially in its population history, you need even more and from different regions.

    As for the Northern European diversity, we still don’t have enough data for the time in between EBA and modern times. I’m pretty sure, please correct me if I can be proven wrong by the currently available data, that the differences were stronger before the migration period, the Germanic respectively Slavic expansions. From what I know all borderlines between Germanic and non-Germanic are significant in the European context. Where they are not, we deal with, mostly historically noted, gene flow, like in the case of French, Czech and Hungarian.

    To realise the Germanic (and Slavic) gene flow into Czech and Hungarian is better than to have a bloated up Eastern European and Balkan component. Eastern European and Balkan as geographical categories need to be checked for ethnicity. Otherwise they can be pretty meaningless categories if you know the regional diversity because of mostly historical migrations.
    Or would you put Bantu, Boers, Indians and Khoisan in one genetically relevant category, because they live in South Africa? I know the differences in Europe are minuscule in comparison, but the problem stays the same.

    Of course, in comparison to South Asia the break up of Europe is still top notch, because putting everything in South Asia in one category is, well…

    I understand it though, because 23andme did go for geographical origins rather than ethnicity or race so to say. Even if the ethnic, genetic and racial differences in a region are huge, they are still right if they say “you have ancestry from that region”, regardless of how it came there or what it really is. But for some regions, like Central Europe, it gets really tricky that way. Questions from users and their false interpretation of the results prove that. A lot of people think their family history is wrong because they get false positives because of that. Like British for Eastern Germans, Balkan for Czechs or French-German for Irish. There are overlaps, but there is still room for improvement.
    A company achieving a significant improvement on that matter should be a winner and I see no reason why it shouldn’t be possible – not saying it will be easy or I could do it 😉

    And after all that criticism: 23andme did a great job overall. No doubt about that. But what is top-notch today…

  16. I’m pretty sure, please correct me if I can be proven wrong by the currently available data, that the differences were stronger before the migration period, the Germanic respectively Slavic expansions

    the slavic expansions into se europe were significant. the german expansions into the west roman empire, except britain, were not (though perhaps the roman-german boundary did shift west a bit around the rhine).

    And after all that criticism: 23andme did a great job overall. No doubt about that. But what is top-notch today…

    i agree. i like 23andMe’s product.

  17. Finnic admixture is not that widespread in Swedes, it’s mainly found in the sparsely populated north or those with recent Finnish ancestry. The minor ‘eastern’ shift found in Swedes more generally was also found in Iron Age Swedish aDNA. Their early Germanic ancestors were not exactly like modern Danes, although they may have expanded out of Denmark.
