Why do percentage estimates of “ancestry” vary so much?

When looking at the results in Ancestry DNA, 23andMe, and Family Tree DNA my “East Asian” percentage is:

– 19%
– 13%
– 6%

What’s going on here? In science we often make a distinction between precision and accuracy. Precision is how much your results vary when you re-run an experiment or measurement. Basically, can you reproduce your result? Accuracy refers to how close your measurement is to the true value. A measurement can be quite precise, but consistently off. Similarly, a measurement may be imprecise but bounce around the true value…so it is reasonably accurate if you take enough measurements for the errors (which are random) to cancel out.
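
To make the distinction concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the true value of 10, the bias of 3, and the noise levels are assumptions, not numbers from any DNA test. It draws repeated measurements from a "precise but off" instrument and from a "noisy but centered" one.

```python
import random
import statistics

def measure(true_value, bias, noise_sd, n, seed):
    """Return n simulated measurements: the truth plus a fixed bias plus random noise."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]

TRUE_VALUE = 10.0  # hypothetical quantity being measured

# Precise but inaccurate: tiny scatter, consistently shifted by the bias.
precise_but_off = measure(TRUE_VALUE, bias=3.0, noise_sd=0.1, n=1000, seed=1)

# Imprecise but accurate on average: large scatter, no systematic shift.
noisy_but_centered = measure(TRUE_VALUE, bias=0.0, noise_sd=3.0, n=1000, seed=2)

for name, values in [("precise but off", precise_but_off),
                     ("noisy but centered", noisy_but_centered)]:
    print(f"{name}: mean={statistics.mean(values):.2f}, "
          f"spread={statistics.stdev(values):.2f}")
```

Averaging many noisy-but-centered measurements lands close to the true value, while no amount of averaging removes the fixed bias of the first instrument.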

The values above are precise. That is, if you got re-tested on a different chip, the results aren’t going to be much different. The tests are using as input variation on 100,000 to 1 million markers, so a small proportion will give different calls than in the earlier test. But that’s not going to change the end result in most instances, even though these methods often have a stochastic element.
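
A toy simulation can illustrate why a few discordant calls barely move the result. This is a sketch under loud assumptions: two hypothetical reference populations with made-up allele frequencies, markers treated as independent, a simple least-squares estimator (not any company's actual method), and an illustrative 0.5% of calls changed on the "re-test".

```python
import random

def simulate_dosages(freq_a, freq_b, frac_b, rng):
    """Draw 0/1/2 allele counts for one person with fraction frac_b of ancestry from B."""
    dosages = []
    for pa, pb in zip(freq_a, freq_b):
        count = 0
        for _ in range(2):  # two allele copies per marker
            p = pb if rng.random() < frac_b else pa
            count += 1 if rng.random() < p else 0
        dosages.append(count)
    return dosages

def estimate_fraction_b(dosages, freq_a, freq_b):
    """Least-squares fit of dosage/2 = (1 - f) * freq_a + f * freq_b, clipped to [0, 1]."""
    num = sum((g / 2 - pa) * (pb - pa) for g, pa, pb in zip(dosages, freq_a, freq_b))
    den = sum((pb - pa) ** 2 for pa, pb in zip(freq_a, freq_b))
    return min(1.0, max(0.0, num / den))

def retest_with_call_errors(dosages, error_rate, rng):
    """Copy the calls, replacing a random error_rate share with a different dosage."""
    noisy = list(dosages)
    for i, g in enumerate(noisy):
        if rng.random() < error_rate:
            noisy[i] = rng.choice([x for x in (0, 1, 2) if x != g])
    return noisy

rng = random.Random(0)
n_markers = 20000  # far fewer than a real chip, to keep the sketch fast
freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]

person = simulate_dosages(freq_a, freq_b, 0.15, rng)
first_run = estimate_fraction_b(person, freq_a, freq_b)
second_run = estimate_fraction_b(
    retest_with_call_errors(person, 0.005, rng), freq_a, freq_b)

print(f"first test:  {first_run:.3f}")
print(f"second test: {second_run:.3f}")
```

Because the estimate pools information across thousands of markers, changing a small share of the calls shifts it only slightly.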

But what about accuracy? I am not sure that old chestnuts about accuracy apply in this case, because the percentages that these services provide are summaries and distillations of the underlying variation. The model of precision and accuracy that I learned would be more applicable to the DNA SNP array which returns calls on the variants; that is, how close are the calls of the variant to the true value (last I checked these arrays are around 99.5% accurate in terms of matching the true state).

What you see when these services pop out a percentage for a given ancestry is the outcome of a series of conscious choices that designers of these tests made keeping in mind what they wanted to get out of these tests. At a high level here’s what’s going on:

  1. You have a model of human population history and dynamics with various parameters
  2. You have data that varies that you put into that model
  3. You have results which come back with values which are the best fit of that data to the model you specified

Basically you are asking the computational framework a question, and it is returning its best answer to the question posed. To ask whether the answer is accurate or not is almost not even wrong. The frameworks vary because they are constructed by humans with different preferences and goals.

Almost, but not totally wrong. You can for example simulate populations whose histories you know, and then test the models on the data you generated. Since you already know the “truth” about the simulated data’s population structure and history, you can see how well your framework can infer what you already know from the patterns of variation in the generated data.
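
Here is a minimal sketch of that kind of check in Python. The two source populations, their allele frequencies, the independent-marker assumption, and the least-squares estimator are all hypothetical choices made for illustration; real frameworks are far more elaborate. The point is only the workflow: simulate individuals whose mixture fraction you set yourself, then see whether the fit recovers it.

```python
import random

def simulate_dosages(freq_a, freq_b, frac_b, rng):
    """Simulate one person's 0/1/2 allele counts given a known fraction of B ancestry."""
    dosages = []
    for pa, pb in zip(freq_a, freq_b):
        count = 0
        for _ in range(2):  # each allele copy comes from B with probability frac_b
            p = pb if rng.random() < frac_b else pa
            count += 1 if rng.random() < p else 0
        dosages.append(count)
    return dosages

def estimate_fraction_b(dosages, freq_a, freq_b):
    """Best least-squares fit of the two-population mixture model, clipped to [0, 1]."""
    num = sum((g / 2 - pa) * (pb - pa) for g, pa, pb in zip(dosages, freq_a, freq_b))
    den = sum((pb - pa) ** 2 for pa, pb in zip(freq_a, freq_b))
    return min(1.0, max(0.0, num / den))

rng = random.Random(42)
n_markers = 20000
freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]

# The "truth" is known because we chose it; compare it with what the model infers.
results = {}
for truth in (0.0, 0.1, 0.25, 0.5):
    person = simulate_dosages(freq_a, freq_b, truth, rng)
    results[truth] = estimate_fraction_b(person, freq_a, freq_b)
    print(f"true fraction {truth:.2f} -> estimated {results[truth]:.3f}")
```

In this idealized setup the estimator is handed the exact source populations, so it should land near the truth; the interesting failures appear when the model's assumptions and the simulated history diverge.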

Going back to my results, why do my East Asian percentages vary so much? The short answer is that one of the major variables in the model alluded to above is the nature of the reference population set and the labels you give them.

Looking at Bengalis, the ethnic group I’m from, it is clear that in comparison to other South Asian populations they are East Asian shifted. That is, it seems clear I do have some East Asian ancestry. But how much?

The “simple” answer is to model my ancestry as a mix of two populations, an Indian one and an East Asian one, and then see what the values are for my ancestry across the two components. But here is where semantics becomes important: what is Indian and East Asian? Remember, these are just labels we give to groups of people who share genetic affinities. The labels aren’t “real”, the reality is in the raw read of the sequence. But humans are not capable of really getting anything from millions of raw SNPs assigned to individuals. We have to summarize and re-digest the data.

The simplest explanation for what’s going on here is that the different companies have different populations put into the boxes which are “Indian/South Asian” and “East Asian.” If you are using fundamentally different measuring sticks, then there are going to be problems with doing apples to apples comparisons.

My personal experience is that 23andMe tends to give very high percentages of South Asian ancestry for all South Asians. Because “South Asian” is a very diverse category, when tests come back saying that someone is 95-99% South Asian…it’s not really telling you much. In contrast, some of the other services may be using a small subset of South Asians, who they define as “more typical”, and so giving lower percentages to people from Pakistan and Bengal, who have admixture from neighboring regions to the west and east respectively.*

Something similar can occur with East Asian ancestry. If the “donor” ancestral groups for me are South Asian and East Asian, then the proportion of each is going to vary with how close the donor groups selected by the company are to the true ancestral groups. If, for example, Family Tree DNA chose a more Northeastern Asian population than Ancestry DNA, then my East Asian percentage would vary between the two services, because I know my East Asian ancestry is more Southeast Asian.
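
The effect of the donor choice can be sketched with a small, noise-free Python example. All of the allele frequencies below are invented, the population names are only labels for the illustration, and the least-squares fit is a stand-in for whatever each company actually runs. The same person is fit twice, once with an "East Asian" reference that matches the true source and once with a different proxy.

```python
def fit_fraction(person, ref_south_asian, ref_east_asian):
    """Least-squares East Asian fraction for a two-reference mixture, clipped to [0, 1]."""
    num = sum((x - s) * (e - s)
              for x, s, e in zip(person, ref_south_asian, ref_east_asian))
    den = sum((e - s) ** 2 for s, e in zip(ref_south_asian, ref_east_asian))
    return min(1.0, max(0.0, num / den))

# Hypothetical allele frequencies at five markers (made up for this sketch).
south_asian = [0.10, 0.80, 0.30, 0.60, 0.50]
true_source = [0.70, 0.20, 0.90, 0.10, 0.50]  # the population that actually contributed
other_proxy = [0.90, 0.10, 0.95, 0.05, 0.30]  # a related but different reference panel

# Build a person whose expected allele frequencies are 15% true_source, 85% south_asian.
true_fraction = 0.15
person = [s + true_fraction * (e - s) for s, e in zip(south_asian, true_source)]

est_matching = fit_fraction(person, south_asian, true_source)
est_mismatched = fit_fraction(person, south_asian, other_proxy)

print(f"reference matches true source: {est_matching:.3f}")
print(f"reference is a different proxy: {est_mismatched:.3f}")
```

The data are identical in both fits; only the contents of the "East Asian" box changed, and the reported percentage changed with it.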

The moral of the story is that the values you obtain are conditional on the choices you make, and those choices emerge from the process of reducing and distilling the raw genetic variation into a form that is human-interpretable. If the companies decided to use the same model, they would come out with the same results.

* I helped develop an earlier version of MyOrigins, and so can attest to this firsthand.

11 thoughts on “Why do percentage estimates of “ancestry” vary so much?”

  1. Razib, do you happen to know what sort of reference populations 23andMe might have for Manchus and Mongolic peoples, if any?

    I have a real-life reason for asking. My wife has been assigned 9% ancestry of something not Han, but what this 9% has been assigned to is not really credible: although 3 of her grandparents died during ‘Liberation’ and she never knew them, there would certainly be extended family knowledge of such a grandparent, and they are all totally mystified by it. There is, however, a certain amount of muttering in dark corners about some ‘Xiongnu’ ancestry in the family, which obviously makes no sense in historical terms, but something could be getting lost in my daughter’s translation from Shandong dialect to English. Something Mongolic would at least fit with the family genealogical whispers.

  2. Razib, I would like to know exactly which ‘reference population’ is used for the Iberian component, either in 23andMe or in FTDNA. Thanks.

  3. There’s also the issue of “how far you want to go back in time.”

    Ancestry DNA uses shorter DNA sequences to compute its estimates than 23andMe does. Shorter sequences reflect an earlier time frame than longer sequences.

    Ancestry reflects roughly 1000 or more years ago; 23andMe aims to show about 500 years ago. As you can see, both can be correct; it just depends on what time period you want to look at. I am not sure about FTDNA.

    Razib, have you run your raw data through GEDmatch, and looked at your ancestry results?

  4. @John Massey

    Chinese Shandong seems to be quite diverse. Currently there is nothing more controversial than the Fudan study on the possible yHg of Confucius, or rather of the officially registered descendants verified throughout various dynasties.

    A Google translate of the paper
    https://translate.google.com.au/translate?hl=en&sl=zh-CN&u=http://www.ivpp.cas.cn/cbw/rlxxb/xbwzxz/201604/P020160427538687416920.pdf&prev=search

    Tab. 2: Frequencies of the Y chromosome haplogroups of Surname Kong in Qufu
    Haplogroup     C3     Q1a1   O3     O1    R     N     O2    C3c   G     J     residual
    Frequency (%)  46.06  27.01  20.66  1.25  1.16  0.98  0.89  0.54  0.09  0.09  1.34

    Big surprise, the main possible yHg groups are C3 and Q1a1; the O group came in third. Speaking of Manchurian candidate …

    In Chinese mythology, the Xiongnu were supposed to have originated from the remnant of the Shang Royal House when they were defeated by the Zhou. And Confucius was supposed to be descended from the Shang. Alternatively, he could be a descendant of the sinicized Bei Di from 3500 BCE. On the other hand your wife could be a descendant of the sinicized Tungusic Xianbei people:

    https://en.wikipedia.org/wiki/Change_of_Xianbei_names_to_Han_names

    dux.ie

  5. Thanks for showing a direct comparison of DTC results. This is rare, other than dum-dum “articles” like this http://www.insideedition.com/investigative/21784-how-reliable-are-home-dna-ancestry-tests-investigation-uses-triplets-to-find-out which, for some reason, require triplets or twins in order to submit to multiple samples, and imply that the precision isn’t great within company, but do not show individual results (or whether, for instance, they used a different confidence estimate of 23andme).
    Could you edit the order of percentages so they match the order listed in the sentence above? On reading “When looking at the results in Ancestry DNA, 23andMe, and Family Tree DNA my “East Asian” percentage is:

    – 19%
    – 13%
    – 6%”
    I assumed that 23andMe had predicted 13%, until I searched the article for MyOrigins after reading the footnote (and did not find it). I infer from this article and a cursory Google search that Family Tree DNA is an evolution of or synonym of MyOrigins, though that could be made clearer in the footnote.
    And I’m confused by your glib link to 23andMe Iberian pop. Either you are seeing different text than I at that link, or are finding a clickthrough I am not, or “This dataset includes people of France Basque, Portuguese, or Spanish ancestry.” means something different and more specific to you than I. In my memory in the past 23andMe used to at least publish a table of #individuals in public datasets and private datasets being used as reference populations, though I can’t find it now. I don’t remember seeing their table for precision and recall currently provided here https://www.23andme.com/ancestry-composition-guide/ though.

  6. they used to include detailed numbers. can not find them now. i think that’s because the ref set varies by chip. the new 23andMe chip only overlaps 100K with the HGDP. might not be using that for new customers. but their internal data is huge now.
