
The issue is with the model, not precision!

The Wirecutter has a thorough review of direct-to-consumer ancestry testing services. Since I now work at a human personal genomics company, I'm not going to comment on the merits of any given service. But I do want to clarify something about the precision of these tests. Before quoting Jonathan Marks, the author writes:

For Jonathan Marks, anthropology professor at University of North Carolina at Charlotte, the big unknown for users is the margin for error with these estimates….

The issue I have with this quote is that the margin of error on these tests is really not that high. Margin of error itself is a precise concept. If you sample 1,000 individuals you’ll have a lower margin of error than if you sample 100 individuals. That’s common sense.

But for direct-to-consumer genomic tests you are sampling 100,000 to 1 million markers on SNP arrays (the exact number used for ancestry inference is often lower than the total number on the array). For ancestry testing you are really interested in the ~10 million (order of magnitude) markers which vary between populations, and a random sample of 100,000 to 1 million of them is going to be pretty representative (consider that election-year polling usually surveys a few thousand people to represent an electorate of tens of millions).
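To put rough numbers on that intuition, here is a minimal back-of-the-envelope sketch in Python. It treats markers as independent draws, which linkage disequilibrium makes only approximately true, so take it as an order-of-magnitude illustration rather than a real error model: the standard error of a proportion estimated from n samples is sqrt(p(1-p)/n).

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

# A 1,000-person election poll estimating a roughly 50/50 split.
print(f"poll, n = 1,000 people:       +/- {1.96 * standard_error(0.50, 1_000):.2%}")

# Very rough analogy: estimating a 25% ancestry component from 100,000 or
# 1,000,000 genotyped markers, treated as independent draws. In reality the
# markers are correlated (linkage disequilibrium), so the effective sample
# size is smaller, but the order of magnitude is the point.
print(f"array, n = 100,000 markers:   +/- {1.96 * standard_error(0.25, 100_000):.2%}")
print(f"array, n = 1,000,000 markers: +/- {1.96 * standard_error(0.25, 1_000_000):.2%}")
```

The poll comes out around plus or minus 3 percent; the marker-based estimates are an order of magnitude or two tighter, which is why sampling error is not where the disagreement between services comes from.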

If you run a package like Admixture you can repeat the calculation for a given individual multiple times. In most cases there is very little variation between replicates in the percentage breakdowns, even though you use a random seed to initialize the process as it begins to stochastically explore the parameter space (the variance is going to be higher if you try to resolve clusters which are extremely phylogenetically close, of course).
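As a concrete illustration of checking replicate stability, here is a small sketch. The file names are hypothetical, but Admixture-style .Q output is just a whitespace-delimited matrix of ancestry fractions, one row per individual and one column per ancestral component, so stacking several runs and looking at the spread is straightforward.

```python
import numpy as np

# Hypothetical file names: the same individuals run through the same model
# several times with different random seeds. Each .Q file is a plain
# whitespace-delimited matrix of ancestry fractions: one row per individual,
# one column per ancestral component.
replicate_files = ["run1.Q", "run2.Q", "run3.Q", "run4.Q", "run5.Q"]

# Shape: (replicates, individuals, components)
q = np.stack([np.loadtxt(path) for path in replicate_files])

# Standard deviation of each individual's component estimates across replicates.
sd = q.std(axis=0)

print("largest replicate-to-replicate SD of any ancestry fraction:", sd.max())
print("mean SD across individuals and components:", sd.mean())
```

For well-separated clusters those standard deviations are typically a fraction of a percentage point, which is the sense in which the "margin of error" of these tests is small.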

As I have stated before, the reason these different companies offer varied results is that they start out with different models. When I learned the basic theory around phylogenetics in graduate school, the philosophy was definitely Bayesian: vary the model parameters, and the model itself, and see what happens. But you can't really vary the model from customer to customer, can you? It quickly becomes a nightmare for customer service.

There are certain population clusters that customers are interested in. To provide a service to the public a company has to develop a model that answers the questions which are in demand. If you were designing a model for purely scientific purposes you'd want to highlight the maximal amount of phylogenetic history. That isn't always the same as the history that customers want to know about, though. This means that the models behind direct-to-consumer ethnicity tests deviate from purely scientific questions, and reflect a lot of judgment calls based on each company's evaluation of its client base.

Addendum: There is a lot of talk about the reference population sets. The main issue is representativeness, not sample size. You don’t really need more than 10-100 individuals from a given population in most cases. But you want to sample the real population diversity that is out there.
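A rough sketch of why panel size hits diminishing returns quickly: the standard error of an allele-frequency estimate from a diploid panel of n individuals shrinks with the square root of 2n, so going from tens to hundreds of reference individuals buys relatively little. None of this helps if the panel isn't representative of the population it stands in for, which is the harder problem.

```python
import math

def allele_freq_se(p, n_individuals):
    """Standard error of an allele-frequency estimate from a diploid
    reference panel of n_individuals (2 * n sampled chromosomes)."""
    return math.sqrt(p * (1 - p) / (2 * n_individuals))

# Diminishing returns for an allele at frequency 0.30 in the population.
for n in (10, 50, 100, 1_000):
    print(f"n = {n:5d} reference individuals: SE ~ {allele_freq_se(0.30, n):.3f}")
```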

6 thoughts on “The issue is with the model, not precision!”

  1. So, what you are saying is, if you want to define three models, say, black, white, French, you need a population to define each model. Once you have those populations, then you have those models. Is that what you are saying?

  2. So, what you are saying is, if you want to define three models, say, black, white, French, you need a population to define each model. Once you have those populations, then you have those models. Is that what you are saying?

    there are other specifications in the models besides that…but this is the major issue. what populations do you want to define and test for and what references are you going to define as those populations? when it comes to what consumers want there’s no straightforward guide beyond the most general level.

    the parameter here is the K. in the above K = 3. then you would find the populations you want to define on each of the K’s…though ‘french’ is really a subset of ‘white’ so that would be tricky. but there are ML ways….

  3. Isn’t there also a problem with the scope / time scale of the decomposition? When you’re analyzing GWAS, the goal is to account for ancient ancestral LDs around the markers, and the time scale is measured in millennia. But for practical genealogical application, the genealogically tractable ancestral ethnicities have typical time scales of only about a century. One kind of decomposition would tell you that you have about 10% East Asian DNA; another should tell you that you have no East Asian great-grandparents.

    Using historically recently admixed, and genetically similar, populations is extremely important for practical genetic genealogy. But in terms of global ancestry, the resolution of such a decomposition hinges on post-admixture drift. A putative global ancestry model which would separate Bengali from other South Asian ethnicities would have to rely on drift within the Bengali founder population to distinguish it from its more ancient ancestral South and East Asian progenitors. A tall order, perhaps impossible due to the internal structure of South Asian populations. In the same vein, we are kind of resigned to understanding that Latin Americans can only be deconvoluted into “continental ancestry proportions”, and we don’t really expect to tell apart a Colombian grandfather from a Venezuelan one. We are psychologically prepared for just that by agreeing that it’s all a recent melting pot.
    In Europe, though, the implicit understanding is that there hasn’t been a melting pot for a very long time, and therefore there is an under-served, if perhaps naive, need to tell apart Scots from Anglo-Saxons from Jutes, etc. They may all be an outcome of ancient admixing, but that’s not what concerns people. They expect the genes to tell them their ancestors’ origins within the last century or two, and there may be no way to satisfy this need reliably without local ancestry and deeper marker density?

  4. Isn’t there also a problem with the scope / time scale of the decomposition?

    yes. the clusters that are demanded are not taxonomically equivalent ranks.
