August 13, 2003
Asian genes in Germans?
This man found 21% "Asian" DNA when he took a test by Ancestry By DNA. He is Pennsylvania Dutch by heritage and speculates that the Asian DNA comes from the time of Attila the Hun. I don't know if I buy that explanation (who knows if the ABD test is really that accurate? I believe it's an autosomal DNA test, and those are kind of trickier than Y chromosomal or mtDNA tests, though more informative if done well), but it's interesting in any case... (via Human Races)
The problem with ABD is that it doesn't give a confidence estimate. Their algorithm is a linear classifier applied in SNP space:
Allele frequencies of 56 SNPs (most from pigmentation genes) were dramatically different between groups of unrelated individuals of Asian, African, and European descent, ... A linear classification method was developed for incorporating these SNPs into a classifier model..
Very briefly, the idea is to represent each person by a 56 dimensional vector, with entries being the typed SNPs, and to come up with simple functions that separate the resulting 56 dimensional vector space into sets that correspond to racial groups. This is how it'd play out if you had a 2 dimensional vector with continuous values in each of the entries:
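A minimal Python sketch of the 2 dimensional case is given here. It is not ABD's algorithm: it uses a nearest-centroid rule as a stand-in, because that is about the simplest rule that produces a linear decision boundary. The groups "A" and "B", the sample points, and the function names are all made up for illustration.

```python
# Toy 2-D linear classifier. All coordinates are invented for illustration;
# they are not SNP data and this is not ABD's actual method.

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_linear_boundary(group_a, group_b):
    """Return (w, b) such that w.x + b > 0 means x is closer to A's centroid."""
    ca, cb = centroid(group_a), centroid(group_b)
    # The boundary is the perpendicular bisector of the segment ca-cb.
    w = (ca[0] - cb[0], ca[1] - cb[1])
    mid = ((ca[0] + cb[0]) / 2, (ca[1] + cb[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def score(x, w, b):
    """Signed score; values near zero lie close to the decision boundary."""
    return w[0] * x[0] + w[1] * x[1] + b

def classify(x, w, b):
    return "A" if score(x, w, b) > 0 else "B"

group_a = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9)]
group_b = [(3.0, 3.1), (2.9, 2.7), (3.2, 3.0)]
w, b = fit_linear_boundary(group_a, group_b)

for point in [(1.1, 1.0), (2.8, 3.0), (2.0, 2.1)]:
    print(point, classify(point, w, b), round(score(point, w, b), 3))
```

The third point sits almost on the line between the two groups, so its label flips with a tiny change in its coordinates. The label alone hides that, which is why the score (or some probability derived from it) matters.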
The problem with this "21%" figure is that it includes none of the associated probabilistic data. Very few alleles are *exclusive* to a population, so there is the possibility that some fraction of people will have alleles more common in other populations just by random chance. The use of SNPs (even informative SNPs) rather than full haplotype blocks makes this confounding factor more likely.
As an example, take a look at ALFRED's list of population-related allele frequency variations in a serotonin receptor. The "G861C HincII" polymorphism in this sequence has two common possibilities for the base pair at that location: G and C. Among the Yoruba, G is found 81.6% of the time, while C is found 18.4% of the time. Among the Japanese, G is found 64.3% of the time, and C is found 35.7% of the time.
If you ran a naive linear classifier on this data to classify genotypes into Japanese and Yoruba, your algorithm would end up assigning those with G's to the Yoruba group, and those with C's to the Japanese group. Needless to say, you'd get a lot of false classifications (you can work out the exact error rate). But that's the best you can do with such uninformative loci, as the frequencies don't sharply differ between the two populations.
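Working out that error rate takes only a few lines. The sketch below uses the ALFRED frequencies quoted above and makes two assumptions that are not in the original: each case is a single allele copy, and the test set is an even mix of Yoruba and Japanese samples. The names `freq`, `rule`, `prior`, and `error_rate` are just for illustration.

```python
# Allele frequencies for the G861C HincII polymorphism, as quoted from ALFRED.
freq = {
    "Yoruba":   {"G": 0.816, "C": 0.184},
    "Japanese": {"G": 0.643, "C": 0.357},
}

# The naive rule described in the text: G -> Yoruba, C -> Japanese.
rule = {"G": "Yoruba", "C": "Japanese"}

# ASSUMPTION: the two populations are equally represented among test cases.
prior = {"Yoruba": 0.5, "Japanese": 0.5}

def error_rate(freq, rule, prior):
    """Probability that the rule assigns a single allele to the wrong group."""
    err = 0.0
    for pop, alleles in freq.items():
        for allele, p in alleles.items():
            if rule[allele] != pop:
                err += prior[pop] * p
    return err

err = error_rate(freq, rule, prior)
print(f"error rate: {err:.4f}")  # 0.5 * 0.184 + 0.5 * 0.643 = 0.4135
```

Under those assumptions the rule is wrong about 41% of the time, which is only modestly better than flipping a coin.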
The ABD authors claim that they're using very informative loci, which is not impossible. Certain alleles are almost exclusively found in certain populations:
A. The Fyo allele of the Duffy blood group system occurs in ca. 100% of sub-Saharan Africans and is rare in other populations. ...
Note that the Fyo allele is exactly the kind of allele we're looking for: very common within a population, and almost entirely absent outside that population. If the Dia allele is only found in Asians/Amerinds, but is infrequent within that group, it's not so useful for classification. Fyo-type alleles are the exception and not the rule, however, as you can learn for yourself in even a cursory browsing of ALFRED.
Ok. So, after all that foreplay, you can see where I'm going with this. It is quite likely that the guy in question simply had a chance combination of alleles at the measured SNPs that are more common in East Asian populations. I think that measurements of other SNPs (or, even better, full haplotype blocks) would put him squarely back into the European category. What is absent yet necessary is a probabilistic statement of how likely it is that his allele distribution was due to chance rather than actual East Asian ancestry. 
(A note on the ALFRED allele frequencies quoted above: assume these values to be exact for now. A more sophisticated treatment would include error bars to account for sample size effects.)
A forensic scientist told the court that semen at the scene was 700 billion times more likely to belong to Reekie than any other man
One can critique the (frequent, unstated) assumption that the typed loci in this trial were truly independent, but the basic principle holds: it's important to give a probabilistic statement of the signal-to-noise ratio.
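To see why the independence assumption carries so much weight, here is a rough Python sketch of how per-locus evidence gets multiplied together. The single-locus ratio uses the C allele frequencies quoted earlier from ALFRED. The ten-locus case is purely hypothetical: it pretends there are ten independent loci, each as weakly informative as that one, all showing the allele that is more common in the Japanese sample. The function name `combined_lr` is made up.

```python
import math

# Frequencies of the C allele at G861C HincII, as quoted above from ALFRED.
p_c_japanese = 0.357
p_c_yoruba = 0.184

def combined_lr(per_locus_lrs):
    """Multiply per-locus likelihood ratios (summing logs for stability).

    The product is only a valid likelihood ratio if the loci are independent.
    """
    return math.exp(sum(math.log(lr) for lr in per_locus_lrs))

# One locus: how much more probable is a C under "Japanese" than "Yoruba"?
single_lr = p_c_japanese / p_c_yoruba

# HYPOTHETICAL: ten independent loci, each with the same ratio as above.
ten_loci_lr = combined_lr([single_lr] * 10)

print(f"one locus: {single_lr:.2f}")
print(f"ten loci:  {ten_loci_lr:.0f}")
```

A ratio a bit under 2 at one locus grows to several hundred over ten loci, but only because the product rule treats each locus as fresh evidence. If the loci are correlated, the same multiplication overstates the result, which is exactly the critique of quoting a huge number without checking independence.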
A few points:
1) It is incorrect to say that ABD does not include a confidence estimate:
In order to know your proportions with 100% confidence, we would have to perform the test for each region of the variable genome, which would make the test very expensive. Since we have not, your results are statistical estimates. We calculate and plot for you all of the estimates that are 2 times, 5 times and 10 times less likely than the MLE. The first contour (black line) around your MLE delimits the space outside of which the points are 2 times less likely, and the second contour (blue line) around the MLE delimits the space outside of which the estimates are 5 less likely than the MLE. The third contour (yellow line) delimits the space outside of which the estimates are 10 times less likely than the MLE. The greater the number of DNA positions we read, the closer these contour lines come to the MLE point. On the triangle plot, the likelihood (probability) that your true value is represented by a different point, than the MLE decreases as you approach the red dot, where the probability is at its maximum (hence, it is called the Maximum Likelihood Estimate or MLE). We could perform the test so that the contour lines are very close to the MLE, however this would require us to sequence a much larger collection of markers. To keep the test affordable, we limit the survey a reasonable number of markers that are sufficient for you to know with good confidence what your proportions are. The yellow circle (10X contour) is also referred to as the “one-log interval” and is taken generally taken as a scientific level of confidence.
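The contours in that description are likelihood-ratio regions around a maximum likelihood estimate. Here is a simplified Python sketch of the idea, not ABD's procedure: it uses two ancestral populations instead of three (so the "triangle plot" collapses to a line), treats each typed allele copy as independent, and runs on invented allele frequencies. The names `observed`, `log_likelihood`, and `mle_and_interval` are hypothetical.

```python
import math

# HYPOTHETICAL data: for each typed allele copy, the frequency of the observed
# allele in population A and in population B. None of these are real loci.
observed = [(0.90, 0.20), (0.80, 0.30), (0.10, 0.70), (0.85, 0.25),
            (0.90, 0.40), (0.20, 0.80), (0.70, 0.10), (0.95, 0.30)]

def log_likelihood(m, observed):
    """Log-likelihood of admixture proportion m (fraction of ancestry from A).

    Each allele copy is drawn from A with probability m, otherwise from B.
    ASSUMPTION: allele copies are independent of one another.
    """
    return sum(math.log(m * pa + (1 - m) * pb) for pa, pb in observed)

def mle_and_interval(observed, fold=10.0, steps=1000):
    """Grid-search the MLE and the range of m at most `fold` times less likely."""
    grid = [i / steps for i in range(steps + 1)]
    ll = [log_likelihood(m, observed) for m in grid]
    best = max(ll)
    mle = grid[ll.index(best)]
    inside = [m for m, v in zip(grid, ll) if v >= best - math.log(fold)]
    return mle, min(inside), max(inside)

for fold in (2.0, 5.0, 10.0):
    mle, lo, hi = mle_and_interval(observed, fold=fold)
    print(f"{fold:>4.0f}x interval: MLE={mle:.3f}, range=[{lo:.3f}, {hi:.3f}]")
```

With only eight allele copies the 10x range comes out wide, and it narrows as more markers are added, which matches the quoted remark that the contours tighten with more DNA positions. A point estimate such as "21%" is only as good as the width of that range.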
What I intended to say was that without a confidence estimate, this man's laborious, wordy speculation on his admixture percentages is meaningless. Given fuzzy overlapping sets, classifications of points near the statistical decision boundary will have higher error rates than classifications of points that are close to the interiors of these fuzzy sets. See the comments below for an elaboration on this point.
It would be quite useful to know exactly which loci ABD typed and whether they ensured pairwise independence of loci (e.g. by selecting genes on different chromosomes). However, the Frudakis "Ancestry by DNA" paper is in a rather obscure journal. If anyone has a copy, please send me email. Also, I wish to reiterate that in principle it is of course possible to assign ancestry and assess admixture. My skepticism is confined to ABD's methodology.