August 13, 2003

Asian genes in Germans?

This man found 21% "Asian" DNA when he took a test by Ancestry By DNA. He is Pennsylvania Dutch by heritage and speculates that the Asian DNA comes from the time of Attila the Hun. I don't know if I buy that explanation (who knows if the ABD test is really that accurate? I believe it's an autosomal DNA test, a kind that is trickier than Y-chromosomal or mtDNA tests, though more informative if done well), but it's interesting in any case.... (via Human Races)

Godless comments:

The problem with ABD is that it doesn't give a confidence estimate. Their algorithm is a linear classifier applied in SNP space:

Allele frequencies of 56 SNPs (most from pigmentation genes) were dramatically different between groups of unrelated individuals of Asian, African, and European descent, ... A linear classification method was developed for incorporating these SNPs into a classifier model..

Very briefly, the idea is to represent each person by a 56-dimensional vector, with the typed SNPs as entries, and to come up with simple functions that partition the resulting 56-dimensional vector space into regions that correspond to racial groups. This is how it'd play out if you had a 2-dimensional vector with continuous values in each of the entries:
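
A minimal Python sketch of that two-dimensional case. The points are made up, and the nearest-centroid rule is just one simple linear classifier chosen for illustration; it is not claimed to be ABD's algorithm. The decision boundary is the straight line halfway between the two group centroids.

```python
# Hypothetical 2-D points for two groups; the values are invented for illustration.
group_a = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1)]
group_b = [(3.0, 2.8), (3.2, 3.1), (2.7, 3.0)]

def centroid(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def make_linear_classifier(a_points, b_points):
    """Nearest-centroid rule. Its boundary is the line w.x + b = 0 that
    perpendicularly bisects the segment joining the two centroids."""
    ca, cb = centroid(a_points), centroid(b_points)
    w = (cb[0] - ca[0], cb[1] - ca[1])
    mid = ((ca[0] + cb[0]) / 2, (ca[1] + cb[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])

    def classify(x):
        score = w[0] * x[0] + w[1] * x[1] + b
        return "B" if score > 0 else "A"

    return classify

classify = make_linear_classifier(group_a, group_b)
print(classify((1.1, 1.0)))
print(classify((2.9, 3.0)))
```

Points deep inside either cluster are classified reliably; points near the boundary line are where such a rule makes its mistakes.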

The problem with this "21%" figure is that it includes none of the associated probabilistic data. Very few alleles are *exclusive* to a population, so there is the possibility that some fraction of people will have alleles more common in other populations just by random chance. The use of SNPs (even informative SNPs) rather than full haplotype blocks makes this confounding factor more likely.

As an example, take a look at ALFRED's list of population-related allele frequency variations in a serotonin receptor. The "G861C HincII" polymorphism in this sequence has two common possibilities for the base pair at that location: G and C. Among the Yoruba, G is found 81.6% of the time, while C is found 18.4% of the time. Among the Japanese, G is found 64.3% of the time, and C is found 35.7% of the time. [1]

If you ran a naive linear classifier on this data to classify genotypes into Japanese and Yoruba, your algorithm would end up assigning those with G's to the Yoruba group, and those with C's to the Japanese group. Needless to say, you'd get a lot of false classifications (you can work out the exact error rate from the frequencies above). But that's the best you can do with such uninformative loci, as the frequencies don't sharply differ between the two populations.
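
Here is one way to work out that error rate, as a rough Python sketch. It uses the ALFRED frequencies quoted above and two simplifying assumptions that are mine, not the data's: each allele is classified on its own (diploid genotypes are ignored), and an allele is equally likely to come from either population.

```python
# Allele frequencies quoted above for the G861C polymorphism (taken as exact).
freq = {
    "Yoruba":   {"G": 0.816, "C": 0.184},
    "Japanese": {"G": 0.643, "C": 0.357},
}

# The naive rule from the text: G -> Yoruba, C -> Japanese.
rule = {"G": "Yoruba", "C": "Japanese"}

# Assumption: equal prior probability for the two populations.
prior = {"Yoruba": 0.5, "Japanese": 0.5}

def error_rate(freq, rule, prior):
    """Probability that the rule assigns an allele to the wrong population."""
    err = 0.0
    for pop, alleles in freq.items():
        for allele, p in alleles.items():
            if rule[allele] != pop:
                err += prior[pop] * p
    return err

err = error_rate(freq, rule, prior)
print(f"{err:.4f}")
```

Under these assumptions the rule is wrong 0.5 × 0.184 + 0.5 × 0.643 = 0.4135 of the time, i.e. about 41%, which is not much better than a coin flip.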

The ABD authors claim that they're using very informative loci, which is not impossible. Certain alleles are almost exclusively found in certain populations:

A. The Fyo allele of the Duffy blood group system occurs in ca. 100% of sub-Saharan Africans and is rare in other populations. ...

B. The Dia allele of the Diego blood group system is found only in Asians and Amerinds and supports the close genetic affinity of these populations.

Note that the Fyo allele is exactly the kind of allele we're looking for: very common within a population, and almost entirely absent outside that population. If the Dia allele is only found in Asians/Amerinds, but is infrequent within that group, it's not so useful for classification. [2] Fyo type alleles are the exception and not the rule, however, as you can learn for yourself in even a cursory browsing of ALFRED.

Ok. So, after all that foreplay, you can see where I'm going with this. It is quite likely that the guy in question simply had a chance combination of alleles at the measured SNPs that are more common in East Asian populations. I think that measurements of other SNPs (or, even better, full haplotype blocks) would put him squarely back into the European category. What is absent yet necessary is a probabilistic statement of how likely it is that his allele distribution was due to chance rather than actual East Asian ancestry. [3]

[1] Assume these values to be exact for now. A more sophisticated treatment would include error bars to account for sample size effects.
[2] I don't know the Dia frequency.
[3] This is what is done in rape trials, for example, when DNA evidence is presented.

A forensic scientist told the court that semen at the scene was 700 billion times more likely to belong to Reekie than any other man

One can critique the (frequent, unstated) assumption that the typed loci in this trial were truly independent, but the basic principle holds: it's important to give a probabilistic statement of the signal-to-noise ratio.
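
To make the "probabilistic statement" concrete, here is a small Python sketch of a likelihood ratio over several loci. The frequencies are invented for illustration (they are not ABD's markers or forensic loci), and multiplying the per-locus ratios is only valid under the independence assumption just mentioned.

```python
import math

# Hypothetical frequencies of the allele actually observed at each locus:
# (frequency in population 1, frequency in population 2). Invented numbers.
observed = [
    (0.80, 0.20),
    (0.30, 0.60),
    (0.75, 0.25),
    (0.55, 0.45),
]

def log10_likelihood_ratio(loci):
    """Sum of per-locus log10 ratios; assumes the loci are independent."""
    return sum(math.log10(p1 / p2) for p1, p2 in loci)

llr = log10_likelihood_ratio(observed)
ratio = 10 ** llr
print(f"observed alleles are about {ratio:.1f}x more likely under population 1")
```

With these made-up numbers the ratio comes out at roughly 7 to 1, which is weak evidence. Highly informative loci, or many more loci, are what push the ratio toward figures like the one quoted in the trial.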

Godless clarifies:

A few points:

1) It is incorrect to say that ABD does not include a confidence estimate:

In order to know your proportions with 100% confidence, we would have to perform the test for each region of the variable genome, which would make the test very expensive. Since we have not, your results are statistical estimates. We calculate and plot for you all of the estimates that are 2 times, 5 times and 10 times less likely than the MLE. The first contour (black line) around your MLE delimits the space outside of which the points are 2 times less likely, and the second contour (blue line) around the MLE delimits the space outside of which the estimates are 5 times less likely than the MLE. The third contour (yellow line) delimits the space outside of which the estimates are 10 times less likely than the MLE. The greater the number of DNA positions we read, the closer these contour lines come to the MLE point. On the triangle plot, the likelihood (probability) that your true value is represented by a different point than the MLE decreases as you approach the red dot, where the probability is at its maximum (hence, it is called the Maximum Likelihood Estimate or MLE). We could perform the test so that the contour lines are very close to the MLE, however this would require us to sequence a much larger collection of markers. To keep the test affordable, we limit the survey to a reasonable number of markers that are sufficient for you to know with good confidence what your proportions are. The yellow circle (10X contour) is also referred to as the “one-log interval” and is generally taken as a scientific level of confidence.

What I intended to say was that without a confidence estimate, this man's laborious, wordy speculation on his admixture percentages is meaningless. Given fuzzy overlapping sets, classifications of points near the statistical decision boundary will have higher error rates than classification of points that are close to the interiors of these fuzzy sets. See the comments below for an elaboration on this point.
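
For readers who want to see what "2x, 5x, 10x less likely than the MLE" means in practice, here is a one-dimensional Python sketch. It is not ABD's procedure: it assumes only two source populations, independent loci, and invented allele frequencies and genotypes. It finds the admixture fraction that maximizes the likelihood and the range of fractions whose likelihood is within a given factor of that maximum.

```python
import math

# Hypothetical data, one tuple per locus:
# (freq of allele X in population 1, freq in population 2,
#  number of X copies the individual carries: 0, 1 or 2).
loci = [
    (0.90, 0.10, 2), (0.85, 0.20, 2), (0.10, 0.80, 0), (0.75, 0.15, 1),
    (0.20, 0.90, 0), (0.80, 0.25, 2), (0.15, 0.85, 1), (0.95, 0.30, 2),
]

def log_likelihood(m, loci):
    """Log-likelihood of admixture fraction m (ancestry share from population 2),
    assuming independent loci and independent allele copies within a locus."""
    total = 0.0
    for p1, p2, k in loci:
        q = (1 - m) * p1 + m * p2  # chance that a single allele copy is X
        total += k * math.log(q) + (2 - k) * math.log(1 - q)
    return total

# Evaluate the likelihood on a grid of candidate fractions from 0 to 1.
grid = [i / 1000 for i in range(1001)]
ll = [log_likelihood(m, loci) for m in grid]
best = max(ll)
mle = grid[ll.index(best)]

def interval(factor):
    """Range of m whose likelihood is no more than `factor` times below the maximum."""
    inside = [m for m, v in zip(grid, ll) if v >= best - math.log(factor)]
    return min(inside), max(inside)

print("MLE:", mle)
for factor in (2, 5, 10):
    print(f"{factor}x interval:", interval(factor))
```

With only eight loci the intervals are wide. Adding informative loci narrows them, which is the point ABD makes about the contour lines closing in on the MLE.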

It would be quite useful to know exactly which loci ABD typed and whether they ensured pairwise independence of loci (e.g. by selecting genes on different chromosomes). However, the Frudakis "Ancestry by DNA" paper is in a rather obscure journal. If anyone has a copy, please send me email. Also, I wish to reiterate that in principle it is of course possible to assign ancestry and assess admixture. My skepticism is confined to ABD's methodology.

Posted by razib at 01:01 PM




I would probably also show some asian ancestry - some of my father's ancestors of whom we have pictures have slightly mongoloid features, and wouldn't look out of place in say Tajikistan.

Posted by: bbartlog at August 13, 2003 02:06 PM


so would i-i have cousins that could pass for burmese.... (my family is actually from the eastern part of eastern bengal, so that makes sense)

Posted by: razib at August 13, 2003 02:46 PM


Interesting irony if true. The Germans remind me a lot of the Japanese and I admire their society precisely for that reason - efficient, frugal, disciplined, future-oriented, strong work ethic (temperamentally I'm a Protestant at heart :))

Other interesting ironies
1) Hitler did consider the Japanese to be 'honorary Aryans'
2) Hitler funded a harebrained archaeological quest for the origins of the Aryans. His researchers ended up somewhere around Tibet and Nepal - I think the movie '7 years in Tibet' is based on precisely this incident. The moviemakers were a bit embarrassed when it turned out that the character played by Brad Pitt was an enthusiastic Nazi, though on good terms with the young Dalai Lama

Posted by: Jason Soon at August 13, 2003 04:04 PM


Heinrich Harrer (sp?) I think was his name. The filmmakers likely knew his whole story when they made the movie; I think you would have to read more about him to decide how much of a Nazi he was. The film certainly doesn't say much about it.

Posted by: bbartlog at August 13, 2003 05:18 PM


One way in which the Germans are quite unlike the Japanese is their, how shall I say, poor customer service skills. In America and in Japan it's pretty unthinkable that for example a counterperson in a deli would tell you to hurry up and make up your mind, but things like that happened to my mother repeatedly in recent years (she lives in Hamburg and teaches English). The relationship between customer and vendor is much more equal there - none of this 'the customer is always right'...

Posted by: bbartlog at August 13, 2003 05:21 PM


In that respect they're a lot more similar to the Chinese.

Whilst I love authentic Chinese food, customer service at such restaurants is almost always poor unless I know the proprietor personally.

Posted by: Johnny Rotten at August 13, 2003 05:50 PM


Off-topic, but someone mentioned being from Tajikistan and having Mongoloid features.

For a decent look at the wide "variety" of "facial features" in the former Soviet states including Russia, check out this mail-order bride site:

www.bride.ru

On one end of the spectrum you have blondes who look like they've been uprooted from Norway; on the other end people who could pass for Tibetan, some of them even Nepalese. I know what you're thinking but some of the asian-looking aren't from states adjoining china or mongolia; migration I suppose.

Posted by: Johnny Rotten at August 13, 2003 10:12 PM


the idea is to represent each person by a 56 dimensional vector
Well, well! So the idea I heard about representing people as vectors in n-dimensional spaces isn't nutball after all. That was in the context of personality testing, which is undoubtedly much less objective, but still.
I'd like to learn more about n-dimensional vectors as representations of data. Where could I find this?

Posted by: John Purdy at August 14, 2003 01:52 AM


The whole point is that with multiple loci, the chance event of an individual having all the alleles which are found in "another" population is very small. And, if someone has many "East Asian" alleles, while being "European", then it may just be a very improbable event of normal variation within Europeans, or more likely he just has some "East Asian" ancestry.

Posted by: Dienekes at August 14, 2003 03:47 AM


GC:
Yeah, my math never gets used although I have a longstanding interest in the subject. Probably be pretty tough. Still, I've been thinking for a while I should upgrade in this area. It's difficult to ascertain the veracity of scientific statements without getting the underlying math. I'll check out the book and see what I think.

Posted by: John Purdy at August 14, 2003 12:39 PM


More nonsense from this website. First - ABD DOES give confidence intervals for all their admixture estimates..it is in the chart they give for their data. Second, the number of population-specific alleles is too small to quantitatively measure racial admixture. Third, ABD uses alleles which have high "delta values" in populations, and uses more than 40 to estimate the European-East Asian divide. If you had only 5 such loci, and if for each, Europeans had an 80% chance for form X and a 20% chance for form Y, the chance of any unmixed European having Y's at all 5 loci is 0.032%.
Random chance, indeed!

Posted by: Rienzi at August 27, 2003 11:22 AM
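
The arithmetic in the comment above can be checked with a short Python sketch. It takes the comment's 20% figure at face value and assumes independent loci with one allele scored per locus (a simplification). The `tail` helper is a hypothetical name for the standard binomial upper tail, which extends the same reasoning to "at least k of n loci".

```python
from math import comb

p_y = 0.20  # chance an unmixed European carries form Y at one locus (from the comment)

# Probability that all 5 of 5 independent loci show form Y.
p_all_five = p_y ** 5
print(f"{p_all_five:.3%}")

def tail(n, k, p):
    """P(at least k of n independent loci show form Y): binomial upper tail."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Example with a larger panel: chance of at least half of 40 loci showing Y.
print(tail(40, 20, p_y))
```

The first figure is 0.2^5 = 0.00032, i.e. 0.032%, matching the comment. The multiplication only holds if the loci really are independent, which is the same caveat raised about the forensic example in the post.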


Yeah.."Godless" uses as an example _one_ locus. How about 40? Note also that the Frudakis paper presents just the initial methodologies of the ABD technology. If you'd bother reading it, you'd find that it openly states that the algorithm used in the paper is ONLY good for determining majority ancestry, and that other methods need be used for determining minor ancestry. These other methods were developed for the ABD test, which will be updated soon.
Scientists _should be_ aware that population genetics is primarily a statistical science, looking at differences in gene frequencies. When gene frequencies differ between populations, one can estimate ancestral proportions.
One wonders if much of the criticism of the test is due to the discovery of East Asian admixture in Northern European and South Asian populations.

Posted by: Rienzi at August 27, 2003 11:30 AM