Recently in Washington D.C., at the International Conference for Genetic Genealogy, Spencer Wells ran through a survey of personal genomics and asserted that over the past year we’ve nearly doubled the number of genotypes on dense marker arrays, from ~1 million to nearly 2 million. From talking to people, that seems about right: 23andMe alone is approaching the million mark. Combined with the databases of National Geographic, Ancestry and Family Tree DNA, the 2 million mark is surely within reach. But the presentation wasn’t simply about personal genomics today. He tied the recent changes to events over the past generation which brought genetics to the people and allowed its widespread use among researchers exploring historical questions. Having come out of the Human Population Genetics Lab at Stanford in the 1990s, Wells pointed out that the constituent populations of the HGDP were consciously selected by L. L. Cavalli-Sforza to be biased toward those which were more isolated and genetically less subject to globalization. The idea was to pick up stronger signals of ancient population structure which might be obscured by more recent movements of peoples across the earth.
By and large Cavalli-Sforza’s intuition has been validated. The HGDP data set has yielded enormous insight, first with classical markers (see History and Geography of Human Genes), then during the Y and mtDNA era around 2000, and then with microsatellites and dense SNP arrays in the aughts. I’ve heard that the Sardinian samples that Cavalli-Sforza selected were somewhat less cosmopolitan than other Sardinian samples collected later, which points to his first-hand knowledge of Italian genetic variation. He also had the area knowledge to sample both the western and eastern Pygmy populations of the Congo. The evolutionary history separating these two groups is likely on the order of that dividing inter-continental populations outside of Africa, and the eastern Pygmies in particular are an invaluable reservoir of human genetic variation. To answer a few big questions, a data set of 1,000 specially selected humans from across the world turns out to have been sufficient.
But there is a flip side. The answers to the many small questions are going to require much more sample coverage. Wells himself illustrated this reality when he recounted a story from his Geno 2.0 database, where a few percent of ethnic Hungarians exhibit Central Asian uniparental haplotypes. He pointed out that standard analyses don’t show anything special about Hungarians, but then again you are likely to miss admixtures on the order of a few percent here and there if you are working with sample sizes of fewer than 100, which is still typical for European nations. Even a large data set, such as POPRES, is dwarfed by those of commercial firms and National Geographic.
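To put rough numbers on the sample size point, here is a minimal back-of-the-envelope sketch. It assumes a haplotype carried by a fixed fraction of a population and a simple random sample of unrelated individuals. The function names and the example frequencies are illustrative assumptions of mine, not figures from Wells or the Geno 2.0 database.

```python
import math


def prob_no_carriers(freq: float, n: int) -> float:
    """Chance that a random sample of n people contains zero carriers
    of a haplotype found at population frequency `freq`.
    Assumes independent draws (a simplification)."""
    return (1.0 - freq) ** n


def sample_size_for_detection(freq: float, power: float = 0.95) -> int:
    """Smallest n for which the chance of seeing at least one carrier
    is at least `power`, under the same independence assumption."""
    return math.ceil(math.log(1.0 - power) / math.log(1.0 - freq))


if __name__ == "__main__":
    # Hypothetical haplotype frequencies and sample sizes, for illustration only.
    for freq in (0.01, 0.02, 0.05):
        for n in (50, 100, 1000):
            miss = prob_no_carriers(freq, n)
            print(f"freq={freq:.2f} n={n:5d} P(no carriers)={miss:.3f}")
        needed = sample_size_for_detection(freq)
        print(f"freq={freq:.2f} needs n>={needed} for a 95% chance of one carrier")
```

With a haplotype at 2% and a sample of 50, the chance of drawing no carriers at all is roughly 36%. Seeing a single carrier is also a long way from estimating a frequency with any confidence, so this is a lower bound on the problem, not a full power analysis.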
Last week I reported that, using the Family Tree DNA database, Afrikaners do seem admixed. But that’s not the first strange thing that I saw. When devising myOrigins I had to tackle the problem of German ancestry and assigning individuals to a particular cluster. I was skeptical for a simple reason: the historical literature was clear that substantial numbers of Germans from the eastern zones of habitation were acculturated Slavs. But I was met with a surprise. Yes, the genetic variation showed a substantial amount of admixture with Slavs. Using MDS/PCA you can see that there is an admixture cline with a Polish reference population for individuals who state that both their parents were born in Germany. But there is a second distinct cluster among Germans which overlaps with that of northern France. I don’t have data on the scale of specific geographic points, but my suspicion is that the region around Cologne is genetically distinct because of long-term gene flow with northern France. The scale of Huguenot migration to Germany in the 18th century seems unlikely to explain this.
On the scale of the “big questions” this is trivial and of minor note. But to answer this sort of question about genetically close populations you need dense geographic coverage and large sample sizes. Currently only the personal genomics firms have this, to my knowledge. POPRES and the Framingham Heart Study both have sample sizes in the ~5,000 range total. 1000 Genomes is an improvement, but even its sample sizes and coverage are not sufficient. Where are we going? Spencer Wells talks a lot about citizen science. Many of the enthusiasts for scientific genealogy have done very deep analyses of their own genotypes. They certainly have the skills to analyze bigger datasets. If computing power is needed, then an Amazon cloud server could provide that. The problem, though, is that customer data can’t just be shared by the big companies themselves. In the short term the solution is to scale up projects like Harappa DNA, and to formalize their structure so that those who submit their genotype data are protected in some fashion (e.g., against exposure of identity). Academic scholars are going to be focusing on whole genomes, more subtle methods, and exotic and obscure populations which can get at the big questions. To really see my point, see what Google Scholar returns when you type “genetic population structure germany”.