Thursday, September 20, 2007
New paper out in PLOS Genetics, PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations:
Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics such as population genetics, and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure informative markers demand prior knowledge of the membership of the studied individuals to predefined populations. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. Finally, we show that our algorithm can also be successfully applied for the identification of structure informative markers when studying populations of complex ancestry.

The text has the nitty-gritty on how many SNPs are needed to generate the population clusters. They seem to be selling the method on a "faster, cheaper" spin. Jump to the discussion, though, and something interesting does pop out that doesn't require meditation upon the uses of orthonormal vectors:

Our findings demonstrate that to a large extent, SNPs identified as structure informative in one geographic region are not portable for the analysis of populations in a different geographic region, suggesting that the forces that shaped population structure in each geographic region have influenced different parts of the genome.
However, analyzing jointly nine populations from around the world and 9,160 SNPs, we showed that using 50 PCA-correlated SNPs we can assign the studied individuals with 100% accuracy to their population of origin....

What could those forces be? You can connect the dots. Finally, a small detail which I thought was interesting:

...As we have shown here, analyzing two independent Puerto Rican datasets, PCA-correlated SNPs can be successfully used to reproduce the structure of admixed populations and predict the ancestry proportions of the studied individuals. Interestingly, we found that interindividual variation across the Native American axis in the Puerto Rican samples that we studied was very low, perhaps depicting the fact that admixture with Native Americans occurred very long ago, and was random over several generations.

This seems to make sense: the Taino were absorbed into the Puerto Rican population in the 16th century. Subsequent to this there were hundreds of years of African and European immigration to the island. Nevertheless, a substantial proportion of the mtDNA lineages in Puerto Ricans are Amerindian, which implies that the European and African immigrants were disproportionately male (otherwise European and African mtDNA lineages would have slowly replaced the Amerindian ones over time).

Labels: Genetics, human biodiversity