If you are awake in the year 2019, you have heard of “machine learning.” And if you listened to my podcast The Insight, you know that Andy Kern’s lab at the University of Oregon is leveraging machine learning (and “deep learning” and “neural networks”) for population genetics.
Now, in population genetics, models are obviously a big deal. The Hardy-Weinberg model. The coalescent. Various models of selection against which you can test data. This is not a coincidence. In the 20th century, population genetics was a data-poor field, and a lot of work was done in the theoretical space since that’s where the work could be done (here’s to you, two-locus models of selection from the 1970s!).
In the 2000s, genomics transformed the landscape. All of a sudden there was a surfeit of data. On the one hand, this meant that there was a lot of material for models to work on. On the other hand, it turns out that some models don’t scale well to big data, nor do they turn out to be very robust (one reason for the persistence of single-locus phylogenetic models around mtDNA and Y is their elegant tractability).
This is where a “bottom-up” machine learning approach comes into the picture. Kern’s group just came out with a new preprint I’ve been hearing about for a while, Predicting Geographic Location from Genetic Variation with Deep Neural Networks:
Most organisms are more closely related to nearby than distant members of their species, creating spatial autocorrelations in genetic data. This allows us to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin. Here we describe a deep learning method, which we call Locator, to accomplish this task faster and more accurately than existing approaches. In simulations, Locator infers sample location to within 4.1 generations of dispersal and runs at least an order of magnitude faster than a recent model-based approach. We leverage Locator’s computational efficiency to predict locations separately in windows across the genome, which allows us to both quantify uncertainty and describe the mosaic ancestry and patterns of geographic mixing that characterize many populations. Applied to whole-genome sequence data from Plasmodium parasites, Anopheles mosquitoes, and global human populations, this approach yields median test errors of 16.9km, 5.7km, and 85km, respectively.
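For readers who want a feel for what “predicting location from genotypes” means mechanically, here is a toy sketch. To be clear, this is not Locator and not a neural network: it swaps in plain ridge regression on simulated genotypes, and everything in it (the unit-square map, the cline model for allele frequencies, the names `fit_ridge`, `predict`, and `median_error`, the window size) is an assumption made up for illustration. It only shows the two ideas from the abstract: learn from samples of known origin, then repeat the prediction in windows across the genome to see how much the answers disagree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snps, n_train = 500, 1000, 400

# Toy data (an assumption, not real genotypes): each SNP's allele frequency
# follows a smooth cline across a unit square, so nearby samples look alike.
xy = rng.uniform(0.0, 1.0, size=(n_samples, 2))
slopes = rng.normal(0.0, 1.0, size=(2, n_snps))
intercepts = rng.normal(0.0, 1.0, size=n_snps)
freqs = 1.0 / (1.0 + np.exp(-((xy - 0.5) @ slopes + intercepts)))
genotypes = rng.binomial(2, freqs).astype(float)  # 0, 1, or 2 copies

# Samples of "known origin" for training, the rest held out for testing.
G_train, G_test = genotypes[:n_train], genotypes[n_train:]
xy_train, xy_test = xy[:n_train], xy[n_train:]


def fit_ridge(G, coords, lam=100.0):
    """Ridge regression from genotypes to coordinates (dual form)."""
    g_mean, c_mean = G.mean(axis=0), coords.mean(axis=0)
    Gc = G - g_mean
    alpha = np.linalg.solve(Gc @ Gc.T + lam * np.eye(len(G)), coords - c_mean)
    return g_mean, c_mean, Gc.T @ alpha


def predict(model, G):
    """Predicted (x, y) for each row of G."""
    g_mean, c_mean, weights = model
    return (G - g_mean) @ weights + c_mean


def median_error(pred, truth):
    """Median Euclidean distance between predicted and true locations."""
    return float(np.median(np.linalg.norm(pred - truth, axis=1)))


# Whole-"genome" prediction versus always guessing the mean training location.
pred = predict(fit_ridge(G_train, xy_train), G_test)
model_err = median_error(pred, xy_test)
baseline_err = median_error(xy_train.mean(axis=0), xy_test)

# Windowed predictions: one model per block of 100 SNPs. The spread of the
# per-window guesses gives a rough feel for uncertainty in each sample.
windows = np.array_split(np.arange(n_snps), 10)
window_preds = np.stack([
    predict(fit_ridge(G_train[:, w], xy_train), G_test[:, w]) for w in windows
])
window_spread = window_preds.std(axis=0).mean(axis=1)

print(f"median error: model {model_err:.3f}, baseline {baseline_err:.3f}")
print(f"mean per-sample spread across windows: {window_spread.mean():.3f}")
```

The errors here are in arbitrary map units on a simulated square, so they say nothing about the kilometer figures in the abstract. The point is only that the windowed version is the same fit repeated on subsets of SNPs, which is why speed matters so much for that kind of analysis.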
Readers of this weblog can jump to the empirical examples from the HGDP. They make sense, and I especially liked the local ancestry deconvolution analysis and variation in predictive power conditional on recombination.
Sometimes quantity has a quality all its own, and the eye-opening aspect of Locator is how it can test a lot of propositions quickly (this is more important in the era of WGS datasets). It’s no joke that dispensing with a model can speed things up.
One minor element I’ll note is that getting Locator installed is not trivial from what I have seen, especially the TensorFlow dependency. So I’ll probably have more updates once I get it up and running myself.