If you meet a model, kill it!

If you are awake in the year 2019, you have heard of “machine learning.” And if you listened to my podcast The Insight, you know that Andy Kern’s lab at the University of Oregon is leveraging machine learning (and “deep learning” and “neural networks”) for population genetics.

Now, obviously, in population genetics you know that models are a big deal. The Hardy-Weinberg model. The coalescent. Various models of selection against which you can test data. This is not a coincidence. In the 20th century population genetics was a data-poor field, and a lot of work was done in the theoretical space since that’s where the work could be done (here’s to you, two-locus models of selection from the 1970s!).

In the 2000s genomics transformed the landscape. All of a sudden there was a surfeit of data. On the one hand, this meant that there was a lot of material for models to work on. On the other hand, it turns out that some models don’t scale well to big data, nor are they very robust (one reason for the persistence of single-locus phylogenetic models around mtDNA and Y is their elegant tractability).

This is where a “bottom-up” machine learning approach comes into the picture. Kern’s group just came out with a new preprint I’ve been hearing about for a while, Predicting Geographic Location from Genetic Variation with Deep Neural Networks:

Most organisms are more closely related to nearby than distant members of their species, creating spatial autocorrelations in genetic data. This allows us to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin. Here we describe a deep learning method, which we call Locator, to accomplish this task faster and more accurately than existing approaches. In simulations, Locator infers sample location to within 4.1 generations of dispersal and runs at least an order of magnitude faster than a recent model-based approach. We leverage Locator’s computational efficiency to predict locations separately in windows across the genome, which allows us to both quantify uncertainty and describe the mosaic ancestry and patterns of geographic mixing that characterize many populations. Applied to whole-genome sequence data from Plasmodium parasites, Anopheles mosquitoes, and global human populations, this approach yields median test errors of 16.9km, 5.7km, and 85km, respectively.
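For readers who want a concrete picture of what that kind of prediction looks like, here is a minimal toy sketch: a small dense network that regresses a genotype matrix onto latitude and longitude. To be clear, this is my own illustration with made-up dimensions and hyperparameters, not the architecture or code from the preprint.

```python
# A toy illustration of the general idea, not the preprint's architecture:
# a small dense network regressing genotypes onto (latitude, longitude).
# All dimensions and hyperparameters below are made-up assumptions.
import numpy as np
from tensorflow import keras

n_samples, n_snps = 500, 10_000  # hypothetical training panel
genotypes = np.random.randint(0, 3, size=(n_samples, n_snps)).astype("float32")
coords = np.random.uniform(-50.0, 50.0, size=(n_samples, 2)).astype("float32")  # lat, long

def build_model(n_loci):
    """Return a compiled regressor from genotype counts to (lat, long)."""
    model = keras.Sequential([
        keras.Input(shape=(n_loci,)),
        keras.layers.Dense(256, activation="elu"),
        keras.layers.Dropout(0.25),
        keras.layers.Dense(256, activation="elu"),
        keras.layers.Dense(2),  # predicted latitude and longitude
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Train on samples of known origin, then predict locations for new samples.
model = build_model(n_snps)
model.fit(genotypes, coords, epochs=10, batch_size=32,
          validation_split=0.2, verbose=0)
predicted_coords = model.predict(genotypes[:5], verbose=0)
```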

Readers of this weblog can jump to the empirical examples from the HGDP. They make sense, and I especially liked the local ancestry deconvolution analysis and the variation in predictive power conditional on recombination.

Sometimes quantity has a quality all its own, and the eye-opening aspect of Locator is how it can test a lot of propositions quickly (this is more important in the era of WGS datasets). It’s no joke that dispensing with a model can speed things up.
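Continuing the toy sketch above, the windowed trick from the abstract might look something like this: refit the little regressor on one chunk of SNPs at a time and look at how much the per-window predictions scatter. Again, this is my reading of the idea rather than their code, and the window size is arbitrary.

```python
# Continues the sketch above (reuses genotypes, coords, n_snps, build_model).
# Fit one window of SNPs at a time; the scatter of per-window predictions for a
# sample gives a rough sense of uncertainty and of geographic mixing along the genome.
import numpy as np

window_size = 1_000  # SNPs per window; purely illustrative
per_window_preds = []
for start in range(0, n_snps, window_size):
    window = genotypes[:, start:start + window_size]
    m = build_model(window.shape[1])
    m.fit(window, coords, epochs=10, batch_size=32, verbose=0)
    per_window_preds.append(m.predict(window[:1], verbose=0)[0])  # one focal sample

per_window_preds = np.array(per_window_preds)
centroid = per_window_preds.mean(axis=0)  # point estimate of origin
spread = per_window_preds.std(axis=0)     # crude per-axis uncertainty
```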

One minor element I’ll note is that getting Locator installed is not trivial from what I have seen, especially the TensorFlow dependency. So I’ll probably have more updates once I get it up and running myself.
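In the meantime, a quick way to check whether the TensorFlow part of the stack is behaving before worrying about Locator itself:

```python
# Sanity check for the TensorFlow dependency: prints the version and any visible GPUs.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```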

Machine learning swallowing population genetics = understanding patterns in population genomics

Dan Schrider and Andy Kern have a new review preprint out, Machine Learning for Population Genetics: A New Paradigm. On Twitter there has already been a little snark to the effect of “oh, you mean regression?” That’s fair enough, and the preprint would probably benefit from a lower-key title, though that’s really the sort of title journals seem to love.

I would recommend this preprint to two large groups of my readers. First, those with strong computational skills who are curious about biology: it makes clear why population genomics benefits from machine learning methods. Second, those who are interested or trained in genetics but have less of a computational and pop gen background.

Yes, all models are wrong. But some give insight, and some are just not salvageable. In population genomics, some of the model-building is obviously starting to yield really fragile results.