It’s raining selective sweeps

A week ago a very cool new preprint came out, Identifying loci under positive selection in complex population histories. It’s something you couldn’t even have imagined just ten years ago. The authors basically figure out how to identify markers whose allele frequencies deviate from the expectation under a null neutral evolutionary model. The method is put first, before the results or discussion, which I really like. Additionally, they did a lot of simulation ahead of time, the sort of simulation that simply wasn’t possible before the computational resources we have now.

Here’s the abstract:

Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method – called Graph-aware Retrieval of Selective Sweeps (GRoSS) – has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in particular bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.
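The core of the approach is easier to see with a toy example. Under pure drift, an allele’s frequency in a daughter population is centered on the ancestral frequency, with variance proportional to p(1 - p) times the drift accumulated on the branch, so markers that deviate far beyond that null variance stand out. Here is a minimal sketch in R of that flavor of test. To be clear, this is emphatically not the GRoSS implementation; the drift parameter and the planted sweep are invented for illustration.

set.seed(42)
n_snps <- 10000
F_drift <- 0.02                     # assumed drift on the branch (made up)
p0 <- runif(n_snps, 0.05, 0.95)     # "ancestral" allele frequencies
p1 <- rnorm(n_snps, mean = p0, sd = sqrt(p0 * (1 - p0) * F_drift))
p1 <- pmin(pmax(p1, 0), 1)          # keep frequencies in [0, 1]
p0[5000] <- 0.2; p1[5000] <- 0.6    # plant one artificial "sweep"
chi2 <- (p1 - p0)^2 / (p0 * (1 - p0) * F_drift)  # per-SNP statistic vs. the null
head(order(chi2, decreasing = TRUE))             # SNP 5000 should top the list

GRoSS does something far more careful, propagating these expectations through every branch of an admixture graph, but the basic logic of testing observed frequencies against a drift-derived null is the same.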

On a related note in regards to selection, On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen. The authors are responding to a recent preprint criticizing their earlier work. The reason it’s fascinating to me is that these sorts of arguments today are really concrete and not so theoretical. There’s a lot of data for analytic techniques to chew through, and computation has really transformed the possibilities.

A generation ago these sorts of debates would be a sequence of “you’re wrong!” vs. “no, you’re wrong!” Today the disputes involve a lot of data, and so have a reasonable chance of resolution.

The first preprint identifies the usual candidates in humans that you normally see, and expected targets in cattle and cod. Sure, that will give biologists more interested in mechanisms and pathways things to chew on, but imagine once researchers have large numbers of genomes for thousands and thousands of species. Then they’ll be testing deviations from neutral allele frequencies across many trees, and getting a more general and abstract sense of the parameter space that selection explores, conditional on the particularities of evolutionary history.

This is why I’m excited about plans to sequence lots and lots of species.

The post-neutral human genome (the Kern-Hahn era)

If you have any background in evolutionary biology you are probably aware of the controversy around the neutral theory of molecular evolution. Fundamentally a theoretical framework, and instrumentally a null hypothesis, it came to the foreground in the 1970s just as empirical molecular data in evolutionary biology was becoming a thing.

At the same time that Motoo Kimura and colleagues were developing the formal mathematical framework for the neutral theory, empirical evolutionary geneticists were leveraging molecular biology to more directly assay natural allelic variation. In 1966 Richard Lewontin and John Hubby presented results which suggested far more variation than they had been expecting. Lewontin argued in the early 1970s that their data and the neutral model were actually a natural extension of the “classical” model of expected polymorphism outlined by R. A. Fisher, as opposed to the “balance” school of Sewall Wright. In short, Lewontin proposed that the extent of polymorphism was too great to explain in the context of the dynamics of the balance school (e.g., segregation load and its impact on fitness), in which numerous selective forces maintained variation. The classical school emphasized both strong selective sweeps on favored alleles and strong constraint against most new mutations.

And yet one might expect low levels of polymorphism from the classical school too. The way in which the neutral framework was a more natural extension of this model is that even if most inter-specific variation, that is, most substitutions across species, is due to selectively neutral variants, most new variants could nevertheless be deleterious and so constrained. Alleles which increase in frequency may have done so through positive selection, or just random drift, not balancing forces like diversifying selection and overdominance.

The general argument around neutral theory generated much acrimony and spilled out from the borders of population genetics and molecular evolution to evolutionary biology writ large. Stephen Jay Gould, Simon Conway Morris, and Richard Dawkins were all under the shadow of neutral theory in their meta-scientific spats about adaptation and contingency.

That was then, this is now. I’ve already stated that sometimes people overplay how much genomics has transformed our understanding of evolutionary biology. But in the arguments around neutral theory, I do think it has had a salubrious impact on the tone and quality of the discourse. Neutral theory and the great controversies flowered and flourished in an age where there was some empirical data to support everyone’s position. But there was never enough data to resolve the debates.

From where I stand, I think we’re moving beyond that phase in our intellectual history. To be frank, some of the older researchers who came up in the trenches, when Kimura and his bête noire John Gillespie were engaged in a scientific dispute which went beyond conventional collegiality, seem to retain the scars of that era. But younger scientists are more sanguine, whatever their current position might be, because they anticipate that the data will ultimately adjudicate: there is just so much of it.

With that historical context, consider a new paper, Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences:

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.
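To make the conditioning at the end of the abstract concrete, here is a minimal sketch of that kind of filter in R. The data frame and its column names (ref, alt, rec_rate) are hypothetical stand-ins, not the authors’ pipeline.

# hypothetical SNP table; in real use you would join variants to a genetic map
snps <- data.frame(
  ref      = c("C", "A", "G", "C"),
  alt      = c("G", "T", "A", "T"),
  rec_rate = c(2.1, 0.4, 1.8, 3.0)  # local recombination rate in cM/Mb
)

# High recombination weakens linked (background) selection, and
# GC-conservative changes (C<->G, A<->T) are invisible to gBGC.
gc_conservative <- (snps$ref == "C" & snps$alt == "G") |
                   (snps$ref == "G" & snps$alt == "C") |
                   (snps$ref == "A" & snps$alt == "T") |
                   (snps$ref == "T" & snps$alt == "A")
unbiased <- snps[snps$rec_rate > 1.5 & gc_conservative, ]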

This is not an entirely surprising result. Some researchers in human genetics have been arguing for the pervasiveness of background selection, selection against deleterious alleles which affects linked nearby regions, for nearly a decade. In contrast, there are others who argue that selective sweeps driven by positive selection are important in determining variation. Unlike the 1970s and 1980s, these researchers don’t evince much acrimony, in part because the data keeps coming, and ultimately they’ll probably converge on the same position. And the results may differ by species or taxon.

If you want a less technical overview than the paper, Kelley Harris has an excellent comment accompanying it. If you want to know what I mean by the Kern-Hahn era, it’s a joke referring to the publication of The Neutral Theory in Light of Natural Selection.

Finally, some of you might wonder about the implications for the demographic inference which preoccupies me so much on this weblog. In the big picture, it probably won’t change a lot, but it will be important for the details. So this is a step forward. That being said, the possibility of mutation rates and recombination rates that vary across time and between lineages is also probably quite important.

The population genetic structure of China (through noninvasive prenatal testing)


This week a big whole-genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

Read More

The derived SNP that causes dry earwax was not found in all non-Africans

A new paper on Chinese genomics using low-coverage data from hundreds of thousands of NIPT screenings is making some waves. I’ll probably talk about the paper at some point. But here I want to highlight the frequency of rs17822931 in Han Chinese. It’s pretty incredible how high it is.

Because the derived variant, which is correlated with dry flaky earwax when present in homozygote genotypes, is also associated with less body odor, it has been studied extensively by East Asian geneticists. Basically, individuals who are homozygous for the ancestral allele, which is the norm in Europe, the Middle East, and Africa, tend to have more body odor. In East Asian societies and contexts where this is considered offensive, such individuals are subject to more ostracism because they are a minority (some of the studies in Japan were motivated by conscripts who elicited complaints from their colleagues).

The relatively low frequency in Guangxi is to be expected. This province was Sinicized only recently, as in the last 500 years. It still retains a huge ethnic minority population, and many of the Han in the province likely have that ancestry. But the question still arises: why do the Han have such a high frequency of the derived allele at rs17822931?
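One way to get a feel for the quantitative scale of that question is the textbook deterministic recursion for an additive allele, where the frequency changes by s times p(1 - p) per generation. The numbers below are pure illustration, not estimates for rs17822931.

# deterministic trajectory under additive selection coefficient s
trajectory <- function(p0, s, generations) {
  p <- numeric(generations)
  p[1] <- p0
  for (t in 2:generations) p[t] <- p[t - 1] + s * p[t - 1] * (1 - p[t - 1])
  p
}

p <- trajectory(p0 = 0.01, s = 0.01, generations = 2000)
which(p > 0.95)[1]  # ~750 generations, i.e. roughly 20,000 years at s = 1%

Even a 1% advantage takes on the order of 20,000 years to push a rare variant to the sorts of frequencies seen in the Han, which is part of why the history of this allele is so interesting.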

Here’s a plot of frequencies:

Read More

Chinese and Indian American population genetic structure

In Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past David Reich makes the observation that India is a nation of many different ethnicities, while China is dominated by a single ethnicity, the Han. This is obviously true, more or less. Even today the vast majority of Indians seem to be marrying within their own communities, their jati.

Over the years I’ve collected many different genotypes of Americans of various origins who have purchased personal genomics kits and given me their raw results. I decided to go through my collection, strip the detailed ethnic labels, and simply group together all the individuals from India and from China who had their genotypes done by one of the major services.

I suspect that these individuals are representative of “Indian Americans” and “Chinese Americans.” So what’s their genetic structure?

Read More

How related should you expect relatives to be?

Like many Americans in the year 2018 I’ve got a whole pedigree plugged into personal genomic services, from grandchild to grandparent to great-aunts and great-uncles. A non-trivial pedigree. So we as a family look closely at these patterns, and at this point we’re not surprised to see relatedness in some pairs that is really high (or low) compared to what you’d expect.

This means that you can see empirically the variation between relatives of the same nominal degree of separation from a person of interest. For example, absent any other information, each of my children’s grandparents is expected to contribute 25% of their autosomal genome. But I actually know the variation in contribution empirically. For example, my father’s contribution is enriched in my daughter, and my mother’s in my sons.

The same principle applies to siblings. Though they should be 50% related across their autosomal genome on average, it turns out there is variation. I’ve seen papers with large data sets (e.g., 20,000 sibling pairs) which give a standard deviation of 3.7% in relatedness. But what about other degrees of relation?
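You can get some intuition for where a number like 3.7% comes from with a crude simulation in R. Treat the autosomes as roughly 90 independently segregating blocks per parent (a stand-in for recombination; the block count here is tuned to match the empirical figure, not derived), and flip a coin for whether two sibs inherit the same parental haplotype at each block.

set.seed(1)
n_pairs  <- 20000
n_blocks <- 90
sib_share <- replicate(n_pairs, {
  mat <- mean(rbinom(n_blocks, 1, 0.5))  # fraction shared from mother
  pat <- mean(rbinom(n_blocks, 1, 0.5))  # fraction shared from father
  (mat + pat) / 2
})
mean(sib_share)  # ~0.50
sd(sib_share)    # ~0.037, matching the standard deviation quoted above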

Read More

David Burbridge’s 10 questions for A. W. F. Edwards in 2006

A few years ago I watched a documentary about the rise of American-influenced rock music in Britain in the 1960s. At some point one of the Beatles, probably Paul McCartney, or otherwise Eric Clapton, was quoted as saying that they wanted to introduce Americans to “their famous people.” Though patronizing and probably wrong, the point was that particular blues musicians who were very influential in some British circles were languishing in obscurity in the United States of America due to racial prejudice. The bigger picture is that there are brilliant people who for whatever reason are not particularly well known to the general public.

This is why I am now periodically “re-upping” interviews with scientists that we’ve done on this weblog over the past 15 years. These are people who should be more famous. But aren’t necessarily.

In 2006 David Burbridge, a contributor to this weblog and a historian of things Galtonian, interviewed the statistical geneticist A. W. F. Edwards. Edwards was one of R. A. Fisher’s last students, so he has a connection to a period of history that is passing us by.

I do want to say that his book, Foundations of Mathematical Genetics, really gave me a lot of insights when I first read it in 2005 and began to be deeply interested in pop gen. It’s dense. But short. I have also noticed that there is now a book out which is a collection of Edwards’ papers, with commentaries, Phylogenetic Inference, Selection Theory, and a History of Science. Presumably, it is like W. D. Hamilton’s Narrow Roads of Gene Land series. I wish more eminent researchers would publish these sorts of compilations near the end of their careers.

There have been no edits below (notice the British spelling). But I did add some links!

David’s interview begins after this point:

Read More

My interview of James F. Crow in 2006

Since the death of L. L. Cavalli-Sforza I’ve been thinking about the great scientists who have passed on. Last fall, I mentioned that Mel Green had died. There was a marginal personal connection there. I had the privilege to talk to Green at length about sundry issues, often nonscientific. He was someone who had been doing science for so long that he had talked to Charles Davenport in the flesh (he was not complimentary of Davenport’s understanding of Mendelian principles). It was like engaging with a history book!

A few months before I emailed Cavalli-Sforza, I had sent a message on a lark to James F. Crow. It was really a rather random thing; I never thought that Crow would respond. But in fact he emailed me right back! And he answered 10 questions from me, as you can see below the fold. The truth is I probably wouldn’t have thought to try to get in touch with Cavalli-Sforza if it hadn’t been so easy with Crow.

If you are involved in population genetics you know who Crow is. No introduction needed. Some of the people he supervised, such as Joe Felsenstein, have gone on to transform evolutionary biology in their own turn.

Born in 1916, Crow had a scientific career that spanned the emergence of population genetics as a mature field, the discovery of the importance of DNA, and the rise of molecular evolution & genomics. He had a long collaboration with Motoo Kimura, the Japanese geneticist instrumental in pushing forward the development of “neutral theory.”

He died in 2012.

Below are the questions I asked 12 years ago. My interests have changed somewhat, so it’s interesting to see what I was curious about back then. And of course fascinating to read Crow’s responses.
Read More

Local ancestry deconvolution made simpler (?)

I’ve been waiting for a local ancestry deconvolution method to come out of Simon Myers’ group for a few years. Well, I think we’re there, Fine-scale Inference of Ancestry Segments without Prior Knowledge of Admixing Groups. Here’s the abstract:

We present an algorithm for inferring ancestry segments and characterizing admixture events, which involve an arbitrary number of genetically differentiated groups coming together. This allows inference of the demographic history of the species, properties of admixing groups, identification of signatures of natural selection, and may aid disease gene mapping. The algorithm employs nested hidden Markov models to obtain local ancestry estimation along the genome for each admixed individual. In a range of simulations, the accuracy of these estimates equals or exceeds leading existing methods that return local ancestry. Moreover, and unlike these approaches, we do not require any prior knowledge of the relationship between sub-groups of donor reference haplotypes and the unseen mixing ancestral populations. Instead, our approach infers these in terms of conditional “copying probabilities”. In application to the Human Genome Diversity Panel we corroborate many previously inferred admixture events (e.g. an ancient admixture event in the Kalash). We further identify novel events such as complex 4-way admixture in San-Khomani individuals, and show that Eastern European populations possess 1-5% ancestry from a group resembling modern-day central Asians. We also identify evidence of recent natural selection favouring sub-Saharan ancestry at the HLA region, across North African individuals. We make available an R and C ++ software library, which we term MOSAIC (which stands for MOSAIC Organises Segments of Ancestry In Chromosomes).
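I haven’t dug into the guts of MOSAIC, but the machinery it builds on is easy to cartoon. A hidden Markov model treats ancestry at each SNP as the hidden state, the observed allele as an emission given each ancestry’s allele frequency, and ancestry switches as rare transitions. Below is a bare-bones two-ancestry forward filter in R; the real method’s nested, haplotype-copying HMMs are far more sophisticated, and all the inputs here are invented.

forward_ancestry <- function(obs, freq1, freq2, switch = 0.01) {
  n <- length(obs)
  trans <- matrix(c(1 - switch, switch, switch, 1 - switch), 2, 2)
  emit <- function(i) c(ifelse(obs[i] == 1, freq1[i], 1 - freq1[i]),
                        ifelse(obs[i] == 1, freq2[i], 1 - freq2[i]))
  f <- matrix(0, n, 2)
  f[1, ] <- 0.5 * emit(1)
  f[1, ] <- f[1, ] / sum(f[1, ])
  for (i in 2:n) {
    f[i, ] <- (f[i - 1, ] %*% trans) * emit(i)
    f[i, ] <- f[i, ] / sum(f[i, ])  # rescale: row i is P(ancestry | data so far)
  }
  f
}

obs   <- c(1, 1, 1, 0, 0, 0)  # observed alleles at 6 SNPs (invented)
freq1 <- rep(0.9, 6)          # ancestry 1 carries the "1" allele at high frequency
freq2 <- rep(0.1, 6)          # ancestry 2 carries it at low frequency
round(forward_ancestry(obs, freq1, freq2), 3)

The filter’s support shifts from ancestry 1 to ancestry 2 as the observed alleles change (a full forward-backward pass would sharpen the switch point), which is local ancestry “deconvolution” in miniature.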

The truth is I’ve only done a quick skim of the preprint and have not run the method myself to see how it works. And to be honest, I can’t see where the part about Eastern Europeans is in the manuscript (I checked the supporting text). That being said, if you run a PCA, many Northern and most Eastern Europeans are clearly shifted toward East Asians compared to Southern Europeans. So I accept it.

In any case, always remember, all models are wrong. But some of them have insight.

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command


Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig quickly into empirical data sets while they’re still getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three-population and four-population tests, Eigensoft and AdmixTools, as well as fineStructure.
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux or Linux-like environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications to the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress into a folder, ancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions; just drop them into the ancestry/ folder. You may need some libraries installed on Ubuntu too if you recompile yourself. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a set of plink files, Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes data sets. There are 4,899 individuals in the data, with 135,000 high-quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows corresponding to each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you could use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujarati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.).

grep "Yoruba\|Great\|Guj\|Han\|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. More importantly, there are far fewer groups to visualize and analyze.

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you edit the scripts or make them more extensible (I have versions that are more flexible, but I think newbies would just get confused, so I’m keeping it simple).

Assuming I create the plink file above, running the following command means that admixture does K = 2 and treemix fits 1 migration edge (that is, -m 1). The PCA and pairwise Fst run automatically.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture run will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh: go to line 16, where you see admixture. If you have 4 cores, then add -j4 as a parameter. It will speed admixture up and hog all your cores.

There is a .csv with the admixture output, EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns the K’s. Since K = 2, there are two such columns, space separated. You should be able to open the .csv and process it however you want.
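If you want a quick visualization of those proportions, here is a minimal sketch using the ggplot2 and reshape2 packages you will install below. It assumes the K = 2 layout just described; adjust the column names for other values of K.

library(ggplot2)
library(reshape2)

q <- read.table("EstSubset.admix.csv", header = FALSE, sep = " ")
colnames(q) <- c("group", "id", "K1", "K2")
qm <- melt(q, id.vars = c("group", "id"),
           variable.name = "cluster", value.name = "proportion")

# the classic stacked-bar admixture plot, one panel per group
ggplot(qm, aes(x = id, y = proportion, fill = cluster)) +
  geom_col(width = 1) +
  facet_grid(. ~ group, scales = "free_x", space = "free_x") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())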

You’ll also see two other files: plink.eigenval and plink.eigenvec. These are the generic output files for the PCA. The .eigenvec file has the individuals along with their values for each PC. The .eigenval file shows the magnitude of each dimension. It looks like this:
68.7974
38.4125
7.16859
3.3837
2.05858
1.85725
1.73196
1.63946
1.56449
1.53666

Basically, this means that PC 1 explains twice as much of the variance as PC 2. Beyond PC 4 the dimensions look really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script for that: RPCA.R.
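Before firing up that script, base R alone is enough to turn the eigenvalues into percentages (of the top 10 PCs plink returns, not of all the variance in the data):

ev <- scan("plink.eigenval")
round(100 * ev / sum(ev), 1)  # PC 1 about twice PC 2, the tail bunched up
barplot(100 * ev / sum(ev), names.arg = paste0("PC", seq_along(ev)),
        ylab = "% of variance among top 10 PCs")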

You need to install some packages. First, open R or R Studio. If you want to go command line at the terminal, type R. Then type:
install.packages("ggplot2")
install.packages("reshape2")
install.packages("plyr")
install.packages("ape")
install.packages("igraph")
install.packages("ggplot2")

Once those packages are installed you can use the script:
source("RPCA.R")

Then, to generate the plot at the top of this post:
plinkPCA()

There are some useful parameters in this function. The plot to the left adds shape labels to highlight two populations; a third population I label by individual ID. This second option is important if you want to do outlier pruning, since there are mislabeled, or just plain outlier, individuals in a lot of data (including this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The double greater-than (>>) appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:
source("fst.R")

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.
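If you would rather build the tree yourself from that matrix, the ape package you installed for the scripts can do it directly. This sketch assumes FstMatrix.csv is a square matrix with population names as row and column labels; check yours before running it.

library(ape)

fst  <- as.matrix(read.csv("FstMatrix.csv", row.names = 1))
tree <- nj(as.dist(fst))       # neighbor-joining returns an unrooted tree
plot(tree, type = "unrooted")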

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs per block to 1,000 (so -k 1000), along with global rearrangement. You can change the details in the Perl script itself; look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know, treemix comes with R plotting functions. So run treemix with however many migration edges you want (you can have 0), and when the script is done, load R.

Then:
>source("src/plotting_funcs.R")
>plot_tree("TreeMix")

But actually, you don’t need to do the above. I added a step to pairwise.perl that generates a .png file with the treemix plot. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download the zip or tar.xz file and decompress it. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you that compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages, or the scripts for visualizing the results from PCA and pairwise Fst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a layperson with a nagging question and can’t wait for me, this could get you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.