Unleash the data kraken!


The Reich lab has done a mitzvah and released a huge merged dataset of their modern and ancient populations in a big tarball. Actually, there are two files. One of them has a larger number of individuals with 600,000 SNPs (includes “Human Origins Array”) and the other has 1,200,000 SNPs, but fewer individuals. Both are in EIGENSTRAT format.

For the convenience of readers who are more comfortable in PLINK/PEDIGREE format, I’ve converted them, and replaced the family ID column with population labels. The links take you to a zip file that has the three files for the binary format.
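If you want to redo the relabeling yourself, here is a minimal Python sketch. It is not necessarily how the files above were produced. It assumes the EIGENSTRAT .ind file has three whitespace-separated columns (sample ID, sex, population label), that the PLINK .fam file has the usual six columns (family ID, individual ID, father, mother, sex, phenotype), and that the individual IDs in the .fam match the sample IDs in the .ind. The function names are made up for illustration.

```python
def population_by_sample(ind_lines):
    """Map sample ID -> population label from EIGENSTRAT .ind lines."""
    labels = {}
    for line in ind_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sample_id, population = fields[0], fields[2]
        labels[sample_id] = population
    return labels


def relabel_fam(fam_lines, ind_lines):
    """Return .fam lines with the family ID replaced by the population label.

    Assumption: column 2 of the .fam (individual ID) matches column 1 of
    the .ind (sample ID). Samples without a match keep their original
    family ID.
    """
    labels = population_by_sample(ind_lines)
    relabeled = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a valid .fam row
        individual_id = fields[1]
        fields[0] = labels.get(individual_id, fields[0])
        relabeled.append(" ".join(fields))
    return relabeled


if __name__ == "__main__":
    # Toy rows, invented for illustration only.
    ind = ["S1 M Onge", "S2 F Han"]
    fam = ["1 S1 0 0 1 -9", "2 S2 0 0 2 -9"]
    for row in relabel_fam(fam, ind):
        print(row)
```

Working on lists of lines rather than file paths keeps the sketch easy to test; in practice you would read the .ind and .fam files, pass their lines in, and write the result back out as the new .fam.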

The people of the Andaman Islands are not genetic fossils

So this is in the news, Police: American adventurer John Allen Chau killed by isolated Sentinelese tribe on Indian island. There is some talk about whether the guy was a Christian missionary or not, but that’s not really too relevant. Whether he believes in evolution or not (he was a graduate of a very conservative Christian college), he definitely won a Darwin award before he expired.

North Sentinel is totally isolated, and the people who live there, the Sentinelese, are out of contact with the rest of the world. They are hostile to the outside world. And this is probably why the Sentinelese are still around, as the outside world does not have a good track record with hunter-gatherers. The Andamanese as a whole had a reputation for being very hostile to outsiders, as traders knew not to stop too long for water.

Because the Sentinelese are back in the news, lots of stuff is being said about them in terms of their ancestry.

First, they are not that genetically unique. A recent paper on the genetics of Southeast Asia using ancient samples makes their affinities clear. The Onge, an Andamanese tribe, are positioned close to the two ancient samples from Laos and Malaysia. They emerge out of the same milieu as Paleolithic Southeast Asians (whose Hoabinhian culture persisted deep into the Holocene).

The Andamanese themselves are probably from mainland Southeast Asia. The gap between the islands and the mainland was smaller ~20,000 years ago when the sea levels were lower. They could have come up from the south or the north.

Second, they are not the most “ancient” people. That doesn’t make any sense. We are all people who are equally ancient. We all descend by and large outside of Africa from a migratory wave that expanded ~60,000 years ago. Andamanese, Chinese, and Europeans. What is “ancient” about them is that they are hunter-gatherers who have continued to practice that mode of production down to the present. But that’s a matter of culture and not genetics.

Third, in alignment with the above two points, they are not uniquely and distinctly isolated from all other human populations. They are not descendants of an early wave out of Africa preserved on these islands. They are not distinct from all other non-Africans. Rather, they seem to be closer to the peoples of Oceania, Papuans, and Australian Aboriginals, than Northeast Asians. And closer to Northeast Asians than they are to West Eurasians. The latest evidence is that the Andamanese were part of a broader diversification of lineages ~40-50,000 years ago to the east of India that gave rise to the peoples of the western Pacific Rim. Within this broader set of groups, some form a distinct clade that does not include Northeast Asians (these are often labeled “Australasian”).

Finally, the census size for the Sentinelese is in the range of 100 individuals. This seems on the edge of viability over the long term.

It’s raining selective sweeps

A week ago a very cool new preprint came out, Identifying loci under positive selection in complex population histories. It’s something that you couldn’t even have imagined just ten years ago. The authors basically figure out ways to identify deviations of markers from expected allele frequency given a null neutral evolutionary model. The method is put first, which I really like, before getting to results or discussion. Additionally, they did a lot of simulation ahead of time. The sort of simulation that was really not possible before the sort of computational resources we have now.

Here’s the abstract:

Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method – called Graph-aware Retrieval of Selective Sweeps (GRoSS) – has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in particular bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.

On a related note in regards to selection, On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen. The authors are responding to a recent preprint criticizing their earlier work. The reason that it’s fascinating to me is that these sorts of arguments today are really concrete and not so theoretical. There’s a lot of data for analytic techniques to chew through, and computation has really transformed the possibilities.

A generation ago these sorts of debates would be a sequence of “you’re wrong!” vs. “no, you’re wrong!” Today the disputes involve a lot of data, and so have a reasonable chance of resolution.

The first preprint identifies the usual candidates in humans that you normally see, and expected targets in cattle and cod. Sure, that will give biologists more interested in mechanisms and pathways things to chew upon, but imagine once researchers have large numbers of genomes for thousands and thousands of species. Then they’ll be testing deviations from neutral allele frequencies across many trees, and getting a more general and abstract sense of the parameter space that selection explores, conditional on particularities of evolutionary history.

This is why I’m excited about plans to sequence lots and lots of species.

The post-neutral human genome (the Kern-Hahn era)

If you have any background in evolutionary biology you are probably aware of the controversy around the neutral theory of molecular evolution. Fundamentally a theoretical framework, and instrumentally a null hypothesis, it came to the foreground in the 1970s just as empirical molecular data in evolutionary biology was becoming a thing.

At the same time that Motoo Kimura and colleagues were developing the formal mathematical framework for the neutral theory, empirical evolutionary geneticists were leveraging molecular biology to more directly assay natural allelic variation. In 1966 Richard Lewontin and John Hubby presented results which suggested far more variation than they had been expecting. Lewontin argued in the early 1970s that, given their data, the neutral model was actually a natural extension of the “classical” model of expected polymorphism as outlined by R. A. Fisher, as opposed to the “balance school” of Sewall Wright. In short, Lewontin proposed that the extent of polymorphism was too great to explain in the context of the dynamics of the balance school (e.g., segregation load and its impact on fitness), where numerous selective forces maintained variation. The classical school emphasized both strong selective sweeps on favored alleles and strong constraint against most new mutations.

And yet one might expect low levels of polymorphism from the classical school. The way in which the neutral framework was a more natural extension of this model is that even if most inter-specific variation, most substitutions across species, are due to selectively neutral variants, most variants could nevertheless be deleterious and so constrained. Alleles which increase in frequency may have done so through positive selection, or, just random drift. Not balancing forces like diversifying selection and overdominance.

The general argument around neutral theory generated much acrimony and spilled out from the borders of population genetics and molecular evolution to evolutionary biology writ large. Stephen Jay Gould, Simon Conway Morris, and Richard Dawkins were all under the shadow of neutral theory in their meta-scientific spats about adaptation and contingency.

That was then, this is now. I’ve already stated that sometimes people overplay how much genomics has transformed our understanding of evolutionary biology. But in the arguments around neutral theory, I do think it has had a salubrious impact on the tone and quality of the discourse. Neutral theory and the great controversies flowered and flourished in an age where there was some empirical data to support everyone’s position. But there was never enough data to resolve the debates.

From where I stand, I think we’re moving beyond that phase in our intellectual history. To be frank, some of the older researchers who came up in the trenches when Kimura and his bête noire John Gillespie were engaged in a scientific dispute which went beyond conventional collegiality seem to retain the scars of that era. But younger scientists are more sanguine, whatever their current position might be, because they anticipate that the data will ultimately adjudicate, because there is so much of it.

With that historical context, consider a new paper, Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences:

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.

This is not an entirely surprising result. Some researchers in human genetics have been arguing for the pervasiveness of background selection, selection against deleterious alleles which affects nearby regions, for nearly a decade. In contrast, there are others who argue selective sweeps driven by positive selection are important in determining variation. Unlike the 1970s and 1980s these researchers don’t evince much acrimony, in part because the data keeps coming, and ultimately they’ll probably converge on the same position. And, the results may differ by species or taxon.

If you want a less technical overview than the paper, Kelley Harris has an excellent comment accompanying it. If you want to know what I mean by the Kern-Hahn era, it’s a joke due to the publication of The Neutral Theory in Light of Natural Selection.

Finally, some of you might wonder about the implications for demographic inference which preoccupies me so much on this weblog. In the big picture, it probably won’t change a lot, but it will be important for the details. So this is a step forward. That being said, the possibility of variable mutation rates and recombination rates across time and between lineages is also probably quite important.
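To make the filter from the abstract concrete, here is a rough Python sketch of the two conditions it names: recombination rate above 1.5 cM/Mb, and mutation types C↔G or A↔T. This is only my reading of the abstract, not the authors’ pipeline. The function name, the argument layout, and the toy SNPs are all assumptions for illustration; a real analysis would pull the local recombination rate from a genetic map.

```python
# Allele pairs that swap within a GC class (C<->G) or within an AT class
# (A<->T), so GC-biased gene conversion should not favor either allele.
GBGC_NEUTRAL_PAIRS = {frozenset("CG"), frozenset("AT")}


def is_putatively_unbiased(ref, alt, recomb_rate_cm_per_mb, min_rate=1.5):
    """True if a SNP passes both conditions named in the abstract.

    ref, alt: single-letter alleles.
    recomb_rate_cm_per_mb: local recombination rate (assumed to be supplied
    by the caller from some genetic map).
    min_rate: the 1.5 cM/Mb threshold quoted in the abstract.
    """
    alleles = frozenset((ref.upper(), alt.upper()))
    return alleles in GBGC_NEUTRAL_PAIRS and recomb_rate_cm_per_mb > min_rate


if __name__ == "__main__":
    # Hypothetical SNPs: (ref, alt, recombination rate in cM/Mb).
    toy_snps = [("C", "G", 2.0), ("A", "G", 2.0), ("A", "T", 0.5)]
    kept = [snp for snp in toy_snps if is_putatively_unbiased(*snp)]
    print(kept)
```

Only the first toy SNP survives: the second is an A↔G change, and the third sits in a low-recombination region where background selection is expected to matter more.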

The population genetic structure of China (through noninvasive prenatal testing)


This week a big whole genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

The derived SNP that causes dry earwax was not found in all non-Africans

A new paper on Chinese genomics using over a hundred thousand low-coverage genomes from NIPT screenings is making some waves. I’ll probably talk about the paper at some point. But I want to highlight the frequency of rs17822931 in Han Chinese. It’s pretty incredible how high it is.

Because the derived variant of this SNP, which is correlated with dry flaky earwax when present in the homozygous state, is also associated with less body odor, it has been studied extensively by East Asian geneticists. Basically, individuals who are homozygous for the ancestral variant, which is the norm in Europe, the Middle East, and Africa, tend to have more body odor. In East Asia these people are a minority, and in societies and contexts where body odor is offensive they are subject to more ostracism (some of the studies in Japan were motivated by conscripts who elicited complaints from their colleagues).
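Since the dry earwax trait shows up in homozygotes, it can help to translate an allele frequency into genotype frequencies. Here is a back-of-the-envelope Python sketch that assumes Hardy-Weinberg proportions (random mating, no structure), which is an assumption on my part and will not hold exactly in a structured population. The frequency of 0.9 in the example is hypothetical, not a number from the paper.

```python
def hardy_weinberg_genotypes(derived_freq):
    """Genotype fractions for a biallelic SNP under Hardy-Weinberg.

    derived_freq: frequency of the derived allele, between 0 and 1.
    Returns fractions of derived homozygotes, heterozygotes, and
    ancestral homozygotes.
    """
    if not 0.0 <= derived_freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = derived_freq
    q = 1.0 - p
    return {
        "derived_homozygote": p * p,    # dry earwax genotype
        "heterozygote": 2.0 * p * q,
        "ancestral_homozygote": q * q,
    }


if __name__ == "__main__":
    # Hypothetical derived allele frequency, chosen for illustration.
    for genotype, fraction in hardy_weinberg_genotypes(0.9).items():
        print(f"{genotype}: {fraction:.2f}")
```

With a derived allele frequency of 0.9, about 81% of people would be derived homozygotes and only about 1% ancestral homozygotes, which is why the ancestral homozygotes are such a small minority when the derived allele is common.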

The relatively low frequency in Guangxi is to be expected. This province was Sinicized only recently. As in, the last 500 years. And it still retains a huge ethnic minority population, and many of the Han in the province likely have that ancestry. But the question still arises: why do the Han have such a high frequency of rs17822931?

Here’s a plot of frequencies:

Chinese and Indian American population genetic structure

In Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past David Reich makes the observation that India is a nation of many different ethnicities, while China is dominated by a single ethnicity, the Han. This is obviously true, more or less. Even today the vast majority of Indians seem to be marrying within their own communities, jati.

Over the years I’ve collected many different genotypes of Americans of various origins who have purchased personal genomics kits, and given me their raw results. I decided to go through my collection and strip detailed ethnic labels and simply group together all those individuals from India, and China, who have had their genotypes done from one of the major services.

I suspect that these individuals are representative of “Indian Americans” and “Chinese Americans.” So what’s their genetic structure?

How related should you expect relatives to be?

Like many Americans in the year 2018 I’ve got a whole pedigree plugged into personal genomic services. I’m talking from grandchild to grandparent to great-aunt/uncles. A non-trivial pedigree. So we as a family look closely at these patterns, and we’re not surprised at this point to see really high correlations in some cases compared to what you’d expect (or low).

This means that you can see empirically the variation between relatives of the same nominal degree of separation from a person of interest. For example, without any prior information each of my children’s grandparents is expected to contribute 25% of their autosomal genome. But I actually know the variation of contribution empirically. For example, my father is enriched in my daughter. My mother is enriched in my sons.

The same principle applies to siblings. Though they should be 50% related on their autosomal genome, it turns out there is variation. I’ve seen some papers with large data sets (e.g., 20,000 sibling pairs) which give a standard deviation of 3.7% in relatedness. But what about other degrees of relation?
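The arithmetic behind those expectations is simple enough to sketch in Python. The expected autosomal share from a direct ancestor halves with each generation, and the 3.7% standard deviation quoted above gives a feel for how far siblings can stray from 50%. The function names are made up for illustration, and using two standard deviations as the spread is a rough convention I am choosing here, not something from the papers.

```python
def expected_ancestor_share(generations):
    """Expected autosomal fraction inherited from one direct ancestor.

    generations: 1 for a parent, 2 for a grandparent, and so on.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    return 0.5 ** generations


def spread(mean, standard_deviation, n_sd=2.0):
    """Interval of n_sd standard deviations either side of the mean."""
    return (mean - n_sd * standard_deviation, mean + n_sd * standard_deviation)


if __name__ == "__main__":
    print(f"grandparent: {expected_ancestor_share(2):.1%}")
    # Sibling mean of 50% and SD of 3.7% are the figures quoted in the text.
    low, high = spread(0.50, 0.037)
    print(f"siblings within 2 SD: {low:.1%} to {high:.1%}")
```

So a grandparent is expected to contribute 25%, and two standard deviations around the sibling mean runs from roughly 42.6% to 57.4%.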

David Burbridge’s 10 questions for A. W. F. Edwards in 2006

A few years ago I watched a documentary about the rise of American-influenced rock music in Britain in the 1960s. At some point, one of the Beatles, probably Paul McCartney, or otherwise Eric Clapton, was quoted as saying that they wanted to introduce Americans to “their famous people.” Though patronizing and probably wrong, what they were talking about is that there were particular blues musicians who were very influential in some British circles but were lingering in obscurity in the United States of America due to racial prejudice. The bigger picture is that there are brilliant people who for whatever reason are not particularly well known to the general public.

This is why I am now periodically “re-upping” interviews with scientists that we’ve done on this weblog over the past 15 years. These are people who should be more famous. But aren’t necessarily.

In 2006 David Burbridge, a contributor to this weblog and a historian of things Galtonian, interviewed the statistical geneticist A. W. F. Edwards. Edwards was one of R. A. Fisher’s last students, so he has a connection to a period of history that is passing us by.

I do want to say that his book, Foundations of Mathematical Genetics, really gave me a lot of insights when I first read it in 2005 and began to be deeply interested in pop gen. It’s dense. But short. Additionally, I have also noticed that there is now a book out which is a collection of Edwards’ papers, with commentaries, Phylogenetic Inference, Selection Theory, and a History of Science. Presumably, it is like W. D. Hamilton’s Narrow Roads of Gene Land series. I wish more eminent researchers would publish these sorts of compilations near the end of their careers.

There have been no edits below (notice the British spelling). But I did add some links!

David’s interview begins after this point:

My interview of James F. Crow in 2006

Since the death of L. L. Cavalli-Sforza I’ve been thinking about the great scientists who have passed on. Last fall, I mentioned that Mel Green had died. There was a marginal personal connection there. I had the privilege to talk to Green at length about sundry issues, often nonscientific. He was someone who had been doing science so long he had talked to Charles Davenport in the flesh (he was not complimentary of Davenport’s understanding of Mendelian principles). It was like engaging with a history book!

A few months before I emailed Cavalli-Sforza, I had sent a message on a lark to James F. Crow. It was really a rather random thing, I never thought that Crow would respond. But in fact he emailed me right back! And he answered 10 questions from me, as you can see below the fold. The truth is I probably wouldn’t have thought to try and get in touch with Cavalli-Sforza if it hadn’t been so easy with Crow.

If you are involved in population genetics you know who Crow is. No introduction needed. Some of the people he supervised, such as Joe Felsenstein, have gone on to transform evolutionary biology in their own turn.

Born in 1916, Crow’s scientific career spanned the emergence of population genetics as a mature field, to the discovery of the importance of DNA, to molecular evolution & genomics. He had a long collaboration with Motoo Kimura, the Japanese geneticist instrumental in pushing forward the development of “neutral theory.”

He died in 2012.

Below are the questions I asked 12 years ago. My interests have changed somewhat, so it’s interesting to see what I was curious about back then. And of course fascinating to read Crow’s responses.