Swidden rice farming does not lead to high population density

Admixture on K = 5

I’ve been looking at the data from the recent Munda paper. Standard stuff: admixture, treemix, and f-statistics. The northern Munda samples were collected in Bangladesh. So I thought: I can test the hypothesis that the East Asian ancestry in Bangladesh is to a large part Santhal. After looking at it every which way, I think that in fact, the Munda may not have ever been very populous in much of northeast India. The Santhal are just not a good donor population to Bengalis, at least not when compared with mixes such as Dai + Tamil.

Additionally, the Santhal are really not that well modeled by mixing South Asians with any particular Southeast Asian group, though it works. I think that’s suggestive of the possibility that the Austro-Asiatic group which gave rise to the Munda doesn’t exist in its current form anywhere in Southeast Asia. Also, I think the Lao samples provided in the new paper may have Indian ancestry via admixture from Austro-Asiatic Mon or Khmer groups.

Basically, there is so much bidirectional gene flow that I think it’s really hard to get a grip on what’s going on. Additionally, the Burmese and northeast Indian populations (e.g., the Mizos) clearly have a strand of ancestry that derives from relatively recent migrants that came down from the region of eastern Tibet, and perhaps Sichuan or even further north. And this component shows up in Bengalis as well.

On top of this, there is the “Australo-Melanesian” substrate that is present all across Southeast Asia, and probably was present in modern southern China in the early Holocene, which has distant affinities with the “Ancient Ancestral South Indians” (AASI).

At this point, I keep my own counsel. But there may be an interesting story to tell related to how efficient and effective different forms of agriculture were, and how that interplayed with genes and language.

Running AdmixTools through R – admixr

One of the reasons that I don’t post AdmixTools results too much is that the framework requires more statistical “deep thought” than just popping out a PCA or even running some model-based clustering. Read the methods supplements of one of the Reich lab’s papers, and you’ll see what I’m getting at. But a more prosaic reason is that I generally work in the plink format, and format conversion, as well as editing parameter files, is a pain. In general, I don’t do much “exploratory AdmixTools” stuff for a reason.

Martin Petr has made the second excuse a lot less of an excuse. His admixr package gives one an easy interface into AdmixTools. In particular, it allows one not to have to edit parameter files so much. It took me about 15 minutes to get it downloaded and running. I’m on a Mac and use RStudio for R.

– remember to install wget if you are on a Mac (this will show up if you want to use online datasets)

– You need to make sure to set the path to AdmixTools. In the RStudio console, I just entered:

Sys.setenv(PATH = paste(path.expand("~/MyPath/To/AdmixTools/bin"), Sys.getenv("PATH"), sep = ":"))

If you can get AdmixTools installed in the first place, admixr should be very easy.

Continuous gene flow vs. pulse admixture

In the new preprint Ancient genomics: a new view into human prehistory and evolution the authors write:

The geographic structure of these population transformations gave rise to population structure of present-day Europe. For example Anatolian Neolithic ancestry is highest in southern European populations like Sardinians, and lowest in northern European populations (38). Steppe ancestry is at high frequency in north-central Europeans and low in the south. Isolation-by-distance may have contributed to these patterns to some extent, but the contribution must have been small. In much of Europe, extreme population discontinuity was the norm.

Basically, they are contrasting pulse admixtures with continuous gene flow. One stylized model of the settling of the world after the “Out of Africa” migration is that most of the extant population structure was established by about 20,000 years ago, and much of what has occurred since then has been divergence due to barriers to gene flow, as well as homogenization due to continuous gene flow.

Ancient DNA has basically overthrown that model. There is just too much turnover in some parts of the world in rapid succession for variation to have been patterned exclusively by continuous gene flow. On the other hand, some researchers have felt that pulse admixture is a little overemphasized in the current narrative, in part because it’s a good simplifying model for explaining the origins of daughter populations with roots in two or more parental groups (e.g., model-based clustering and Treemix both assume pulse admixture). That doesn’t mean that this is a correct description of reality, just that it is a tractable one. This sort of concern motivated papers such as A Spatial Framework for Understanding Population Structure and Admixture.
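A toy calculation helps show why pulse admixture is tractable without necessarily being correct. The sketch below is not taken from any of the papers mentioned; the function names, the single source population, the constant migration rate, and the absence of drift are all simplifying assumptions made for illustration. It shows that one pulse and a steady trickle of migrants can leave the same total ancestry proportion.

```python
def ancestry_after_pulse(fraction):
    """Source ancestry after a single pulse that replaces `fraction` of the population."""
    return fraction


def ancestry_after_continuous(rate, generations):
    """Source ancestry after `generations` of migration at a constant per-generation `rate`."""
    native = 1.0
    for _ in range(generations):
        native *= (1.0 - rate)  # each generation a share `rate` is replaced by migrants
    return 1.0 - native


def matching_rate(fraction, generations):
    """Per-generation rate that accumulates to `fraction` after `generations`."""
    return 1.0 - (1.0 - fraction) ** (1.0 / generations)


# Hypothetical numbers: 30% total ancestry, spread over 100 generations.
pulse = ancestry_after_pulse(0.30)
rate = matching_rate(0.30, generations=100)
trickle = ancestry_after_continuous(rate, generations=100)
print(f"pulse: {pulse:.3f}, continuous: {trickle:.3f}, per-generation rate: {rate:.5f}")
```

A summary that reports only ancestry proportions cannot separate these two histories, so telling them apart takes other information, which this sketch does not attempt to model.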

Of course, the “conflict” between people who accept pulse admixture and those who accept continuous gene flow is not a conflict at all. Really it is simply people as a whole attempting to get a better sense of how frequent pulse admixtures are in the context of a demographic landscape of continuous gene flow. This isn’t the 1970s when selectionists and neutralists argued over small crumbs of data. There’s enough data to test a lot of alternatives and slowly but surely converge upon a consensus.

Which brings me to the question: are these dynamics relevant outside of humans? It strikes me that for plants and other sessile organisms we’d assume that continuous gene flow dominates. At the other extreme, you have birds…which are so mobile that I believe continuous gene flow dominates here also. In contrast, land-based tetrapods are much more mobile than plants, but often stymied by temporary barriers such as rivers or rising sea levels. So there would be more pulse admixtures, because continuous gene flow would be interrupted, and then perhaps the barrier would disappear, in which case rapid admixture would occur.

Humans are a curious case because I believe one reason that pulse admixture might be more prevalent is that we create our own barrier. Culture.

Recollections of Mel Green


Mel Green co-taught a “history of genetics” course that I took as a first-year grad student at UC Davis. It was fitting because Mel Green was a living embodiment of the history of genetics. Mine was one of the last years that Mel co-taught that class, so I feel quite privileged.

Unlike some of my friends who have gone through Davis I only had a few conversations with Mel. But he gave us the wisdom of a life of learning and seeing genetics evolve as a discipline over the 20th century. It isn’t often that you talk to someone who could dismiss Charles Davenport because he had talked to the man and judged that he had a poor grasp of Mendelian theory!

Most everyone has a “Mel Green story.” So let me recount mine, though it doesn’t have to do with me as such. Mel lived 101 years, and was active in science by the 1940s. In our history of genetics course we had to give a presentation on a particular topic (mine was on polytene chromosomes). The student who was giving the presentation on Drosophila research was not a genetics student. I had assumed she would be a bit nervous because Mel was a renowned Drosophilist, and he was sitting right there listening to everything.

At some point she began to refer to a researcher, “M Green.” She went on about “M Green” and his work for about five minutes, at one point pausing to note that “M Green” even worked at Davis! At this point the co-instructor had to stop her and tell her that “M Green” was sitting in the room, right next to her. Because the research was published in the 1940s the student had assumed that this was from someone who could never have been alive in the present. But there it was, Mel Green was still with us, a witness to all that history that had come and gone.

The non-European ancestry of Afrikaners


A few years ago I got some South African genotypes. Some of the individuals were clearly African. A few mapped perfectly upon Northern Europeans. But many of the samples consistently were European but shifted toward non-European populations.

Based on the history of the assimilation of slaves into the European population of Cape Colony in the 18th century, my assumption is that these individuals are Afrikaners.

Recently I realized that Brenna Henn had released some more Khoisan samples, so I decided to look at this question of admixture again. The two Khoisan populations are the Nama and the Khomani. I removed those with lots of Bantu and European admixture and combined them together into one population.

Running unsupervised Admixture shows how distinct the South African whites are.

The average Utah white in this sample (this population is a mix of British, German, and Scandinavian in ancestry) is 99% European modal cluster, and 1% South Asian. The average for the white South Africans in this data set is 94% European modal cluster. The residual is 1% East Asian (Dai modal), 1% Khoisan, 1% non-Khoisan African, and 2% South Asian.

I ran Treemix a bunch of times, and every single plot came out like this when I ran it for three migrations:

 

The gene flow from the Utah whites to the Gujaratis is simply an artifact of the fact that the Gujarati sample is mixed caste, and some of the Brahmins or Lohanas have more “Ancestral North Indian.” The gene flow from the Europeans to the Khoisan is probably real, or might be due to pastoralist admixture via East Africans. The last migration arrow goes from the African populations to the South African whites, with a shift toward the Khoisan.

I also ran a three population test where A is the outgroup, and B and C are a clade. A significantly negative f3-statistic indicates admixture in population A. The negative values are listed below:

A             B             C          f3            f3-error     Z-score
Gujrati       Dai           UtahWhite  -0.00121718   0.000140141  -8.68539
South_Africa  EsanNigeria   UtahWhite  -0.00127718   0.000147982  -8.63059
South_Africa  Khoisan_SA    UtahWhite  -0.0012928    0.000151416  -8.53802
Gujrati       South_Africa  Dai        -0.000778791  0.000155656  -5.00329
South_Africa  Dai           UtahWhite  -0.000541974  0.000133262  -4.06699
South_Africa  UtahWhite     Gujrati    -0.000103581  8.46193e-05  -1.22408

This aligns well with the Admixture results. Afrikaners have both kinds of African ancestry, as well as Asian ancestry.
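For readers who want to see why a negative f3 points to admixture, here is a minimal Python sketch. It is not how the numbers above were produced. The allele frequencies are made up, the function names are hypothetical, and the finite-sample corrections and block-jackknife standard errors that real implementations use are left out. The statistic is just the average over SNPs of (A - B) × (A - C), and the Z-score is the statistic divided by its standard error.

```python
def f3_stat(target, source1, source2):
    """Mean over SNPs of (target - source1) * (target - source2)."""
    products = [(a - b) * (a - c) for a, b, c in zip(target, source1, source2)]
    return sum(products) / len(products)


def z_score(f3, std_err):
    """Statistic divided by its standard error."""
    return f3 / std_err


# Hypothetical allele frequencies at five SNPs for two source populations.
pop_b = [0.10, 0.80, 0.35, 0.60, 0.05]
pop_c = [0.70, 0.20, 0.55, 0.10, 0.90]
# A population formed as a 60/40 mix of B and C, with no drift afterwards.
mixed = [0.6 * b + 0.4 * c for b, c in zip(pop_b, pop_c)]

f3_mixed_target = f3_stat(mixed, pop_b, pop_c)    # negative: target lies between the sources
f3_source_target = f3_stat(pop_b, mixed, pop_c)   # positive: B is not a mix of the other two
print(f3_mixed_target, f3_source_target)

# Z-score for the first row of the table, from the f3 and f3-error columns as listed.
first_row_z = z_score(-0.00121718, 0.000140141)
print(round(first_row_z, 2))
```

In this idealized case the mixed population always gives a negative value, because at every SNP its frequency sits between the two sources. With real data, drift after admixture pushes the statistic back toward positive, so a non-negative f3 does not rule admixture out.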

In James Michener’s The Covenant one of the plot lines alludes to mixed ancestry in one of the Afrikaner families. The results above suggest that mixed ancestry is very common, and perhaps ubiquitous, in this population. True, there are some Afrikaners such as Hendrik Verwoerd who migrated to South Africa from the Netherlands in the past century or so, but these are uncommon to my knowledge.

Genetics books for the masses!

Since I’ve become professionally immersed in genetics I haven’t read many books on the topic. I read papers. And I do genetics. But back in the day I did enjoy a good book. The standard recommendation would be to read Matt Ridley’s Genome. It’s a bit dated now (it was published around when the Human Genome Project was being completed), but I’d still recommend it.

But when in the mid-2000s I dabbled a little bit in the world of worm (C. elegans) genetics I read Andrew Brown’s In the Beginning Was the Worm: Finding the Secrets of Life in a Tiny Hermaphrodite. It’s pretty far from my current concerns and fixations, with more of a focus on developmental processes, but it is pretty cool to read about the race to “map” every cell in C. elegans.

The second book I’d recommend to readers of this blog is the late Will Provine’s The Origins of Theoretical Population Genetics. Modern population genomics is a massive edifice built atop the foundations of the early 20th century fusion of Mendelism and the biometrical heirs of Darwin. Provine outlines how primitive genetics eventually seeded the birth of the Neo-Darwinian Synthesis.

Why do percentage estimates of “ancestry” vary so much?

When looking at the results in Ancestry DNA, 23andMe, and Family Tree DNA my “East Asian” percentage is:

– 19%
– 13%
– 6%

What’s going on here? In science we often make a distinction between precision and accuracy. Precision is how much your results vary when you re-run an experiment or measurement. Basically, can you reproduce your result? Accuracy refers to how close your measurement is to the true value. A measurement can be quite precise, but consistently off. Similarly, a measurement may be imprecise, but it bounces around the true value…so it is reasonably accurate if you get enough measurements to cancel out the errors (which are random).
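The distinction can be made concrete with a small simulation. This is only an illustrative sketch: the true value, the bias, and the noise levels are arbitrary numbers chosen for the example, not anything measured.

```python
import random
import statistics

rng = random.Random(42)
TRUE_VALUE = 10.0  # hypothetical quantity being measured

# Precise but inaccurate: tight spread, consistently off by +3.
precise_biased = [rng.gauss(TRUE_VALUE + 3.0, 0.1) for _ in range(10_000)]
# Imprecise but accurate on average: wide spread centred on the true value.
noisy_unbiased = [rng.gauss(TRUE_VALUE, 3.0) for _ in range(10_000)]

for name, values in [("precise/biased", precise_biased),
                     ("noisy/unbiased", noisy_unbiased)]:
    mean = statistics.fmean(values)
    spread = statistics.stdev(values)
    print(f"{name}: mean={mean:.2f} spread={spread:.2f} "
          f"error of mean={mean - TRUE_VALUE:+.2f}")
```

Averaging many noisy but unbiased measurements lands close to the true value, while no amount of averaging removes the offset from the precise but biased ones.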

The values above are precise. That is, if you got re-tested on a different chip, the results aren’t going to be much different. The tests are using as input variation on 100,000 to 1 million markers, so a small proportion will give different calls than in the earlier test. But that’s not going to change the end result in most instances, even though these methods often have a stochastic element.

But what about accuracy? I am not sure that old chestnuts about accuracy apply in this case, because the percentages that these services provide are summaries and distillations of the underlying variation. The model of precision and accuracy that I learned would be more applicable to the DNA SNP array which returns calls on the variants; that is, how close are the calls of the variant to the true value (last I checked these arrays are around 99.5% accurate in terms of matching the true state).
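As a back-of-the-envelope check on what 99.5% means at this scale, here is the arithmetic in Python. It assumes, purely for illustration, that errors are independent and that the same accuracy applies to every marker; the function name is made up.

```python
def expected_miscalls(n_markers, call_accuracy):
    """Expected number of incorrect calls, assuming independent errors per marker."""
    return n_markers * (1.0 - call_accuracy)


# Marker counts spanning the 100,000 to 1 million range mentioned above.
for n_markers in (100_000, 1_000_000):
    print(n_markers, round(expected_miscalls(n_markers, 0.995)))
```

So even a chip this accurate returns hundreds to thousands of wrong calls, but spread over that many markers they barely move a genome-wide summary.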

What you see when these services pop out a percentage for a given ancestry is the outcome of a series of conscious choices that designers of these tests made keeping in mind what they wanted to get out of these tests. At a high level here’s what’s going on:

  1. You have a model of human population history and dynamics with various parameters
  2. You have data that varies that you put into that model
  3. You have results which come back with values which are the best fit of that data to the model you specified

Basically you are asking the computational framework a question, and it is returning its best answer to the question posed. To ask whether the answer is accurate or not is almost not even wrong. The frameworks vary because they are constructed by humans with different preferences and goals.

Almost, but not totally wrong. You can for example simulate populations whose histories you know, and then test the models on the data you generated. Since you already know the “truth” about the simulated data’s population structure and history, you can see how well your framework can infer what you already know from the patterns of variation in the generated data.
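Here is a minimal sketch of that simulate-then-test idea in Python. Everything in it is an assumption made for illustration: the allele frequencies are random numbers rather than real populations, the SNPs are treated as independent, and the estimator is a plain least-squares fit rather than the method any company or paper uses.

```python
import random

rng = random.Random(7)
N_SNPS = 20_000
TRUE_FRACTION = 0.20  # the known "truth": ancestry fraction from population 1

# Made-up allele frequencies for two source populations.
freq1 = [rng.uniform(0.05, 0.95) for _ in range(N_SNPS)]
freq2 = [rng.uniform(0.05, 0.95) for _ in range(N_SNPS)]


def simulate_genotypes(fraction):
    """Draw 0/1/2 genotypes for an individual whose allele copies each come from
    population 1 with probability `fraction`, otherwise from population 2."""
    genotypes = []
    for p1, p2 in zip(freq1, freq2):
        dosage = 0
        for _ in range(2):  # two allele copies per SNP
            p = p1 if rng.random() < fraction else p2
            dosage += 1 if rng.random() < p else 0
        genotypes.append(dosage)
    return genotypes


def estimate_fraction(genotypes):
    """Least-squares fit of genotype/2 = f*freq1 + (1 - f)*freq2 for f."""
    num = sum((g / 2 - p2) * (p1 - p2) for g, p1, p2 in zip(genotypes, freq1, freq2))
    den = sum((p1 - p2) ** 2 for p1, p2 in zip(freq1, freq2))
    return num / den


estimate = estimate_fraction(simulate_genotypes(TRUE_FRACTION))
print(f"true fraction: {TRUE_FRACTION:.2f}, recovered: {estimate:.3f}")
```

Because the true fraction was chosen up front, the gap between it and the recovered value is a direct measure of how well the framework does, at least under the conditions that were simulated.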

Going back to my results, why do my East Asian percentages vary so much? The short answer is that one of the major variables in the model alluded to above is the nature of the reference population set and the labels you give them.

Looking at Bengalis, the ethnic group I’m from, it is clear that in comparison to other South Asian populations they are East Asian shifted. That is, it seems clear I do have some East Asian ancestry. But how much?

The “simple” answer is to model my ancestry as a mix of two populations, an Indian one and an East Asian one, and then see what the values are for my ancestry across the two components. But here is where semantics becomes important: what is Indian and East Asian? Remember, these are just labels we give to groups of people who share genetic affinities. The labels aren’t “real”, the reality is in the raw read of the sequence. But humans are not capable of really getting anything from millions of raw SNPs assigned to individuals. We have to summarize and re-digest the data.

The simplest explanation for what’s going on here is that the different companies have different populations put into the boxes which are “Indian/South Asian” and “East Asian.” If you are using fundamentally different measuring sticks, then there are going to be problems with doing apples to apples comparisons.

My personal experience is that 23andMe tends to give very high percentages of South Asian ancestry for all South Asians. Because “South Asian” is a very diverse category when tests come back that someone is 95-99% South Asian…it’s not really telling you much. In contrast, some of the other services may be using a small subset of South Asians, who they define as “more typical”, and so giving lower percentages to people from Pakistan and Bengal, who have admixture from neighboring regions to the west and east respectively.*

Something similar can occur with East Asian ancestry. If the “donor” ancestral groups are South Asian and East Asian for me, then the proportion of each is going to vary by how close the donor groups selected by the company are to the true ancestral groups. If, for example, Family Tree DNA chose a more Northeastern Asian population than Ancestry DNA, then my East Asian percentage would vary between the two services because I know my East Asian ancestry is more Southeast Asian.
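The effect of the reference choice can be shown with a toy example in Python. This is a sketch under loud assumptions: the three "populations" are random allele frequencies with hypothetical labels, the "distant" reference is simply the true donor plus independent noise standing in for drift, and the fit is a basic least-squares mixture, not any company's actual pipeline.

```python
import random

rng = random.Random(1)
N_SNPS = 5_000
TRUE_FRACTION = 0.20  # hypothetical East Asian fraction

south_asian = [rng.uniform(0.25, 0.75) for _ in range(N_SNPS)]
true_donor = [rng.uniform(0.25, 0.75) for _ in range(N_SNPS)]
# Stand-in for a more distant reference: the true donor plus independent drift.
distant_ref = [p + rng.uniform(-0.2, 0.2) for p in true_donor]

# Expected allele frequencies of the admixed individual (no sampling noise).
individual = [TRUE_FRACTION * e + (1 - TRUE_FRACTION) * s
              for e, s in zip(true_donor, south_asian)]


def fitted_fraction(target, reference_a, reference_b):
    """Least-squares f in target = f*reference_a + (1 - f)*reference_b."""
    num = sum((t - b) * (a - b) for t, a, b in zip(target, reference_a, reference_b))
    den = sum((a - b) ** 2 for a, b in zip(reference_a, reference_b))
    return num / den


with_true_donor = fitted_fraction(individual, true_donor, south_asian)
with_distant_ref = fitted_fraction(individual, distant_ref, south_asian)
print(f"reference = true donor:    {with_true_donor:.3f}")
print(f"reference = distant proxy: {with_distant_ref:.3f}")
```

The individual is identical in both fits; only the measuring stick changed. In this toy the estimate shrinks when the reference is further from the true donor, but the direction and size of the shift in real data depend on the actual populations and methods involved.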

The moral of the story is that the values you obtain are conditional on the choices you make, and those choices emerge from the process of reducing and distilling the raw genetic variation into a manner which is human interpretable. If the companies decided to use the same model, they would come out with the same results.

* I helped develop an earlier version of MyOrigins, and so can attest to this firsthand.

When journalists get out of their depth on genetic genealogy

For some reason The New York Times tasked Gina Kolata to cover genetic genealogy and its societal ramifications, With a Simple DNA Test, Family Histories Are Rewritten. The problem here is that to my knowledge Kolata doesn’t cover this as part of her beat, and so isn’t well equipped to write an accurate and in-depth piece on the topic in relation to the science.

This is a general problem in journalism. I notice it most often when it comes to genetics (a topic I know a lot about for professional reasons) and the Middle East and Islam (topics I know a lot about because I’m interested in them). It’s unfortunate, but it has also made me a lot more skeptical of journalists whose track record I’m unfamiliar with.* To give a contrasting example, Christine Kenneally is a journalist without a background in genetics who nevertheless is immersed in genetic genealogy, so that she could have written this sort of piece without objection from the likes of me (she did write a book on the topic, The Invisible History of the Human Race: How DNA and History Shape Our Identities and Our Futures, which I had a small role in fact-checking).

What are the problems with the Kolata piece? I think the biggest issue is that she didn’t go in to test any particular proposition, and leaned on the wrong person for the science. She quotes Joe Pickrell, who knows this stuff like the back of his hand. But more space is given to Jonathan Marks, an anthropologist who is quite opinionated and voluble, and so probably a “good source” for any journalist.

Marks seems well respected in anthropology from what I can tell, but he’s also the person who put up a picture of L. L. Cavalli-Sforza juxtaposed with a photo of Josef Mengele in the late 1990s during a presentation at Stanford. Perhaps this is why anthropologists respect him, I don’t know, but I do not like him because of his nasty tactics (I wouldn’t be surprised if Marks had power he would make sure people like me were put in political prison camps, his rhetoric is often so unhinged).

Marks’ quotes wouldn’t be much of an issue if Kolata could figure out when he’s making sense, and when he’s just bullshitting. But she can’t. For example:

…“tells me I’m 95 percent Ashkenazi Jewish and 5 percent Korean, is that really different from 100 percent Ashkenazi Jewish and zero percent Korean?”

The precise numbers offered by some testing services raise eyebrows among genetics researchers. “It’s all privatized science, and the algorithms are not generally available for peer review,” Dr. Marks said.

The part about precise numbers is an issue, though a lot less of an issue with high density SNP-chips (the real issue is sensitivity to reference population and other such parameters). But if a modern test says you are 95 percent Ashkenazi Jewish and 5 percent Korean it really is different from 100% Ashkenazi. Someone who comes up as 5% Korean against an Ashkenazi Jewish background is most definitely of some East Asian heritage. In the early 2000s with ancestrally informative markers and microsatellite based tests you’d get somewhat weird results like this, but with the methods used by the major DTC companies (and in academia) today these sorts of proportions are just not reported as false positives. Marks may not know because this isn’t his area, but Pickrell would have. Kolata probably did not think to double-check with him, but that’s because she isn’t able to smell out tendentious assertions. She has no feel for the science, and is flying blind.

Second, Marks notes that the science is privatized, and it isn’t totally open. But it’s just false that the algorithms are not generally available for peer review. All the details of the pipeline are not downloadable on GitHub, but the core ancestry estimation methods are well known. Eric Durand, who wrote the original 23andMe ancestry composition methodology, presented on it at ASHG 2013. I know because I was there during his session.

You can find a white paper for 23andMe’s method and Ancestry’s. Not everything is as transparent as open science would dictate (though there are scientific papers and publications which also mask or hide elements which make reproducibility difficult), but most geneticists with domain experience can figure out what’s going on and if it is legitimate. It is. The people who work at the major DTC companies often come out of academia, and are known to academic scientists. This isn’t blackbox voodoo science like “soccer genomics.”

Then Marks says this really weird thing:

“That’s why their ads always specify that this is for recreational purposes only: lawyer-speak for, ‘These results have no scientific standing.’”

Actually, it’s lawyer-speak for “do not sue us, as we aren’t providing you actionable information.” Perhaps I’m ignorant, but lawyers don’t get to define “scientific standing”.

The problem, which is real, is that the public is sometimes not entirely clear on what the science is saying. This is a problem of communication from the companies to the public. I’ve even been in scientific sessions where geneticists who don’t work in population genomics have weak intuition on what the results mean!

Earlier Kolata states:

Scientists simply do not have good data on the genetic characteristics of particular countries in, say, East Africa or East Asia. Even in more developed regions, distinguishing between Polish and, for instance, Russian heritage is inexact at best.

This is not totally true. We have good data now on China and Japan. Korea also has some data. Using haplotype-based methods you can do a lot of interesting things, including distinguishing someone who is Polish from someone who is Russian. But these methods are computationally expensive and require lots of information on the reference samples (Living DNA does this for British people). The point is that the science is there. Reading this sort of article is just going to confuse people.

On the other hand a lot of Kolata’s piece is more human interest. The standard stuff about finding long lost relatives, or discovering your father isn’t your father. These are fine and not objectionable factually, though they’ve been done extensively before and elsewhere. I actually enjoyed the material in the second half of the piece, which had only a tenuous connection to scientific detail. I just wish these sorts of articles represented the science correctly.

Addendum: Just so you know, three journalists who regularly cover topics I can make strong judgments on, and are always pretty accurate: Carl Zimmer, Antonio Regalado, and Ewen Callaway.

* I don’t follow Kolata very closely, but to be frank I’ve heard from scientist friends long ago that she parachutes into topics, and gets a lot of things wrong. Though I can only speak on this particular piece.

The future will be genetically engineered


If the film Rise of the Planet of the Apes had come out a few years later I believe there would have been mention of CRISPR. Sometimes science leads to technology, and other times technology aids in science. On occasion the two are one and the same.

The plot I made above shows that in the first five years of the second decade of the 21st century CRISPR went from being an obscure aspect of bacterial genetics to ubiquitous. Friends who had been utilizing “advanced” genetic engineering methods such as TALENs and zinc fingers switched overnight to a CRISPR/Cas9 framework.

As I’ve said before the 2010s are the decade when “reading” the genome becomes normal. We really don’t know what the CRISPR/Cas9 technology is capable of. It’s early years yet. With that, First Human Embryos Edited in U.S. Technically they’re single-celled zygotes. The science itself is not astounding. Rather, it is that the human Rubicon has been crossed in the United States. As indicated in the article there has been some jealousy about what the Chinese have been able to do because of a different cultural and regulatory framework.

There are those calling for a moratorium on this work (on humans). I’m not in favor or opposed. Rather, my question is simple: if CRISPR/Cas9 makes genetic engineering cheap, easy, and effective, how exactly are we going to enforce a world-wide moratorium? A Butlerian Jihad?

Note: I know that people are freaking out about humans + genetic engineering. But most geneticists I know are more excited about the prospects of non-human work, since human clinical trials are going to be way in the future. Over 20 years since Dolly it’s notable to me that no human has been cloned from adult somatic cells yet.