Tutorial to run supervised admixture analyses

ID              Dai   Gujrati  Lithuanians  Sardinian  Tamil
razib_23andMe   0.14  0.26     0.02         0.00       0.58
razib_ancestry  0.14  0.26     0.02         0.00       0.58
razib_ftdna     0.14  0.26     0.02         0.00       0.57
razib_daughter  0.05  0.14     0.29         0.18       0.34
razib_son       0.07  0.17     0.28         0.19       0.30
razib_son_2     0.06  0.19     0.29         0.19       0.27
razib_wife      0.00  0.07     0.55         0.38       0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully, you’ll be able to run supervised admixture analysis with less hassle after reading this. Here I’m pretty much aiming for laypeople. If you are a trainee you need to write your own scripts. The main goal here is to allow people to run a lot of tests to develop an intuition for this stuff.

The above results are from a supervised admixture analysis of my family and myself. There are three replicates of me because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into three separate sets of plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies in results between DTC companies are due to their analytic pipelines, not data quality.

The results are not surprising. I’m about 14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At a low fraction, you expect some variance in the F1.
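To see how close the children land to a simple expectation, you can average the two parents component by component. The sketch below uses the fractions from the table at the top of this post; the plain midparent average is an assumption of the simplest possible inheritance model, not something the scripts compute.

```python
# Ancestry fractions copied from the table at the top of the post.
POPS = ["Dai", "Gujrati", "Lithuanians", "Sardinian", "Tamil"]
father = [0.14, 0.26, 0.02, 0.00, 0.58]   # razib_23andMe
mother = [0.00, 0.07, 0.55, 0.38, 0.00]   # razib_wife
children = {
    "razib_daughter": [0.05, 0.14, 0.29, 0.18, 0.34],
    "razib_son":      [0.07, 0.17, 0.28, 0.19, 0.30],
    "razib_son_2":    [0.06, 0.19, 0.29, 0.19, 0.27],
}


def midparent(a, b):
    """Average two parents' fractions, component by component."""
    return [round((x + y) / 2, 3) for x, y in zip(a, b)]


def deviations(observed, expected):
    """Observed minus expected fraction, per component."""
    return [round(o - e, 3) for o, e in zip(observed, expected)]


expected = midparent(father, mother)
print("expected", dict(zip(POPS, expected)))
for name, fractions in children.items():
    print(name, dict(zip(POPS, deviations(fractions, expected))))
```

The deviations are all within a few percentage points, which is the sort of scatter you would expect from recombination plus noise in the admixture estimates.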

Now below are results for three Swedes with the same reference panel:

Group   ID        Dai   Gujrati  Lithuanians  Sardinian  Tamil
Sweden  Sweden17  0.00  0.09     0.63         0.28       0.00
Sweden  Sweden18  0.00  0.08     0.62         0.31       0.00
Sweden  Sweden20  0.00  0.05     0.72         0.23       0.00

All of these were run in a supervised admixture framework in which I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils as the reference “ancestral” populations. Another way to think about it: given the genetic variation of these input groups, what fractions does a given test individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file. Decompress and put it somewhere you can find it.

Here is what the scripts do. First, imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such because I wanted to recognize the downstream files produced from each raw genotype. Leave the extensions as they are. Obviously, you need to make sure they are not compressed. Then place them all in RAWINPUT/.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP. The same file from the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this:
grep "Dai\|Guj\|Lithua\|Sardi\|Tamil" Est1000HGDP.fam > ../keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files, and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on the group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script sets the correct K by counting the unique population numbers.
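If you are curious what generating a .pop file involves, here is a minimal sketch. It is not the code in the zip file: it assumes the AA_Ref_ prefix described above, writes population names rather than the numbers my script uses, and marks test individuals with "-", which is how the ADMIXTURE manual says individuals of unknown ancestry should be flagged. The individual IDs in the example are made up.

```python
REF_PREFIX = "AA_Ref_"  # prefix that marks reference rows in the .fam file


def pop_labels(fam_lines, ref_prefix=REF_PREFIX):
    """Return one .pop label per .fam row: the population name for
    reference rows, '-' for individuals whose ancestry is to be estimated."""
    labels = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue
        family_id = fields[0]
        if family_id.startswith(ref_prefix):
            labels.append(family_id[len(ref_prefix):])
        else:
            labels.append("-")
    return labels


def infer_k(labels):
    """K is the number of distinct reference populations."""
    return len({label for label in labels if label != "-"})


# Hypothetical .fam rows, for illustration only.
example_fam = [
    "AA_Ref_Dai ref1 0 0 1 -9",
    "AA_Ref_Dai ref2 0 0 2 -9",
    "AA_Ref_Tamil ref3 0 0 1 -9",
    "razib_23andMe razib_23andMe 0 0 0 -9",
]
labels = pop_labels(example_fam)
print(labels, "K =", infer_k(labels))
```

The .pop file has to line up row for row with the .fam file, which is why the function keeps one label per individual in the original order.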

The admixture is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter so it can go multi-threaded.

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original file name you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided as a source for a population to test? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified it using the vim editor:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
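If you would rather not edit by hand, the same change can be scripted. This is a minimal sketch that assumes the prefix is exactly AA_Ref_ and that it only needs to come off the family ID in the first column; the function names and the example row are mine, not part of the zip file.

```python
def strip_ref_prefix(fam_lines, prefix="AA_Ref_"):
    """Remove the prefix from the family ID (column 1) of each .fam row."""
    cleaned = []
    for line in fam_lines:
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            fields[0] = fields[0][len(prefix):]
        cleaned.append(" ".join(fields))
    return cleaned


def rewrite_fam(path, prefix="AA_Ref_"):
    """Rewrite a .fam file in place with the prefix removed."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    with open(path, "w") as handle:
        handle.write("\n".join(strip_ref_prefix(lines, prefix)) + "\n")


# Hypothetical row, for illustration only.
print(strip_ref_prefix(["AA_Ref_Sweden Sweden17 0 0 0 -9"]))
```

To apply it to the Swedish file you would call rewrite_fam("INDIVPLINKFILES/Sweden.fam") from the ancestry_supervised/ folder. Keep a copy of the original .fam first, since the rewrite is in place.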

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file (the individuals are output to different plink files and merged in; you don’t need to know the details). If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember that for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

Again, here is the download to the zip files. And, remember, this only works on Ubuntu for sure (though now I hear it’s easy to run Ubuntu in Windows).

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command


Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig quickly into empirical data sets while they’re really getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three population and four population tests, and Eigensoft and AdmixTools, as well as fineStructure.
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux or Linux-like environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications to the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress into a folder, ancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just swap the Mac versions into the folder ancestry/. You may need some libraries installed on Ubuntu too if you recompile them yourself. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows, one for each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujarati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.).

grep "Yoruba\|Great\|Guj\|Han\|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.
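One thing to watch with grep is that it matches anywhere in the row, including the individual ID. If that ever bites you, a small script that only looks at the group column does the same job. This is a sketch, not part of the zip file; the function name is mine and the example rows are illustrative.

```python
import re


def select_rows(fam_lines, patterns):
    """Keep .fam rows whose family ID (column 1) contains any pattern.

    Returns "FID IID" strings, which is all plink's --keep needs."""
    matcher = re.compile("|".join(re.escape(p) for p in patterns))
    kept = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and matcher.search(fields[0]):
            kept.append(f"{fields[0]} {fields[1]}")
    return kept


# Illustrative rows; the last ID is hypothetical and shows why matching
# on column 1 only can matter.
rows = [
    "Druze HGDP00603 0 0 1 -9",
    "HanBeijing sample1 0 0 1 -9",
    "Abkhazian Han_lookalike 0 0 2 -9",
]
print(select_rows(rows, ["Yoruba", "Great", "Guj", "Han", "Druze"]))
```

Write the returned lines to keep.txt, one per line, and pass that file to plink with --keep exactly as above.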

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following command means that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst run automatically.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.

There is a .csv that has the admixture output, EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.
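As an example of processing it, here is a sketch that averages each K column within each group. It assumes the layout just described (group, individual, then K columns, space separated, no header); the function name is mine. The rows in the example are the ones shown above.

```python
from collections import defaultdict


def group_means(admix_lines):
    """Average each K column within each group (column 1)."""
    sums = {}
    counts = defaultdict(int)
    for line in admix_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        group = fields[0]
        values = [float(v) for v in fields[2:]]
        if group not in sums:
            sums[group] = [0.0] * len(values)
        sums[group] = [s + v for s, v in zip(sums[group], values)]
        counts[group] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


sample = """Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153""".splitlines()

means = group_means(sample)
for group, values in means.items():
    print(group, [round(v, 4) for v in values])
```

For the real file, read it with open("EstSubset.admix.csv") and pass the lines to group_means.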

You’ll also see two other files: plink.eigenval plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of the dimension. It looks like this:
68.7974
38.4125
7.16859
3.3837
2.05858
1.85725
1.73196
1.63946
1.56449
1.53666

Basically, this means that PC 1 explains nearly twice as much of the variance as PC 2 (about 1.8 times). Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.
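To put numbers on that, here is a short sketch using the ten eigenvalues shown above. Note the assumption baked in: the shares are relative to these ten PCs only, not to the total variance in the data, because plink was only asked for ten dimensions.

```python
# Eigenvalues copied from the plink.eigenval listing above.
eigenvalues = [68.7974, 38.4125, 7.16859, 3.3837, 2.05858,
               1.85725, 1.73196, 1.63946, 1.56449, 1.53666]

total = sum(eigenvalues)
# Share of each PC among the ten reported PCs (not of total variance).
shares = [value / total for value in eigenvalues]
ratio_pc1_pc2 = eigenvalues[0] / eigenvalues[1]

for index, share in enumerate(shares, start=1):
    print(f"PC{index}: {share:.3f}")
print(f"PC1 / PC2 = {ratio_pc1_pc2:.2f}")
```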

You need to install some packages. First, open R or R studio. If you want to go command line at the terminal, type R. Then type:
install.packages("ggplot2")
install.packages("reshape2")
install.packages("plyr")
install.packages("ape")
install.packages("igraph")

Once those packages are installed you can use the script:
source("RPCA.R")

Then, to generate the plot at the top of this post:
plinkPCA()

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. This second feature is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The >> operator appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!
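If eyeballing the plot gets tedious, you can get a first-pass list of candidates from the .eigenvec file. The sketch below is a crude rule of my own for illustration, not Eigensoft's procedure: it flags anyone more than a chosen number of standard deviations from their own group's mean on the selected PCs. It assumes the plink layout of family ID, individual ID, then one column per PC. The threshold of 3 is arbitrary, and small groups can never exceed it, so treat the output as suggestions to look at, not a verdict.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_outliers(eigenvec_lines, pcs=(0, 1), threshold=3.0):
    """Return (FID, IID) pairs lying more than `threshold` standard
    deviations from their group's mean on any of the chosen PCs.

    `pcs` holds zero-based PC indices, so (0, 1) means PC1 and PC2."""
    groups = defaultdict(list)
    for line in eigenvec_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            coords = [float(v) for v in fields[2:]]
        except ValueError:
            continue  # skip a header row if one is present
        groups[fields[0]].append((fields[1], coords))

    flagged = []
    for fid, members in groups.items():
        for pc in pcs:
            values = [coords[pc] for _, coords in members]
            centre, spread = mean(values), pstdev(values)
            if spread == 0:
                continue  # no variation on this PC within the group
            for iid, coords in members:
                too_far = abs(coords[pc] - centre) / spread > threshold
                if too_far and (fid, iid) not in flagged:
                    flagged.append((fid, iid))
    return flagged


def remove_file_lines(flagged):
    """Format flagged pairs as 'FID IID' rows for plink --remove."""
    return [f"{fid} {iid}" for fid, iid in flagged]
```

Read plink.eigenvec into a list of lines, pass it to flag_outliers, check the flagged individuals on the PCA plot, and only then append the rows from remove_file_lines to remove.txt.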

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:
source("fst.R")

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs for the blocks to be 1000. So -k 1000. As well as global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.

Then:
>source("src/plotting_funcs.R")
>plot_tree("TreeMix")

But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwise Fst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a layperson with a nagging question and can’t wait for me, this could get you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

Black ancestry in white Americans of colonial background

I stumbled upon striking photographs of “white slaves” while reading The United States of the United Races: A Utopian History of Racial Mixing. The backstory here is that in the 19th century abolitionists realized that Northerners might be more horrified as to the nature of slavery if they could find children of mostly white ancestry, who nevertheless were born to slave mothers (and therefore were slaves themselves). So they found some children who had either been freed, or been emancipated, and dressed them up in more formal attire (a few more visibly black children were presented for contrast).

This illustrates that the media and elites have been using this ploy for a long time. I am talking about the Afghan girl photograph, or the foregrounding of blonde and blue-eyed Yezidi children. Recently I expressed some irritation on Twitter when there was a prominent photograph of a hazel-eyed Rohingya child refugee being passed around. Something like 1 in 500 people in that region of the world has hazel eyes! That couldn’t be a coincidence. Race matters when it comes to compassion.

But this post isn’t about that particular issue…rather, the images of enslaved white children brought me back to a tendency I’ve seen and wondered about: the old stock white Americans whose DNA results suggest ~1% or less Sub-Saharan ancestry. These are not uncommon, and I’ve looked at several of them (raw data). I’m pretty sure the vast majority at the 0.5% or more threshold are true positives, and probably many a bit below this (in my experience people from England and Ireland don’t get 0.3% African “noise” estimates with the modern high-density marker sets).

According to 23andMe’s database about 1 out of 10 white Southerners has African ancestry at the 1% threshold. It would be even more if you dropped to closer to 0.5%. And the DNA ancestry here understates the extent of what was going on: at about 10 generations back you are about 50% likely to inherit zero blocks of genomic ancestry from a given ancestor (assuming no inbreeding in the pedigree obviously). And this is exactly when a lot of the ancestry that is being detected seems to have “entered” the white population. In other words, for every person who is 1% African and 99% white American, they have a sibling who is 0% African and 100% white American, even though genealogically they share the same ancestors. Dropping the threshold to closer to 0.3%, and considering that even in the South there was migration from the North, and to a lesser extent Europe, after the Civil War, I wouldn’t be surprised if models of admixture inferred from the distributions we see indicate that over half the lowland Southern white population likely had genealogical descent from a black slave.
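The “about 50%” figure can be checked with a back-of-envelope calculation. The sketch below uses a common approximation in which the number of autosomal blocks inherited from one specific ancestor is treated as Poisson. The 22 autosomes are a fact; the 33 crossovers per meiosis is a rough average I am assuming, and the model ignores sex differences in recombination and inbreeding in the pedigree.

```python
import math


def expected_blocks(generations_back, chromosomes=22, crossovers=33):
    """Approximate expected number of autosomal blocks inherited from one
    specific ancestor (parents = 1, grandparents = 2, and so on)."""
    meioses = generations_back - 1
    return (chromosomes + crossovers * meioses) / 2 ** meioses


def prob_no_dna(generations_back):
    """Poisson probability of inheriting zero blocks from that ancestor."""
    return math.exp(-expected_blocks(generations_back))


for generations in (6, 8, 10, 12):
    print(generations, round(prob_no_dna(generations), 3))
```

At 10 generations the expected number of blocks is under one, so the chance of inheriting nothing from that ancestor comes out a little above one half under these assumptions.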

This all comes to mind because there aren’t too many records of people “passing” during this period. Those who deal in genealogy and encounter these cases of low fractions, which are nevertheless likely not false positives, almost never find a “paper trail” when they go look. And they look really hard.

The reason is obvious in the context of American history. Thomas Jefferson’s slave Sally Hemings had three white grandparents and one African slave grandparent. Several of her children are recorded to have been totally European in appearance, and all except one passed into the white population (the two eldest married well into affluent white families in Washington D.C.). Passing as white was a way to escape the debilities of black status in the United States.

That being said, I think our Whig conception of the progressive nature of history sometimes misleads us into forgetting that the dynamics of race relations have had their ups and downs several times in the last few centuries in North America. If you read Daniel Walker Howe’s excellent What Hath God Wrought you observe that racial beliefs about the necessity and institutionalization of white supremacy in the early American republic evolved over time. Though the early republic would never be judged racially enlightened by modern lights, it was certainly far less explicitly racially conscious than what was the norm in the decades before the Civil War.

In particular, the rise of democratic populism during the tenure of Andrew Jackson was connected with much more muscular racial nationalism. To utilize a framework emphasized by David Cannadine in Ornamentalism, colonialism and Western civilization during the 19th and early 20th centuries can be viewed through the lens of race and class. Though the economic inequalities of American society persisted through the 19th century, men such as Andrew Jackson affected a more populist and rough-hewn persona than the aristocratic presidents of the early 19th century.* The white man’s republic had a leveling effect on the nature of elite culture.

But the attitudes toward racial segregation and mixing took decades to harden. Martin Van Buren’s vice president, Richard Mentor Johnson, was well known to have had a common-law wife, Julia Chinn, who was a slave. He recognized his two daughters by her. He was vice president from 1837-1841 in the more racist of the two American political parties of the time. It is hard to imagine this being a viable “lifestyle” choice for someone of this prominence in later decades (after Julia Chinn’s death Johnson continued to enter into relationships with slaves).

Walter F. White, a black leader of the NAACP

Which brings us back to what was happening in the decades around 1800. Racism was a fact of life, making passing necessary. But beliefs about racial purity and the one-drop rule had not hardened, so it would not be surprising to me if it was much easier for slaves or ex-slaves with mostly European ancestry to change their identity. Perhaps white Americans of that period were simply less vigilant about someone’s background because they were genuinely less concerned about the possibility that their partner may have had some black ancestry, so long as they looked white.

As the databases grow larger we’ll get a better sense of the demographic and genealogical dynamics. My suspicion is that we’ll see not so much a diminishment of gene flow into the black-identified community over the past 200 years as evidence that hypo-descent, the one-drop rule, became so powerful between 1850 and 1950 that passing declined, before rising again in the 1960s as whites became less vigilant due to decreased racism.

* As a middle class New Englander John Adams obviously was no aristocrat, but he was no populist either.

Carving nature at its joints more realistically

If you are working on phylogenetic questions on a coarse evolutionary scale (that is, “macroevolutionary,” though I know some evolutionary geneticists will shoot me the evil eye for using that word) generating a tree of relationships is quite informative and relatively straightforward, since it has a comprehensible mapping onto what really occurred in nature. When your samples are different enough that the biological species concept works well and gene flow doesn’t occur between nodes, then a tree is a tree (one reason Y and mtDNA results are so easy to communicate to the general public in personal genomics).

Everything becomes more problematic when you are working on a finer phylogenetic scale (or in taxa where inter-species gene flow is common, as is often the case with plants). And I’m using problematic here in the way that denotes a genuine substantive analytic issue, as opposed to connoting something that one has moral or ethical objections to.

It is intuitively clear that there is often genetic population structure within species, but how to summarize and represent that variation is not a straightforward task.

In 2000 the paper Inference of Population Structure Using Multilocus Genotype Data in Genetics introduced the sort of model-based clustering most famously implemented with Structure. The paper illustrates limitations with the neighbor-joining tree methods which were in vogue at the time, and contrasts them with a method which defines a finite set of populations and assigns proportions of each putative group to various individuals.

The model-based methods were implemented in numerous packages over the 2000s, and today they’re pretty standard parts of the phylogenetic and population genetic toolkits. The reason for their popularity is obvious: they are quite often clear and unambiguous in their results. This may be one reason that they emerged to complement more visual methods like PCA and MDS, which make fewer a priori assumptions.

But of course, crisp clarity is not always reality. Sometimes nature is fuzzy and messy. The model-based methods take inputs and will produce crisp results, even if those results are not biologically realistic. They can’t be utilized in a robotic manner without attention to the assumptions and limitations (see A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots).

This is why it is exciting to see a new preprint which addresses many of these issues, Inferring Continuous and Discrete Population Genetic Structure Across Space*:

A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure….

The whole preprint should be read for anyone interested in phylogenomic inference, as there is extensive discussion and attention to many problems and missteps that occur when researchers attempt to analyze variation and relationships across a species’ range. Basically, the sort of thing that might be mentioned in peer review feedback, but isn’t likely to be included in any final write-ups.

As noted in the abstract the major issue being addressed here is the problem that many clustering methods do not include within their model the reality that genetic variation within a species may be present due to continuous gene flow defined by isolation by distance dynamics. This goes back to the old “clines vs. clusters” debates. Many of the model-based methods assume pulse admixtures between population clusters which are random mating. This is not a terrible assumption when you consider perhaps what occurred in the New World when Europeans came in contact with the native populations and introduced Africans. But it is not so realistic when it comes to the North European plain, which seems to have become genetically differentiated only within the last ~5,000 years, and has likely seen extensive gene flow.

The figure below shows the results from the conStruct method (left), and the more traditional fastStructure (right):

There are limitations to the spatial model they use (e.g., ring species), but that’s true of any model. The key is that it’s a good first step to account for continuous gene flow, and not shoehorning all variation into pulse admixtures.

Though in beta, the R package is already available on github (easy enough to download and install). I’ll probably have more comment when I test drive it myself….

* I am friendly with the authors of this paper, so I am also aware of their long-held concerns about the limitations and/or abuses of some phylogenetic methods. These concerns are broadly shared within the field.

Why humans have so many pulse admixtures

The Blank Slate is one of my favorite books (though I’d say The Language Instinct is unjustly overshadowed by it). There is obviously a substantial biological basis in human behavior which is mediated by genetics. When The Blank Slate came out in the early 2000s one could envisage a situation in 2017 when empirically informed realism dominated the intellectual landscape. But that was not to be. In many ways, for example in sex differences, we’ve gone backward, while there is still undue overemphasis in our society on the environmental impact parents have on children (as opposed to society more broadly).

But genes do not determine everything, obviously. Several years after reading The Blank Slate I read Not by Genes Alone: How Culture Transformed Human Evolution. In this work Peter Richerson and Robert Boyd outline their decades long project of modeling cultural variation and evolution formally in a manner reminiscent of biological evolution. Richerson and Boyd’s program does not start from a “blank slate” assumption. Rather, it is focused on broad macro-social dynamics where cultural variation “swamps” out biological variation.

Recall that in classic population genetic theory a major problem with group level selection is that gene flow between adjacent groups quickly removes between group variation. One migrant between two groups per generation is enough for them not to diverge genetically. For group selection to occur the selective effect has to be very strong or the between group difference has to be very high. Rather than talking about genetics though, where the debate is still live, and the majority consensus is still that biological group selection is not that common (depending on how you define it), let’s talk about human culture.
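The “one migrant per generation” rule of thumb comes from Wright’s island model, where at equilibrium Fst is approximately 1 / (4Nm + 1) and Nm is the number of migrants per generation. The sketch below just evaluates that formula; it assumes neutral loci, diploids, many equal-sized demes, and equilibrium, none of which holds exactly in real populations.

```python
def equilibrium_fst(migrants_per_generation):
    """Wright's island-model approximation: Fst ~ 1 / (4 * N * m + 1),
    where N * m is the number of migrants per generation."""
    return 1.0 / (4.0 * migrants_per_generation + 1.0)


for nm in (0.1, 1, 10):
    print(nm, round(equilibrium_fst(nm), 3))
```

With one migrant per generation the expected Fst is capped at 0.2 regardless of population size, and it falls off quickly as migration rises, which is why between-group genetic variation is so easily eroded.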

Here the group level differences are extreme and the boundaries can be sharp. Historically it seems likely that most groups which were adjacent to each other looked rather similar because of gene flow and similar selective pressures. Even though in medieval Spain there was a generality, probably true, that Muslims were swarthier than Christians*, there was a palpable danger in battle of misidentifying friend and foe because the two groups overlapped too much in appearance.

This brings up how one might delineate differences culturally. In battle opposing armies wear distinct uniforms and colors so that the distinction can be made. But obviously one can change uniforms surreptitiously (perhaps taking the garb from the enemy dead). This is why physical adornments such as tattoos are useful, as they are “hard to fake.” Perhaps the clearest illustration of this dynamic is the Biblical story for the origin of the term shibboleth. Even slight differences in accent are clear to all, and often difficult to mimic once in adulthood.

Biological evolution mediated through genes is relatively slow and constrained compared to cultural evolution. Whole regions of central and northern Europe shifted from adherence to Roman Catholicism to forms of Protestantism on the order of 10 years. Of course religion is an aspect of culture where change can happen very rapidly, but even language shifts can occur in only a few generations (e.g., the decline of regional German and Italian dialects in the face of standard forms of the language).

Cultural evolution as a formally modeled neofunctionalism is credibly outlined in works such as Peter Turchin’s Ultrasociety: How 10,000 Years of War Made Humans the Greatest Cooperators on Earth. That’s not what I want to focus on here. Rather, I contend that the reality of massive pulse admixtures evident in the human genome over the past 10,000 years, at minimum, is a function of the fact that human cultural evolutionary processes result in winner-take-all genetic consequences.

A concrete example of what I’m talking about would compare the peoples of the Italian peninsula and the Iberian peninsula around 1500. The two populations are not that different genetically, and up to that point shared many cultural traits (and continue to do so). But a combination of geography and history resulted in Iberian demographic expansion in the several hundred years after 1500, whereby today there are probably many more descendants of Iberians than Italians. This is not a function of any deep genetic difference between the two groups. There aren’t deep genetic differences in fact. Rather, the social and demographic forces which propelled Iberia to imperial status redounded upon the demographic production of Iberians in the future. In addition, the New World underwent a massive pulse admixture between Iberians and native Amerindians, as well as Africans, usually brought over as slaves, due to the cultural and political history of the period.

The pulse admixture question is rather interesting academically. To some extent current methods are biased toward detection of pulse admixtures, and even fit continuous gene flow as pulse admixtures. A rapid exchange of gene flow, followed by recombination breaking apart associations between markers on ancestrally informative haplotypes, is something you can test for. But I think we can agree that the gene flow triggered by the Columbian Exchange was a pulse admixture, and there’s too much concurrent evidence from uniparental lineage turnover in the ancient DNA to dismiss the non-historically corroborated signatures of pulses as simply artifacts.

Nevertheless continuous gene flow does occur. That is, normal exchange of individuals between neighboring demes as a slow simmer over time. But the idea that we are a clinal ring species or something like that isn’t right in my opinion. Part of the story is strong geographical barriers. But another major part is that cultural revolutions and advantages introduce huge short-term demographic advantages to particular groups, and the shake out of inter-group competition can be dramatic.

Therefore, I make a prediction: the more cultural evolutionary dynamics a species is subject to, the more pulse admixture you’ll be able to detect. For example, pulse admixture should be more important in social insects than their solitary relatives.

* Not only was some of the ancestry of Muslims North African, Muslim rule was longest in the southern and southeastern regions, where people were not as fair as in the north.