
Tutorial to run supervised admixture analyses


This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully, you’ll be able to run supervised admixture analyses with less hassle after reading this. I’m mostly aiming for laypeople here; if you are a trainee, you should write your own scripts. The main goal is to let people run a lot of tests and develop an intuition for this sort of analysis.

The above results are from a supervised admixture analysis of my family and myself. There are three replicates of me because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into three separate sets of plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies’ results are due to their analytic pipelines, not data quality.

The results are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction, you expect some variance among the F1.

Now below are results for three Swedes with the same reference panel:


All these were run on supervised admixture frameworks where I used Dai, Gujarati, Lithuanians, Sardinians, and Tamils as the reference “ancestral” populations. Another way to think about it: given the genetic variation of these input groups, what fractions does a given focal test individual shake out at?

The commands are rather simple. For my family:
bash TestScript

For the Swedes:
bash Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file. Decompress and put it somewhere you can find it.

Here is what the scripts do. First, imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names will be used to generate IDs. So above, I renamed them “razib_23andMe.txt” and such because I wanted to recognize the downstream files produced from each raw genotype. Leave the extensions as they are. You need to make sure they are not compressed, obviously. Then place them all in RAWINPUT/.
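As a sketch of the rename step, with a made-up export name (the actual file name from your download will differ):

```shell
# Hypothetical raw-export name; the text before .txt becomes the
# plink family/individual ID downstream, so rename it intelligibly.
mkdir -p RAWINPUT
touch genome_demo_export.txt                  # stand-in for a real raw file
mv genome_demo_export.txt RAWINPUT/razib_demo_23andMe.txt
ls RAWINPUT/
```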

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.

Aside from the raw genotype files, you need a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP. This is the same file from the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file, so you need to create a subset.

For the above I did this:
grep "Dai\|Guj\|Lithua\|Sardi\|Tamil" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and tests it against the reference population file. That file’s group/family IDs carry a prefix of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are generated automatically, and the script infers the correct K by counting the unique reference populations.
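A minimal sketch of what the .pop generation amounts to, using a toy .fam file (the real script’s internals may differ, but this is the logic):

```shell
# Toy merged .fam: reference rows carry the AA_Ref_ prefix on the
# family ID, test individuals do not.
cat > demo.fam <<'EOF'
AA_Ref_Dai DAI001 0 0 0 -9
AA_Ref_Dai DAI002 0 0 0 -9
AA_Ref_Sardinian SAR001 0 0 0 -9
razib_23andMe razib_23andMe 0 0 0 -9
EOF

# Reference rows keep their population label; test individuals get "-",
# which tells admixture to estimate (not fix) their ancestry.
awk '{ if ($1 ~ /^AA_Ref_/) print $1; else print "-" }' demo.fam > demo.pop

# K = number of distinct reference populations
K=$(grep -v '^-$' demo.pop | sort -u | wc -l | tr -d ' ')
echo "K=$K"
```

Here K comes out as 2, since the toy file has two distinct reference populations.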

The admixture step is going to be slow. I recommend modifying the script to pass admixture’s -jN flag (e.g., -j4 for four threads) so it runs multi-threaded.

When the script is done, it will put the results in RESULTFILES/. They will be .csv files with strange names (they carry the original file name you provided, plus timestamps, so that if you rerun with a different reference you won’t overwrite everything). Each individual is run separately and gets a separate output file (a .csv).
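If you want to sanity-check a result file, the ancestry fractions on each row should sum to ~1. A sketch with a made-up CSV layout (the real column names and order may differ):

```shell
# Hypothetical result layout: one ID column, then one fraction per
# reference population.
cat > demo_result.csv <<'EOF'
id,Dai,Gujarati,Lithuanian,Sardinian,Tamil
razib_23andMe,0.14,0.70,0.02,0.01,0.13
EOF

# Sanity check: fractions on each data row should sum to ~1
awk -F, 'NR>1 { s=0; for (i=2; i<=NF; i++) s+=$i; printf "%s %.2f\n", $1, s }' demo_result.csv
```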

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided as a source for a population to test? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified the file using the vim editor:
vim Sweden.fam

You can do it with any text editor; it doesn’t matter, though it has to be the .fam file you edit.
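If you’d rather not edit by hand, sed can strip the prefix non-interactively. A toy sketch below; on the real data you would run the same sed over Sweden.fam:

```shell
# Toy .fam rows as they come out of Est1000HGDP
cat > Sweden_demo.fam <<'EOF'
AA_Ref_Swede SWD001 0 0 0 -9
AA_Ref_Swede SWD002 0 0 0 -9
EOF

# Strip the AA_Ref_ prefix from the family-ID column so the script
# treats these rows as test individuals, not reference samples
sed 's/^AA_Ref_//' Sweden_demo.fam > Sweden_demo.fam.tmp
mv Sweden_demo.fam.tmp Sweden_demo.fam
head -n 1 Sweden_demo.fam
```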

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is still tested separately; the script appends each result to the file (the individuals are output to different plink files and merged in; you don’t need to know the details). If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added, to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember, on a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

Again, here is the download link for the zip file. And remember, this only works on Ubuntu for sure (though now I hear it’s easy to run Ubuntu in Windows).

One thought on “Tutorial to run supervised admixture analyses”

  1. This is pretty cool to explore, thank you!

    Will there be a post(s) about what machines we can buy to get the raw data, and the analysis needed for the inputs for this?

    In practice, are you usually curious about/do more than 10 dimensions at a time? Is the limitation in the PCA, the k-means algos used, or somewhere else?
