Ancestry analysis quickstart

Over the years I have posted periodic tutorials on how to do some simple admixture analysis. Initially this was to foster the growth of “genome blogging”, but that’s basically dead along with blogging as a whole (Eurogenes being the primary exception here).

But, unexpectedly, it turns out a lot of baby-academics find my tutorials early in graduate school or when they are trying to transition into ancestry inference in population genetics. I have lots of scripts I use myself, but they are not very organized or clear to others. I did, however, create two tutorials for simple pipelines that seem to be useful to others, judging by how many emails I get.

The reason I get emails is that I occasionally delete the scripts in the process of housekeeping… but I’ve now created a folder, “tutorials”, that I will make sure NOT to delete. So all the files are now available at the links below:

Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command

Tutorial To Run Supervised Admixture Analyses

In the near future, I’m going to clean up and post more scripts (e.g., some I use to make qpAdmin outputs earlier). When I do, I will update this particular blog post, since I think these posts are more useful/relevant for people doing web searches than for regular readers.

4 thoughts on “Ancestry analysis quickstart”

  1. Will deffo look into it Razib. But honestly,

    1) Which ethnicity are Bengalis most akin to outside of South Asia?
    2) Which ethnicity are they most distant from outside SA?

    It’s about time you did an article on Bengali genetics. It’s been too long.

  2. Related to topic –

    https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa520/5838185 “Efficient toolkit implementing best practices for principal component analysis of population genetic data” – May 2020 – a paper using the UK BioBank dataset to cover best practices for minimizing the effects on PCA of LD blocks*, shrinkage from projection**, and under/over-sampling.

    * Lots of dimensions in the raw UK BioBank PCA are dominated by this – “We find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure”. Some papers that use very high-dimensional PCA to control for UK BioBank structure when looking at traits / selection might need to look at this.

    ** Shrinkage from projection is addressed by projecting on a retained subset of UK BioBank and also by 1000 Genomes samples, to quantify shrinkage. They also do the reverse in supplements and project UK BioBank onto 1000 Genomes.

    Fig S6 shows, interestingly, that relatively few parts of the 1000 Genomes samples’ PCA space are not covered by UK BioBank, though of course there are some… Something of a commentary on how very large samples can cover diversity. UK BioBank seems to include some (very few) individuals who are clinal between some points in 1000 Genomes PCA space that are not covered by 1000 Genomes samples.

    Fig S7 shows that there are some parts of UK BioBank computed space that aren’t covered by 1000 Genomes as well.
