
Unleash the data kraken!

The Reich lab has done a mitzvah and released a huge merged dataset of their modern and ancient populations in a big tarball. Actually, there are two files. One of them has a larger number of individuals with 600,000 SNPs (it includes the “Human Origins Array”), and the other has 1,200,000 SNPs but fewer individuals. Both are in EIGENSTRAT format.

For the convenience of readers who are more comfortable in PLINK/PEDIGREE format, I’ve converted them, and replaced the family ID column with population labels. The links take you to a zip file that has the three files for the binary format.


4 thoughts on “Unleash the data kraken!”

  1. Reich’s files have Tianyuan’s Y haplogroup… finally.
    It is K2b, confirmed through at least 2 SNPs so it is almost certain.
    Somewhat unlikely to be K2b2 because 5 SNPs on K2b2* were all ancestral.
    The files did not have most SNPs on the K2b1* branch so it is inconclusive.

    I wonder if it is possible to merge and filter these files so that rs numbers are shown along with allele states right by them and one does not have to run plink and look through .ped files.

  2. One ‘hack’ I do sometimes is just get the rsID I want: use --extract on just that rsID and then have a .ped file with only that rsID.

  3. I have been using --extract already; without it it would be nearly impossible.
    And I made a .bat file so that I click once and everything is done.

    Still it is hard because .ped file messes up the order of SNPs. This is very annoying when you have more than 5 SNPs to look up. I could just pre-order SNPs from the lowest to the highest but that is a lot of work too.

    Most SNPs in these files don’t even have the rs number. Typically I see things like snp_24_28711417. 24 means Y-chromosome and the number next is the position number.
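
A minimal Python sketch of splitting such positional IDs into chromosome and position. It assumes the `snp_<chromosome>_<position>` pattern described above holds for every ID without an rs number; the function name and the lookup table are illustrative, not part of any tool.

```python
import re

# Numeric codes for the sex chromosomes as described above (24 = Y); 23 = X.
CHROM_NAMES = {23: "X", 24: "Y"}

# Assumed ID pattern: snp_<chromosome code>_<base-pair position>
_POSITIONAL_ID = re.compile(r"^snp_(\d+)_(\d+)$")


def parse_positional_id(snp_id):
    """Return (chromosome, position) for IDs like snp_24_28711417, else None."""
    match = _POSITIONAL_ID.match(snp_id)
    if match is None:
        return None  # e.g. an ordinary rs number
    chrom = int(match.group(1))
    position = int(match.group(2))
    return CHROM_NAMES.get(chrom, str(chrom)), position
```

The position is only meaningful relative to the genome build the dataset uses, so it should be compared against a SNP list on the same build.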

    It will take a lot of work to figure out this guy’s phylogenetic position other than just K2b. Things should be put into Excel or something and I may have to use things like Qbasic to automate the process. This is the first time I have done this sort of thing in genetics so I am not familiar with useful programs that may expedite the process.

    The processes to automate are
    1. Find SNPs that are on
    K2b* branch
    K2b1* branch
    K2b2* branch
    either rs numbers or position numbers at the least.
    2. see if they are in the Reich file by checking .bim file.
    3. make the list to use with --extract
    4. determine which ones are ancestral, derived or no result (0,0).
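
Steps 2 to 4 could be sketched in Python roughly as follows. This is a sketch under assumptions: the .bim file is whitespace-separated with the usual six columns, Y-chromosome rows carry the code 24 (or Y), and which allele is ancestral and which is derived comes from whatever haplogroup tree you are working from. All function names are made up for illustration.

```python
Y_CODES = ("24", "Y")


def read_bim(lines):
    """Parse .bim lines: chromosome, SNP ID, genetic distance, bp position, allele 1, allele 2."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, snp_id, _cm, pos = fields[:4]
        rows.append((chrom, snp_id, int(pos)))
    return rows


def snps_present(bim_rows, wanted):
    """Step 2: map each wanted Y position (int) or SNP ID (str) to the ID used in the .bim."""
    by_pos = {pos: snp_id for chrom, snp_id, pos in bim_rows if chrom in Y_CODES}
    ids = {snp_id for _chrom, snp_id, _pos in bim_rows}
    found = {}
    for key in wanted:
        if isinstance(key, int) and key in by_pos:
            found[key] = by_pos[key]
        elif key in ids:
            found[key] = key
    return found


def write_extract_list(snp_ids, path):
    """Step 3: write one SNP ID per line, the list format --extract reads."""
    with open(path, "w") as handle:
        for snp_id in snp_ids:
            handle.write(snp_id + "\n")


def classify(alleles, ancestral, derived):
    """Step 4: label one genotype call, given as a pair of allele letters ("0" = missing)."""
    called = {allele for allele in alleles if allele != "0"}
    if not called:
        return "no call"
    if called == {derived}:
        return "derived"
    if called == {ancestral}:
        return "ancestral"
    return "ambiguous"  # mixed call, or an allele matching neither state
```

Looking SNPs up by position as well as by ID matters here because many entries in these files have no rs number.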

    You, Razib, should be interested in this possible collateral ancestor (“great uncle” or a distant patrilineal cousin of your direct ancestor).

  4. I decided to use a 5-man array
    Ust Ishim
    P guy
    M guy
    S guy

    P M S dudes just need to be as complete as possible preferably from WGS so that they do not have any missing SNP calls.
    This array for each locus completely determines the phylogenetic position of Tianyuan man. One does not have to know the rs number etc. It does not matter even if the SNP is not even registered.
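
One possible reading of that per-locus comparison, as a Python sketch: at each locus, report which reference samples carry the same allele as the target. It assumes one allele letter per sample per locus (as for a Y-chromosome call) and "0" for no call; the function name and the "informative locus" rule are illustrative assumptions, not an established method.

```python
def matching_references(target, references):
    """For one locus, return the names of reference samples whose allele equals the target's.

    `target` is the target sample's allele; `references` maps sample name to allele.
    Returns None when the locus is uninformative: the target has no call, or all
    called references carry the same allele.
    """
    if target == "0":
        return None
    called = {name: allele for name, allele in references.items() if allele != "0"}
    if len(set(called.values())) < 2:
        return None
    return sorted(name for name, allele in called.items() if allele == target)
```

For example, a locus where only the P reference shares the target's allele would return `["P"]`, while a locus where every reference agrees is skipped.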

    How stupid of me not to have come up with this until now.
    The only problem is to view a .ped file that has a humongous number of columns. I have been using Wordpad. That will look horrible.
    Any help on how to view .ped will be greatly appreciated.
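
One way around the wide .ped layout is to reshape it into one row per individual per SNP, pairing each genotype with its SNP ID from the matching .map file. A minimal Python sketch, assuming the standard text layout (six leading columns in the .ped followed by two alleles per SNP, listed in .map order) and that the family ID column holds the population label as in the converted files; the function names are made up.

```python
import csv


def ped_to_rows(map_lines, ped_lines):
    """Yield (population, individual, SNP ID, allele 1, allele 2), one row per call."""
    # The SNP ID is the second column of each .map line.
    snp_ids = [line.split()[1] for line in map_lines if line.strip()]
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue
        family_id, individual_id = fields[0], fields[1]
        alleles = fields[6:]  # two alleles per SNP, in .map order
        for i, snp_id in enumerate(snp_ids):
            yield family_id, individual_id, snp_id, alleles[2 * i], alleles[2 * i + 1]


def write_table(map_path, ped_path, out_path):
    """Write the reshaped calls as a tab-separated file that opens cleanly in a spreadsheet."""
    with open(map_path) as map_file, open(ped_path) as ped_file, \
            open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t")
        writer.writerow(["population", "individual", "snp", "allele1", "allele2"])
        writer.writerows(ped_to_rows(map_file, ped_file))
```

Because every row carries its own SNP ID, the column order of the .ped no longer matters, and the result can be sorted or filtered in Excel.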

Comments are closed.