DIY your DTC DNA

The screenshot to the right is from my updated 23andMe results. The inference that my ancestry is from “Chittagong Division” is 100% correct. More precisely, my family is from Comilla. There are some cases where consumer genomics tells you exactly what you know, and this is one of those. 23andMe has an excellent method to infer ancestry, and the power of a massive database. If you want to see how they do it, check out their new preprint. It’s pretty fancy.

What would be more informative is if they let me see what it means to be from Chittagong Division compared to other Bengalis. Or to be Bengali compared to other South Asians. You probably already know your recent history through basic family genealogy, but what do these results tell you about your deep history and your relatedness to global populations?

Of course, realistically even the largest direct-to-consumer genomics companies can only deliver so much, because they are simultaneously serving millions of customers. A custom approach isn’t a feasible ask, even if it is what many consumers are longing for. Little surprise then that people have been reaching out to me about anomalies in their results for over a decade. People like me who got no value-added information (as in: they already know where their great-grandparents were born) reach out too. We’re driven to know more.

That is why this coming week I’m offering a first workshop through Speakeasy titled Analyze Your 23andMe and Ancestry Data (and N.B. it didn’t make the title, but I’ve prepared everything to work right on results from Family Tree DNA as well). It’s Wednesday, Jan 27, 5 pm PT/8 pm ET.

Here’s how it will work. You’ll arrive in class with your (or a friend’s or relative’s) 23andMe, Ancestry or Family Tree DNA raw data. Before class you’ll get a zip folder with all the files and utilities you need for class. You’ll have downloaded R and RStudio if you don’t already have them (I include instructions, but don’t worry, this is quick and easy!). And you’ll want to decide whether to Zoom into the meeting on one device and access your data on the other (or work on a two-screen setup if you’re on a desktop); neither is essential, but I’d consider them nice to have.

At that point you’re ready for the workshop. And we’ll get straight into digging into your data. You can come to class with a question or questions about your ancestry. Or I can help you zero in on what might be interesting given what you already know.

Over the course of the hour, I’ll guide you through the use of three tools. No lecture, just hands-on doing, with your own actual data. Two of your tools are open-source utilities written by academics for their colleagues that have been in wide use in genomics for over a decade. Even long-time readers who aren’t here for the genetics will recognize Plink and Admixture, which I’ve referenced on this blog thousands of times. 

In addition to easing you right into using these core tools of the trade (without any of the usual slow initial learning curve), I have built an automated pipeline just for participants in the class. This is your third tool, which will save you untold startup hours no matter who you are. I’ve created an automated workflow so that you can 1. input your raw data from any of those three DTC genetics companies and analyze it (including automatically generating formatted output) against your choice of reference populations (a library of which I’ve also prepared for you) and 2. automatically plot and visualize your results in a flexible, customizable format.

I have written all the scripts for you in order to create a custom, automated pipeline. This draws on my years of experience using these tools and guiding others. Then, the bespoke reference population library of human genomes I’ve curated for this purpose instantly equips you to measure your relatedness to any branch or branches of humanity (you get 5000 human genotypes culled from public datasets and selected to represent 250 distinct populations, on a quarter of a million markers, or SNPs).
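
To make the input step concrete, here is a minimal Python sketch of reading a 23andMe-style raw data file and counting how many called markers overlap a reference panel. It is not the workshop pipeline, only an illustration under stated assumptions: the file is tab-separated with "#" comment lines and four columns (rsid, chromosome, position, genotype), no-calls contain "-", and the function names, rsIDs and genotypes below are made up. Exports from other companies use different layouts and would need their own parser.

```python
import io


def parse_23andme(handle):
    """Yield (rsid, chromosome, position, genotype) from 23andMe-style raw data.

    Assumes tab-separated columns and '#' comment lines (an assumption about
    the export layout, not something taken from the workshop materials).
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not match the assumed layout
        rsid, chrom, pos, genotype = fields
        yield rsid, chrom, int(pos), genotype


def summarize(records, reference_rsids):
    """Count total markers, called markers, and called markers in the panel."""
    total = called = shared = 0
    for rsid, _chrom, _pos, genotype in records:
        total += 1
        if "-" in genotype:
            continue  # treat any '-' as a no-call
        called += 1
        if rsid in reference_rsids:
            shared += 1
    return {"total": total, "called": called, "shared_with_reference": shared}


# Toy data: every rsID, position and genotype here is invented.
toy_raw = "\n".join([
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0001\t1\t1000\tAA",
    "rs0002\t1\t2000\tAG",
    "rs0003\t2\t3000\t--",
    "rs0004\tX\t4000\tC",
])
toy_reference = {"rs0001", "rs0003", "rs0004"}

summary = summarize(parse_23andme(io.StringIO(toy_raw)), toy_reference)
print(summary)
```

The overlap count matters because only markers present in both your file and the reference panel can be used in a joint analysis.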

And in class I will teach you and guide you through using your toolkit in real time. Getting started in Plink and Admixture can entail hours of trouble-shooting and false starts. A decade into using them, I know the quirks and idiosyncrasies of these programs all too well; and that’s why I’ve built a pipeline that allows you to leapfrog over those slow early steps and get right into your (or any) genetic data. Building and merging a reference panel from publicly available sources is time-consuming and a headache and the individuals don’t come clearly labeled by population. I’ve got everything ready and clearly identified for you. With these two headstarts, and lots of pointers about best practices along the way, you’ll be asking and answering (and outputting visualizations of) your own questions in your first hour.

By the end of the workshop, you’ll have the skills and the tools to analyze genotypes against world population data. You’ll be able to use Plink, Admixture and the pipeline I’ve created for you on any DTC genomic results from those main three companies. You’ll have both the curated reference library and the pipeline I built for you… And the know-how to use them to your ends. You’ll also have reference cheat-sheets to remember how to do everything you tried in the workshop (I don’t want you having to take notes when you can be learning by doing!) 

Who is this for? You. I promise. My goal with this project is to make it accessible and easy for everyone with basic personal computing literacy. Not programming, not command line, not R. Just be comfortable on your computer. (And for this iteration, you need to be on a Mac or Ubuntu/Linux OS. I’m still working out a kink in Windows, so DM me or comment below if you’d like to trial it on Windows once I get that working.) 

I want to reach people who aren’t geneticists. I want to reach people who think they can’t do this. I want to show curious people who have never heard of any of the tools I’m naming that they can still delve into their DNA on their own terms the first time they sit down and try. (I did a trial run of the course with a crew of friends recently. Everyone did great, including the two who were anxious beforehand. And let’s be real: if you’re thinking you’re not tech-savvy enough, it probably says less about your actual tech skills than it does about your friends/family and how tech-nerdy they are!)

But enough about you. Let’s get back to me and what I did with my 23andMe results. To answer the question of what it means to be “Bengali from Chittagong” I analyzed myself against only a few populations and compared myself to the Bengalis in the 1000 Genomes. 

The PCA immediately shows that as someone from eastern Bengal, I’m on the edge of the eastern Bengali cluster. I have a lot of East Asian ancestry. To be from Chittagong Division means to be 10-20% East Asian.

Let’s rerun this with more South Asian populations for context and zoom in:

Again, you see I’m on the edge of the Bengali cluster. The same general result, but a different context. You can see here that Bengalis are somewhat different from Tamils, but closer to them than other Indian populations in the west and north of the subcontinent.
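
One hedged way to put a number on "on the edge of the cluster" is to measure each sample's distance from the cluster centroid on the first two principal components. The sketch below assumes PCA output in a PLINK 1.9-style .eigenvec layout (family ID, individual ID, then PC1, PC2, ... separated by whitespace); the sample IDs and coordinates are invented, and the function names are hypothetical, not part of any tool mentioned here.

```python
import math


def read_eigenvec(lines):
    """Map individual ID -> (PC1, PC2), assuming FID IID PC1 PC2 ... columns."""
    coords = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 4 or parts[0].startswith("#") or parts[0] == "FID":
            continue  # skip blank, short, or header lines
        coords[parts[1]] = (float(parts[2]), float(parts[3]))
    return coords


def centroid(coords, members):
    """Mean PC1/PC2 position of the listed cluster members."""
    xs = [coords[i][0] for i in members]
    ys = [coords[i][1] for i in members]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def distances(coords, center):
    """Euclidean distance of every sample from the given center."""
    cx, cy = center
    return {i: math.hypot(x - cx, y - cy) for i, (x, y) in coords.items()}


# Toy eigenvec rows: all IDs and coordinates are made up for illustration.
toy_eigenvec = [
    "BEB BEB1 0.010 0.020",
    "BEB BEB2 0.012 0.018",
    "BEB BEB3 0.011 0.022",
    "ME ME 0.030 0.040",
]

coords = read_eigenvec(toy_eigenvec)
reference_cluster = ["BEB1", "BEB2", "BEB3"]
dist = distances(coords, centroid(coords, reference_cluster))
farthest = max(dist, key=dist.get)
print(farthest, round(dist[farthest], 4))
```

Distance in PC space is only a rough summary; which populations go into the PCA changes the axes, which is why the same sample is worth viewing in more than one context.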

I’ve seen plenty of other Bangladeshis and Bengalis on 23andMe. Many of us get pretty generic “You are 85-99% Bengali” interpretations. But that masks the reality of variable East Asian ancestry across Bengal (more in the east, less in the west).

It’s good to check with another method, so I used ADMIXTURE, and ran the Bangladeshis with Tamils (useful to represent South Indian for the purposes of my exploration), Dai (East Asian), and Iranians as the reference populations. Out of 83 Bangladeshis in the sample, I’m ranked 11th in percent Iranian and seventh in percent Dai. This aligns with what we see in the PCA. My “Tamil” ancestry proportion is correspondingly lower. What does that mean? Basically, I have less generic Indian ancestry than the typical South Asian person.
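
A ranking like this can be sketched in a few lines of Python. The example assumes ADMIXTURE's .Q output layout (one row per individual, one whitespace-separated ancestry proportion per component, in the same order as the samples in the input file). Which column corresponds to which reference population has to be worked out by inspecting the results; the column indices, sample IDs and proportions below are invented, and the function names are hypothetical.

```python
def parse_q(lines):
    """Parse .Q-style rows into lists of floats, one list per individual."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def rank_in_component(q_rows, sample_ids, component, target):
    """Return the 1-based rank of `target` by proportion in `component`.

    Rank 1 means the highest proportion among the listed samples.
    """
    values = {sid: row[component] for sid, row in zip(sample_ids, q_rows)}
    ordered = sorted(values, key=values.get, reverse=True)
    return ordered.index(target) + 1


# Toy K=3 output: IDs and proportions are made up for illustration.
toy_q = [
    "0.70 0.20 0.10",
    "0.66 0.24 0.10",
    "0.60 0.22 0.18",
    "0.55 0.25 0.20",
]
toy_ids = ["B1", "B2", "B3", "ME"]

q_rows = parse_q(toy_q)
rank_first_column = rank_in_component(q_rows, toy_ids, 0, "ME")
rank_third_column = rank_in_component(q_rows, toy_ids, 2, "ME")
print(rank_first_column, rank_third_column)
```

Because each row sums to one, ranking high on two components necessarily pushes the remaining component down, which is the pattern described above.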

How long did it take me to figure this out? About 15 minutes. The methods of simple data exploration aren’t difficult or necessarily time-consuming. They’re just about the initial set-up time (which my course helps you leapfrog over) plus repetition and iteration.

This week is my first time running this, and I’m pretty excited. I trust the turnout this first time will be light (or perhaps we should say “select”). But these tools are something I’m eager to share. I want to equip people to be able to pose their own questions and help themselves, their family and friends get more out of their data. So in addition to what will probably be a very nice student-teacher ratio, you’ll be helping me shape something in order to empower as many lay people as possible.

14 thoughts on “DIY your DTC DNA”

  1. Two questions (not participating this time but very likely to at some point in the future if you do this again):

    Do you need to have updated 23andme to the newer chip or is the older data good enough?

    If you have results downloaded from two or more of 23andMe, Ancestry, and Family Tree is there any reason to favor one over the others for this exercise?

    Thanks.

  2. old data is fine

    no big difference though the newest version of 23andme chip has a lower overlap with the snp array

    i am probably going to do a lot of updating after this workshop of the files and what not. might start a slack group

  3. no way i can do deconvolution in a 1 hour thing!!! and this is my first.

    already have 6 sign-ups

    i may think about doing deconvolution in the future, but only for ppl who already did this course. i can’t just jump into that immediately

    so yes genomewide

  4. Are you planning on offering a workshop in the future where Windows users can participate with the three tools?

  5. yes

    this is the first time, so i’m investing a lot of time and effort figuring this out. hopefully in the future i can move onto next steps

  6. You talked me into it! I’m a longtime GNXP reader and amateur genealogist. I’m looking forward to getting more familiar with these tools, not only to understand my own ancestry, but also to better understand your posts here.

  7. After learning to use Dienekes’ DIY calculators (about 7 years ago), I used your how-to regarding Plink from the discover-magazine blog [good-stuff, thank you!]. Then I progressed to Alder, Beagle, RfMix, TreeMix, ChromoPainter and dozens of other approaches. Currently I am working with FastSMC … Regarding ADMIXTURE-like programs there is a disparity between deriving global ancestry (by putting an entire genome into “one window” and allowing a maximum-likelihood algorithm to “tamper” the “evidence” regardless of how one experiments with levels of convergence); and more speculative approaches regarding local ancestry in smaller “windows.” I don’t believe there is a single tool that will reliably answer the question of where all of one’s ancestors came from (and when). Though I’m positive that most methodologies will return an appropriate Clinal grouping of one’s ancestry. 23andMe, I believe, is striving very hard to tell customers what they already suspect.

    As to ‘A scalable pipeline for local ancestry inference using tens of thousands of reference haplotypes ‘ — I’m curious how they divided a human genome into ~1800 windows of ~0.6 cM per window…?

  8. According to Admixture Studio’s genotype coverage analysis feature, 23andme’s raw data (V5) only gives me about 25% coverage. While Ancestry’s raw data (Tested with them about 2 years ago) gives me about 95% genotype coverage. Moreover, the results are noticeably different, when I analyze them using various amateur tools like Vahaduo. Thus, I would say, Ancestry is a better option to go with IMHO.

  9. @thewarlock

    A lot of people feel pride at our indigenous roots. Take it from this Mallu.

    But the Pathan guy seems to be a South Asian troll, not a real Pathan. For one, real Pathans really hate the word Pathan. I know a few in real life. Pathan is an Indo-Paki word. The real ones call themselves Pushtun. Secondly, the dude is way too invested in the difference between Sindhi and Gujarati. Why would a real one even care?

    I’ve noticed that a lot of Pakistani trolls like to adopt Pathan personas. They’re impressed with them I guess
