Razib Khan 30x whole-genome sequence data

About four years ago I posted my genotype data for anyone who wanted it. This included the raw export files from consumer genomics firms + my VCF file generated by Dante Labs.

Today I am making all of my raw data from Dante Labs public. This means you can access

– raw reads
– .bam file
– .vcf files, as well as files with CNVs and SVs

This is all 30x coverage so be warned these aren’t the smallest files. Here is the link to my data.

If you find something noteworthy, reach out to me! For those who want geographic provenance, seven of my eight great-grandparents were born in the Comilla region of modern Bangladesh. The eighth was born in the Noakhali region, just to the south of Comilla.

At least today we can explore personal genomics

A very long piece on the “personal genomics industry.” Lots of quotes from my boss Spencer Wells, since he has been in the game so long.

The piece covers all the bases. I actually think some of the criticisms of direct-to-consumer genetics are on base. I just don’t think they’re insoluble problems, or problems so large that they should discourage the industry from growing. I think part of the problem is that many of the people journalists can talk to who can comment on the industry are based in academia, and academia has a different focus when it comes to genetics than the nascent industry. For rational reasons academics need to be very careful when it comes to ethics. Consumer products I think are somewhat different.

But I do think we need to reflect how far we’ve come in 10 years. Back in the 2000s when I was reading stuff on Y, mtDNA and autosomal studies, I honestly didn’t imagine that I would know my own haplogroups and genome-wide ancestry decomposition. It seemed like science fiction. That all changed rather rapidly over a few years, and I purchased kits in the early years when the price was still high. Today it’s a mass industry, with a sub-$100 price point in many cases.

Yes, there are plenty of cautions and worries we need to consider. But the future is already the present, and the horse has left the stable.

Notes from the personal genomic inflection point


There’s a debate that periodically crops up online about the utility, viability, and morality of returning results from genetic tests to consumers. Consumers here means people like you or me. Pretty much everyone.

If you want to caricature two stylized camps, there are information maximalists who proclaim a utopia now, where people can find out so much about themselves through their genome. And then there are information elitists, who emphasize that the public can’t handle the truth. Or, more accurately, that throwing information without context and interpretation from someone who knows better is not just useless, it’s dangerous.

Of course, most people will stake out more nuanced, complex positions. That’s not the point. Here is my bottom line, which I’ve probably held since about 2010:

  1. The value for most people in actionable information in direct-to-consumer genetics is probably not there yet when set against the cost.
  2. With the reduction in the cost of genotyping and sequencing, there’s no way that we have enough trained professionals to handle the surfeit of information. And there will really be no way in 10 years when a large proportion of the American population will be sequenced.

At some point, the cost will come down enough, and the science probably is strong enough, that direct-to-consumer genetics moves away from novelty and early adopters to the mass market. At that point, we need to be able to make the best use of that data. Genetic counselors, geneticists, and doctors all cost a fair amount of money and have a finite amount of labor supply to provide to the public. They need to focus on serious, complex, and consequential cases.

To some extent, we need to reduce much of the interpretation in the personal genomics space to an information technology problem. For example, if someone’s genotype pulls out a bunch of statistically significant hits of interest, the tool should automatically condition significance on that individual’s genetic background.

Yes, there are primitive forms of these sorts of tools out there already. But they’re not good enough. And that’s because there isn’t the market need. But there will be.

The 23andMe BRCA test

In case you were sleeping under a rock, 23andMe got FDA approval for DTC testing of markers related to BRCA risk. Obviously, this is a pretty big step, in principle.

But the short-term implications are not that earth-shaking.

From the FDA release:

The three BRCA1/BRCA2 hereditary mutations detected by the test are present in about 2 percent of Ashkenazi Jewish women, according to a National Cancer Institute study, but rarely occur (0 percent to 0.1 percent) in other ethnic populations. All individuals, whether they are of Ashkenazi Jewish descent or not, may have other mutations in BRCA1 or BRCA2 genes, or other cancer-related gene mutations that are not detected by this test. For this reason, a negative test result could still mean that a person has an increased risk of cancer due to gene mutations….

Apparently, women with one of these variants have a 45-85% chance of developing breast cancer by age 70. So the penetrance is high.

It seems that you’ll know if this sort of test is going to have utility for you based on family history.

The big thing is the transition to DTC. This will increase availability and drive the price down. That’s probably going to mean more work for those engaged in interpretation and education. False positives are going to start being a major thing….
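To see why false positives matter more as testing spreads beyond high-risk groups, here is a minimal Python sketch of the base-rate arithmetic. The carrier frequencies come from the FDA release quoted above; the sensitivity and specificity are purely illustrative assumptions, not figures for the 23andMe test or any other real assay.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true carrier (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Carrier frequencies taken from the FDA release quoted above.
ASHKENAZI_PREVALENCE = 0.02   # about 2 percent
OTHER_PREVALENCE = 0.001      # upper end of the 0 to 0.1 percent range

# ASSUMPTION: made-up error rates, chosen only to illustrate the effect.
SENSITIVITY = 0.99
SPECIFICITY = 0.999

for label, prevalence in [("Ashkenazi", ASHKENAZI_PREVALENCE),
                          ("other", OTHER_PREVALENCE)]:
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"{label}: share of positives that are real = {ppv:.1%}")
```

Under these assumed error rates, the same test yields mostly true positives in the higher-prevalence group and roughly a coin toss in the lower-prevalence one, which is the sense in which broader availability means more interpretation work.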

Personal genomics question: am I related to my 5th cousins?

There are some personal genomics questions I get over and over via email. I thought I would post an answer so that Google could pick it up.

One of them usually has to do with whether someone is “really” related to someone who comes up as a 5th cousin on a DTC service. What does “really” mean?

Graham Coop has done the formal work to show that it’s highly likely all our genealogies intersect at some point in the recent past. Several years ago in his paper with Peter Ralph, The Geography of Recent Genetic Ancestry across Europe, they used genetic data to infer just how closely European lineages coalesced with each other over the past few thousand years.

So yes, you are related (though that doesn’t mean you have matching genetic segments).

But that’s not the question people are really asking about. They are asking, does this DNA match increase the probability that I’m somehow related to this person?

In general, I think not. For example, I regularly get queries from South Asians about distant matches with Europeans. Does this mean they are European? No. I think what it means is:

1) There are lots of Europeans in the database, so a false positive match is likely to be European, even if you are non-European.

2) At short genetic distances, the segments are really mostly some sort of false positive.

Genomic ancestry tests are not cons, part 1

As someone who is part of the personal genomics sector, I keep track of media representations of the industry very closely. There is the good and the bad, some justified and some not.

But there is one aspect which I need to weigh in on because it is close to my interests and professional focus, and it is one where I have a lot of experience: ancestry inference on human data.

Periodically I see in my Twitter timeline an article shared by a biologist which is filled with misrepresentations, confusions, and even falsehoods. Of course, some of the criticisms are correct. The problem is that when you mix truth and falsehood or sober analysis and critique with sensationalism the whole product is debased.

I’m going to address some of the most basic errors and misimpressions. This post is “part 1” because I might have follow-ups, as I feel like this is a situation where I have to put out fires periodically, as people write about things they don’t know about, and then those articles get widely shared to a credulous public.

First, if an article mentions STRs or microsatellites or a test with fewer than 1,000 markers in a direct to consumer genomic context, ignore the article. This is like a piece where the author dismisses air travel because it’s noisy due to propeller-driven planes. Propeller-driven planes are a very small niche. Similarly, the major direct to consumer firms which have sold close to ~10 million kits do not use STRs or microsatellites, very much a technology for the 1990s and 2000s. Any mention of STRs or microsatellites or low-density analyses indicates the journalist didn’t do their homework, or simply doesn’t care to be accurate.

Second, there is constant harping on the fact that different companies give different results. This is because tests don’t really give results so much as interpretations. The raw results consist of your genotype. On the major SNP-chip platforms this will be a file on the order of 20 MB. The companies could provide this as the product, but most humans have difficulty grokking over 100,000 variables.

So what’s the solution? The same that scientists have been using for decades: reduce the variation into a much smaller set of elements which are human digestible, often through tables or visualization.

For example, consider a raw data set consisting of my three genotypes from 23andMe, Ancestry, and Family Tree DNA. Merged with public data these are ~201,000 single nucleotide markers. You can download the plink formatted data yourself and look at it. The PCA below shows where my three genotypes are positioned, by the Tamil South Asians. Observe that my genotypes are basically at the same point:

The differences between the different companies have nothing to do with the raw data, because with hundreds of thousands of markers they capture enough of the relevant between population differences in my genome (do you need to flip a coin 1 million times after you’ve flipped it 100,000 times to get a sense of whether it is fair?). The law of large numbers is kicking in at this point, with genotyping errors on the order of 0.5% not being sufficient to differentiate the files.
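The coin-flip intuition can be put in numbers with a short Python sketch. It treats each marker as an independent flip, which is a simplifying assumption (real markers are correlated through linkage), so read it as a rough guide to the scale of the effect rather than a model of any company's pipeline.

```python
import math


def standard_error(p, n):
    """Standard error of an estimated proportion p after n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Half-width of an approximate 95% interval around a fair coin's frequency.
for n in (100, 100_000, 1_000_000):
    half_width = 1.96 * standard_error(0.5, n)
    print(f"n = {n:>9,}: 0.5 +/- {half_width:.4f}")
```

At 100,000 flips the interval is already about three tenths of a percentage point either way; going to a million flips only tightens it to about a tenth, which is why adding more markers changes little.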

Sure enough, the raw genotype files of the three services match pretty closely: 99.99% for Family Tree DNA and 23andMe, 99.7% for Family Tree DNA and Ancestry, and 99.6% for Ancestry and 23andMe. For whatever reason Ancestry is the outlier here. My personal experience looking at genotype data from Illumina chips is that most are pretty high quality, but it’s not shocking to see instances with 0.5% no call or bad call rates. For phylogenetic purposes if the errors are not systematic it’s not a big deal.
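For anyone who wants to run this kind of comparison on their own files, here is a minimal Python sketch of a pairwise concordance calculation. The marker IDs and genotypes are made up, the `no_call` marker is an assumed parameter (raw export formats differ between companies), and strand flips are ignored, so this is an outline of the idea rather than the exact procedure behind the percentages above.

```python
def concordance(calls_a, calls_b, no_call="--"):
    """Fraction of shared, called markers where two genotype files agree.

    calls_a, calls_b: dicts mapping marker ID -> two-letter genotype string.
    """
    shared = [
        marker for marker in calls_a.keys() & calls_b.keys()
        if calls_a[marker] != no_call and calls_b[marker] != no_call
    ]
    if not shared:
        raise ValueError("no overlapping called markers")
    # Sort alleles so "AG" and "GA" count as the same unphased genotype.
    matches = sum(
        sorted(calls_a[marker]) == sorted(calls_b[marker]) for marker in shared
    )
    return matches / len(shared)


# Hypothetical toy data standing in for two raw export files.
file_a = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "--"}
file_b = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AA", "rs5": "GG"}

print(f"concordance: {concordance(file_a, file_b):.2%}")
```

Only markers present and called in both files are counted, so a no-call in one file lowers the number of comparisons rather than the match rate.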

The identity to other populations is consistent. About 74% to Tamils. 72-73% for other Eurasians. 71% for the Surui, an isolated Amazonian group. And 69% to Yoruba. Observe that this recapitulates the phylogenetic history of what we know for the population which I am from, Bengalis. The greater the genetic distance between two populations due to distinct evolutionary histories the greater the genetic divergence. This is not rocket science. This gets to the point that the raw results make a lot more sense when you integrate and synthesize them with other information you have. Most customers are not going into the process of getting a personal genomic ancestry test blind…but that causes pitfalls as well as opportunities.

But most people do not receive statistics of the form:

                  SNP Identity
You   Yoruba      0.69
You   German      0.72
You   Japanese    0.73
You   Tamil       0.74

Mind you, this is informative. It’s basically saying I am most genetically distant from Yoruba and closer in sequence to Tamils. But this is somewhat thin gruel for most people. Consider the below which is a zoom in of PC 2 vs. PC 4. I am blue and the purple/pink are Tamils, and the population at the bottom left are East Asians.

If you look at enough PCA plots it will become rather clear I am shifted toward East Asians in comparison to most other South Asians. The high identity that I have with Japanese and Dai is due in part to the fact that I have relatively recent admixture from an East Asian population, above and beyond what is typical in South Asians. Remember, all three of my genotypes are basically on the same spot on PCA plots. That’s because they’re basically the same. Genotyping error is rather low.

How do we summarize this sort of information for a regular person? The standard method today is giving people a set of proportions with specific population labels. Why? People seem to understand population labels and proportions, but can be confused by PCA plots. Additionally, the methods that give out populations and proportions are often better at capturing pulse admixture events relatively recent in time than PCA, and for most consumers of ancestry services, this is an area that they are particularly focused on (i.e., Americans).

An easy way to make one’s genetic variation comprehensible to the general public is to model them as a mixture of various populations that they already know of. So consider the ones above in the plink file. I ran ADMIXTURE in supervised mode, progressively removing populations, for my three genotypes. The results are below.

               Dai  Druze  German  Japanese  Papuan  Sardinian  Surui  Tamil  Yoruba
Razib23andMe   11%     3%      8%        4%      1%         0%     1%    73%      1%
RazibAncestry  10%     2%      8%        4%      1%         0%     1%    73%      1%
RazibFTDNA     11%     2%      8%        3%      1%         0%     1%    72%      1%

               Dai  Druze  German  Japanese  Papuan  Sardinian  Surui  Tamil
Razib23andMe   11%     3%      8%        4%      1%         0%     1%    73%
RazibAncestry  10%     3%      8%        4%      1%         0%     1%    74%
RazibFTDNA     11%     3%      8%        3%      1%         0%     1%    73%

               Dai  Druze  Japanese  Papuan  Surui  Tamil
Razib23andMe   10%     9%        4%      1%     1%    74%
RazibAncestry  10%     9%        4%      1%     1%    75%
RazibFTDNA     11%     9%        4%      1%     1%    74%

               Dai  Japanese  Surui  Tamil
Razib23andMe   11%        4%     1%    84%
RazibAncestry  10%        4%     1%    85%
RazibFTDNA     11%        3%     1%    84%

Please observe again that they are broadly congruent. These methods exhibit a stochastic element, so there is some noise baked into the cake, but with 200,000+ markers and a robust number of reference populations the results come out the same across all methods (also, 23andMe and Family Tree DNA seem to correlate a bit more, which makes sense since these two genotypes are more similar to each other than they are to Ancestry).

Observe that until I remove all other West Eurasian populations the Tamil fraction in my putative ancestry is rather consistent. Why? Because my ancestry is mostly Tamil-like, but social and historical evidence would point to the likelihood of some exogenous Indo-Aryan component. Additionally, seeing as how very little of my ancestry could be modeled as West African removing that population had almost no impact.

When there were three West Eurasian populations, Germans, Druze, and Sardinians, the rank order was in that sequence. After removing Germans and Sardinians, the Druze picked up most of that ancestral component. This is a supervised method, so I’m assigning the empirical populations as reified clusters which can be used to reconstitute the variation you see in my own genotype. No matter what I put into the reference data, the method tries its best to assign proportions to populations.
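The logic of forcing a genotype onto a fixed set of reference populations can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not ADMIXTURE itself: it assumes the reference allele frequencies are known exactly and held fixed, treats markers as independent, and uses a plain EM update for the mixture proportions. The populations "A", "B", and "C", their frequencies, and the simulated individual are all made up.

```python
import random


def estimate_proportions(genotype, ref_freqs, iterations=100):
    """EM estimate of mixture proportions with fixed reference frequencies.

    genotype  : list of alt-allele counts (0, 1, or 2), one per marker
    ref_freqs : dict of population name -> list of alt-allele frequencies
    """
    pops = list(ref_freqs)
    q = {k: 1.0 / len(pops) for k in pops}  # start from equal proportions
    n_alleles = 2 * len(genotype)
    for _ in range(iterations):
        new_q = dict.fromkeys(pops, 0.0)
        for j, g in enumerate(genotype):
            p_alt = sum(q[k] * ref_freqs[k][j] for k in pops)
            p_ref = 1.0 - p_alt
            for k in pops:
                f = ref_freqs[k][j]
                # Expected number of this marker's alleles drawn from k.
                if g:
                    new_q[k] += g * q[k] * f / p_alt
                if g < 2:
                    new_q[k] += (2 - g) * q[k] * (1.0 - f) / p_ref
        q = {k: new_q[k] / n_alleles for k in pops}
    return q


def simulate_genotype(true_q, ref_freqs, rng):
    """Draw two alleles per marker, each from a population chosen by true_q."""
    pops = list(true_q)
    weights = [true_q[k] for k in pops]
    n_markers = len(ref_freqs[pops[0]])
    genotype = []
    for j in range(n_markers):
        count = 0
        for _ in range(2):
            source = rng.choices(pops, weights=weights)[0]
            count += rng.random() < ref_freqs[source][j]
        genotype.append(count)
    return genotype


# Hypothetical reference panel and individual, for illustration only.
rng = random.Random(1)
ref = {k: [rng.uniform(0.05, 0.95) for _ in range(3000)] for k in "ABC"}
true_q = {"A": 0.7, "B": 0.2, "C": 0.1}
individual = simulate_genotype(true_q, ref, rng)

estimated = estimate_proportions(individual, ref)
print({k: round(v, 2) for k, v in estimated.items()})

# Drop population A from the panel: the method still returns proportions
# that sum to 1, spreading the individual over whatever references remain.
reduced = {k: ref[k] for k in "BC"}
print({k: round(v, 2) for k, v in estimate_proportions(individual, reduced).items()})
```

The second call is the point of the exercise. The estimator has no way to say "none of the above," so removing a reference population simply redistributes that ancestry among the ones left in the panel.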

The question then comes into the stage of subtle choices one makes to obtain the most informative inferences for the customer. These are not always matters of different results in terms of accuracy or precision, but often of presentation. If West Eurasian populations are removed entirely, my Tamil fraction inflates. That’s the closest to the West Eurasian populations left in the data. In contrast, the East Asian fraction remains the same because I’ve left the two proxy populations in the data (I rigged the die here because I know I have Tibeto-Burman admixture which is a combination of Northeast and Southeast Asian).

Let’s do something different. I’m going to swap out the West Eurasian populations with equivalents.

               Armenians  Dai  French_Basque  Japanese  Mandenka  Surui  Sweden  Tamil
Razib23andMe          6%  11%             0%        4%        1%     1%      5%    72%
RazibAncestry         5%  11%             0%        4%        1%     1%      5%    73%
RazibFTDNA            6%  11%             0%        4%        1%     1%      5%    72%

               German  Papuan  Yoruba
Razib23andMe      68%     20%     13%
RazibAncestry     68%     20%     13%
RazibFTDNA        68%     20%     13%

               French_Basque  Tamil
Razib23andMe              8%    92%
RazibAncestry             7%    93%
RazibFTDNA                8%    92%

               Tamil  Yoruba
Razib23andMe     97%      3%
RazibAncestry    97%      3%
RazibFTDNA       97%      3%

I have no ancestry from French Basque, but I do have ancestry from Armenians and Swedes in this model. Why? If you keep track of the most recent population genomic ancestry this all makes sense. But if you don’t, well, it’s harder to unpack. This is part of the problem with these sorts of tests: how to make it comprehensible to the public while maintaining fidelity to the latest research.

This is not always easy, and differences between companies in terms of interpretation are not invidious as some of the press reports would have you think, but a matter of difficult choices and trade-offs one needs to make to give value to customers. True, this could all be ironed out if there was a ministry of genetic interpretation and a rectification of names in relation to population clusters, but right now there isn’t. This allows for both brand differentiation and engenders confusion.

In most of the models with a good number of populations, my Tamil ancestry is in the low 70s. Notice then that some of these results are relatively robust to the populations one specifies. Some of the patterns are so striking and clear that one would have to work really hard to iron them out and mask them in interpretation. But what happens when I remove Tamils and include populations I’m only distantly related to? This is a ridiculous model, but the algorithm tries its best. My affinity is greatest to Germans, both because of shared ancestry and because Papuans, the alternative, have relatively high drift from other East Eurasians as well as Denisovan ancestry. But both Papuan and Yoruba ancestry are assigned because I’m clearly not 100% German, and I share alleles with both these populations. In models where there are not enough populations to “soak up” an individual’s variation, but you include Africans, it is not uncommon for African ancestry to show up at low fractions. If you take Europeans, Africans, and East Asians, and force two populations out of this mix, then Europeans are invariably modeled as a mix of Africans and East Asians, with greater affinity to the latter.

Even when you model my ancestry as Tamil or Yoruba, you see that there is a Yoruba residual. I have too much genetic variation that comes from groups not closely related to the variation you find in Tamils to eliminate this residual.

Just adding a few populations fixes this problem:

               Dai  Tamil  Yoruba
Razib23andMe   14%    83%      2%
RazibAncestry  14%    84%      2%
RazibFTDNA     14%    83%      2%

               Dai  German  Tamil  Yoruba
Razib23andMe   15%     10%    74%      1%
RazibAncestry  14%      9%    75%      1%
RazibFTDNA     15%     10%    74%      1%

Notice how my Tamil fraction is almost the same as when I had included many more reference populations. Why? My ancestral history is complex, like most humans, but it’s not that complex. The goal for public comprehensibility is to reduce the complexity into digestible units which give insight.

Of course, I could just say read Inference of Population Structure Using Multilocus Genotype Data. The basic framework was laid out in that paper 17 years ago for model-based clustering of the sort that is very common in direct to consumer services (some use machine learning and do local ancestry decomposition across the chromosome, but really the frameworks are an extension of the original logic). But that’s not feasible for most people, including journalists.

Consider this piece at Gizmodo, Why a DNA Test Is Actually a Really Bad Gift. I pretty much disagree with a lot of the privacy concerns, seeing as how I’ve had my public genotype downloadable for seven years. But this portion jumped out at me: “Ancestry tests are based on sound science, but variables in data sets and algorithms mean results are probabilities, not facts, as many people expect.”

Yes, there are probabilities involved. But if a DNA test using the number of markers above tells you you are 20% Sub-Saharan African and 80% European in ancestry, that probability carries the same sort of confidence as your determining that a coin flip is fair after 100,000 flips. True, you can’t be totally sure after 100,000 flips that you have a fair coin, but you can be pretty confident. With hundreds of thousands of markers, a quantum of 20% Sub-Saharan African in a person of predominantly European heritage is an inference made with a degree of confidence that verges upon certitude within a percentage or so.

As for the idea that they are not “facts,” I don’t even know what that means in this context. And I doubt the journalist does either. Which is one of my main gripes with these sorts of stories: unless they talk to a small subset of scientists the journalists just don’t know what they are talking about when it comes to the statistical genetics.

Finally, there is the issue of what it even means to be some percentage of population X, Y, or Z. Even many biologists routinely reify and confuse the population clusters with something real and concrete in a Platonic sense. But deep down when you think about it we all need to recall we’re collapsing genealogies of many different segments of DNA into broad coarse summaries when we say “population.” And populations themselves are by their nature often somewhat open and subject to blending and flow with others. A population genomic understanding of structure does not bring into clarity Platonic facts, but it gives one instruments and tools to smoke out historical insight.

The truth, in this case, is not a thing in and of itself, but a dynamic which refines our intuitions of a fundamentally alien process of Mendelian assortment and segregation.

Razib Khan’s raw genotype data on 23andMe, Family Tree DNA, Geno 2.0 and Ancestry

It has been a while since I posted an update on my genotype. Since then I’ve been tested on most of the major platforms. I don’t see any harm in releasing this to the public or researchers who want to look at it (though I don’t know why anyone would).

You can download all the files here.

Having my genotypes public is pretty useful for me. If I inquire about someone’s genetics oftentimes people get weirdly defensive and ask “what about you?” I just invite them to look at my raw data and analyze it for themselves! I’m not a hypocrite about this.

Over the years I’ve had researchers inquire about my ethnicity when they stumble upon my genotype on platforms such as openSNP. So in full disclosure, most of my ancestry is pretty standard eastern Bengali. I’m more East Asian shifted than most Bangladeshi samples in the 1000 Genomes project, but then my family is from Comilla, in the far east of eastern Bengal (anyone who cares, my Y is of course R1a1a-Z93 and my mtDNA U2b).

As before I’ll put the genotype under a Creative Commons license.

$9.99 to get into the Helix exome ecosystem

Will try to keep self-interested product placement to a minimum normally, but I thought I’d pass on that Helix has a $100 off sale for the next 72 hours. That means that the company I work for has a Neanderthal app on sale for $9.99. The regular price is $29.99, plus an added $80.00 for exome+ sequencing if you aren’t in the Helix database (which most people are not).

The upshot here is that the $9.99 will get you an exome+ sequence, which at some point in early 2018 you can download for $600. But if you don’t want to download it, it’s a great way to get into the ecosystem on-the-cheap.

I assume most of my readers know what the exome is, but it’s basically the portion of your genome which is directly translated into functional proteins. That’s about 1% of the genome, or ~30,000,000 bases. This is a major expansion on the DTC SNP-chip platforms, which are in the 500,000 to 1,000,000 SNP range.

Anyway, not sure this will be appealing to readers who need a full download of data. But if you are the type who is more interested in getting applications related to your genome, this is a pretty good deal at a sub-$10 price point.

Note: To my knowledge only ships to USA currently.

Introducing DNAGeeks.com

Four years ago my friend David Mittelman and I wrote Rumors of the death of consumer genomics are greatly exaggerated. The context was the FDA crackdown on 23andMe. Was the industry moribund before it began? The title gives away our opinion. We were personally invested. David and I were both working for Family Tree DNA, which is part of the broader industry. But we were sincere too.

Both of us have moved on to other things. But we still stand by our original vision. And to a great extent, we think we had it right. The consumer genomics segment in DTC is now nearing 10 million individuals genotyped (Ancestry itself seems to have gone north of 5 million alone).

One of the things that we observed in the Genome Biology piece is that personal genomics was still looking for a “killer app”, like the iPhone. Since then the Helix startup has been attempting to create an ecosystem for genomics with a variety of apps. Though ancestry has driven nearly ten million sales, there still isn’t something as ubiquitous as the iPhone. We’re still searching, but I think we’ll get there. Data in search of utility….

David and I are still evangelizing in this space, and together with another friend we came up with an idea: DNAGeeks. We’re starting with t-shirts because it’s something everyone understands, but also can relay our (and your) passion about genomics. We started with “Haplotees.” Basically the most common Y and mtDNA lineages. This might seem silly to some, but it’s something a lot of people have an interest in, and it’s also a way to get ‘regular people’ interested in genetics. Genealogy isn’t scary, and it’s accessible.

We are also field-testing other ideas. If there is a demand we might roll out a GNXP t-shirt (logo only?). The website is obscure enough that it won’t make sense to a lot of people. But perhaps it will make sense to the people who you want it to make sense to!

Anyway, as they say, “keep watching this space!” We don’t know where DNAGeeks is going, but we’re aiming to have fun with genomics and make a little money too.