Notes on the Common Disease-Common Variant debate: two years later

Just over two years ago, I wrote a brief post explaining why I find the “debate” about common variants versus rare variants in human medical genetics to be largely unhelpful. After explaining some of the rationale for looking for common variants that affect disease susceptibility, I concluded as follows:

So am I then arguing in favor of the CDCV [Common Disease-Common Variant] hypothesis? Of course not– rare variants, aside from being predictive for disease in some individuals, also give important insight into the biology of the disease. But it is possible right now, using genome-wide SNP arrays and databases like the HapMap, to search the entire genome for common variants that contribute to disease. This is an essential step–finding the alleles that contribute disproportionately to the population-level risk for a disease. Eventually, the cost of sequencing will drop to a point where rare variants can also be assayed on a genome-wide, high-throughput scale, but that’s not the case yet. Once it is, expect the CDRV [Common Disease-Rare Variant] hypothesis to be trumpeted as right all along.

Well, two years later, the price of sequencing has dropped precipitously. And in this week’s New England Journal of Medicine, David Goldstein makes the argument that association studies using common variants have been disappointing and what people really need to be doing is–would you believe it?–searching for rare variants using sequencing.

Your opinion about the current crop of genome-wide association studies depends, of course, on what you were expecting to begin with: if you thought that a few common variants would be discovered for each common disease and fully explain its prevalence, you’re likely to think the whole enterprise has been a bust (with a few exceptions, of course–Goldstein mentions exfoliation glaucoma and macular degeneration). If, on the other hand, you thought that genome-wide association studies would have about as much success as the linkage and candidate gene studies that preceded them (Daniel MacArthur characterized the field as a “scientific wasteland” prior to 2005, and that’s only mild hyperbole), you’re probably astounded by their success.

In any case, the objections to large association studies have been numerous, but Goldstein has come up with the most bizarre one yet–that large association studies using common variants might find too many things! The premise is this (and let’s take a non-disease trait like height as an example): current association studies have identified many loci of small effect that influence human height. Together, these loci account for ~3% of the population variation in height. If we assume these are the largest effect sizes out there to find, and an exponential distribution of effect sizes (both probably approximately fair assumptions), then a massive number of loci must influence height, potentially in genes across the entire genome. Thus, “[i]f common variants are responsible for most genetic components of type 2 diabetes, height, and similar traits, then genetics will provide relatively little guidance about the biology of these conditions, because most genes are ‘height genes’ or ‘type 2 diabetes genes.’”

The solution to this problem, Goldstein claims, is to look for rare variants that (he presumes) have larger effects. This claim, though it appears reasonable, is a non sequitur, because Goldstein is conflating two definitions of effect size. In definition one, effect size is defined as the proportion of variance in a trait explained by a polymorphism. In definition two, effect size is defined as the difference in mean trait value between two genotype classes. Why is this a problem? Because the proportion of variance in a trait explained by a polymorphism is a function of both its frequency and the impact it has on the trait [1]. To re-use a previous example, imagine smoking cigarettes gives you a 5% chance of developing lung cancer, while working in an asbestos factory gives you a 70% chance. In sense one, smoking has a larger effect size–since so many more people smoke than work in asbestos factories, the number of lung cancer cases due to smoking is much higher than the number due to asbestos. However, in sense two, working in an asbestos factory has the larger effect size–the probability of developing the disease is much higher. Thus, though a rare polymorphism might have a large effect (in sense 2), it will explain a tiny amount of the variance in the trait simply due to the fact it is rare [2].
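The smoking/asbestos example can be put into numbers with a minimal Python sketch. The two risks (5% and 70%) are the ones used above; the population size and the exposure fractions are hypothetical values picked only for illustration, and the sketch ignores baseline risk and any overlap between the two groups.

```python
# Per-person risks taken from the example in the text.
RISK_SMOKING = 0.05
RISK_ASBESTOS = 0.70

# Hypothetical assumptions for illustration only, not real data.
POPULATION = 1_000_000
FRACTION_SMOKERS = 0.20     # assumed share of the population that smokes
FRACTION_ASBESTOS = 0.001   # assumed share working in asbestos factories


def expected_cases(population, fraction_exposed, risk):
    """Expected cases among the exposed (ignores baseline risk and overlap)."""
    return population * fraction_exposed * risk


smoking_cases = expected_cases(POPULATION, FRACTION_SMOKERS, RISK_SMOKING)
asbestos_cases = expected_cases(POPULATION, FRACTION_ASBESTOS, RISK_ASBESTOS)

# Sense one: contribution to the population-level burden of disease.
print(f"cases attributable to smoking:  {smoking_cases:.0f}")
print(f"cases attributable to asbestos: {asbestos_cases:.0f}")

# Sense two: effect on the individual who is exposed.
print(f"risk ratio, asbestos vs smoking: {RISK_ASBESTOS / RISK_SMOKING:.0f}x")
```

With these made-up inputs, smoking accounts for about 10,000 expected cases against about 700 for asbestos, even though the individual risk from asbestos is 14 times higher. Different exposure fractions would change the counts, but the ordering flips only if the rare exposure stops being rare.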

The contention that the number of loci needed to explain the heritability of a trait will somehow be smaller if one looks at rare variation is simply false.

[1] Assuming additivity, the variance explained by a locus is 2p(1-p)a^2, where p is the allele frequency and a is half the difference between the means of each homozygote. See Figure 4.8 of Lynch and Walsh.

[2] For example, let’s use the equation in [1] and assume a polymorphism has a frequency of 0.001%. Then, in order for this polymorphism to account for 0.05% of the variation in height (on the small end of the proportions accounted for by common polymorphisms identified to date), a single allele would have to increase height by a whopping 5 standard deviations.
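The arithmetic in [1] and [2] can be checked with a short Python sketch. It assumes the trait is standardized to unit variance, so that a is measured in standard deviations and the result is directly a fraction of the total variance; the function names and the 30% "common" allele frequency are my own choices for illustration.

```python
import math


def variance_explained(p, a):
    """Additive variance explained by a biallelic locus: 2p(1-p)a^2.

    p: allele frequency; a: per-allele effect in SD units
    (assumes the trait has been standardized to variance 1).
    """
    return 2.0 * p * (1.0 - p) * a ** 2


def effect_needed(p, fraction):
    """Per-allele effect (in SD) needed to explain `fraction` of the variance."""
    return math.sqrt(fraction / (2.0 * p * (1.0 - p)))


TARGET = 0.0005      # 0.05% of the variance, as in footnote [2]
RARE_P = 0.00001     # allele frequency of 0.001%, as in footnote [2]
COMMON_P = 0.30      # hypothetical common allele, for comparison only

a_rare = effect_needed(RARE_P, TARGET)
a_common = effect_needed(COMMON_P, TARGET)

print(f"rare allele:   {a_rare:.3f} SD per allele")
print(f"common allele: {a_common:.3f} SD per allele")
```

Under these assumptions the rare allele needs a per-allele effect of about 5 SD, matching [2], while the hypothetical common allele needs only about 0.035 SD to explain the same share of the variance.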

7 Comments

  1. “Thus, though a rare polymorphism might have a large effect (in sense 2), it will explain a tiny amount of the variance in the trait simply due to the fact it is rare.” 
     
    This is true, but it is not Goldstein’s point. He is searching for core signaling pathways that can be manipulated by drugs, not doing population statistics. 
     
    The easiest way to identify those core pathways is to find variations that confer a severe loss (or gain) of function phenotype with high probability—a large “effect” in his article. And those deleterious variations will almost always be rare, since natural selection usually does not favor common deleterious variations.

  2. The easiest way to identify those core pathways is to find variations that confer a severe loss (or gain) of function phenotype with high probability 
     
    i think you’re referring to monogenic forms of complex diseases. while they exist (and have been studied successfully for some time), they are not the focus of Goldstein’s argument. (it’s often helpful to look at the figures of a paper to decide what the author thinks his/her main points are. In this case, Goldstein has one figure–a plot of the exponential decay of fraction of variance explained)
     
    if, for example, many cases of type II diabetes were due to severe mutations with high penetrance, that would have been noticed in pedigrees. Some instances have indeed been noticed [link], but they explain a relatively small fraction of all cases.  
     
    I do think looking at monogenic forms of disease is helpful. But that’s neither here nor there.  
     
    And note Goldstein does mention that common polymorphisms can identify interesting genes/drug targets:
     
    Some experts emphasize that small effect sizes don’t necessarily mean that a gene variant is of no interest or use. Effect size is a function of what a variant does: it may change only slightly a gene’s expression or a protein’s function. The gene’s pathway, however, may be decisive for a particular condition, or pharmacologic action on the same protein may produce much larger effects in controlling disease. These arguments are reasonable, as far as they go, and there are supporting examples, such as a polymorphism of modest effect in PPARG, a gene that encodes a drug target for diabetes.

  3. Ah. I see what you’re saying after reading it again when not sleepy. 
     
    Goldstein does say “Another possibility, however, is that some of the associations that are credited to common variants are actually synthetic associations involving multiple rare variants that occur, by chance, more frequently in association with one allele at a common SNP than with the other.” That is a bit of a reach, but he does not seem dead set on the number of “important” loci being small. 
     
    In any event, Cheap Sequencing Now!

  4. Hey p-ter, 
     
    I’m not sure that Goldstein actually directly conflates per-allele effect size with the fraction of variance explained.  
     
    I can’t say for sure, but I imagine he’s thinking about variants at a frequency of perhaps 0.1% with a per-allele effect of, say, 0.5 SD (about an inch of adult height, IIRC). Each such variant would explain roughly 0.05% of the population variance and yet would be essentially completely undetectable using current GWAS.
     
    There are various other permutations of frequency and effect size that would have the same sort of effect – basically, a lot of the variants that fall (in terms of both frequency and per-allele effect size) somewhere in the vast grey area between completely penetrant Mendelian variants and common variants with very low ORs. 
     
    Will these variants explain all of the missing heritability? I strongly doubt it, but I do think it’s reasonable to expect them to explain a reasonable chunk of it – and as I said at the end of my post, because of their large effect sizes these variants will actually be much more useful than common low-OR variants at informing individual health predictions.

  5. In any event, Cheap Sequencing Now! 
     
    agreed.  
     
    in some sense, advocating for sequencing as Goldstein is doing is like advocating for PCR–people are going to do it no matter what. ultimately, the study design for any disease is going to be full sequencing of tens of thousands of cases and controls (and discovery of both rare and common variants). Someone is going to assemble those massive cohorts, and why not genotype them on a standard chip once you’ve got them assembled? Sure, you’ll miss some things (which you might then find by sequencing those same individuals once the price gets reasonable), but you’ll probably also find a lot of interesting things.

  6. I can’t say for sure, but I imagine he’s thinking about variants at a frequency of perhaps 0.1% with a per-allele effect of, say, 0.5 SD (about an inch of adult height, IIRC). Each such variant would explain roughly 0.05% of the population variance and yet would be essentially completely undetectable using current GWAS.
     
    hm, yes, true. but then those effect sizes are still subject to his exponential decay curve–there will still be 93,000 variants affecting a trait. If there are 93,000 rare variants spread across 5,000 genes (or whatever) that affect a trait, I don’t see why that leads to more biological insight than 93,000 common variants spread across 5,000 genes.
     
    actually, i may still not be getting his argument. if we assume he’s not making the mistake I say he is, then (still following his example and model), there are ~100K variants, both rare and common, that affect height. If we identify those variants and they cluster together in pathways, then we gain biological insight. If they don’t cluster together, then we don’t. I guess we’ll see, and I presume they will. But what does this have to do with the frequency of the variants? It’s perfectly possible that rare variants in many, many genes impact a phenotype, no?

  7. ok, i’m re-reading this. he’s definitely implying that his calculations on the number of SNPs influencing a trait only apply if the SNPs are common, no? 
     
    Though the strongest SNP may have been found, many SNPs could remain unidentified in the range of the lower effects that have been determined. If such SNPs are accounted for, fewer SNPs will be required to explain a given proportion of variance. The sample sizes that have been studied for height, however, range from 14,000 to 34,000. At the lower sample size, the power of detection is 90% for the largest effect size; for effect sizes as small as 0.05%, the largest sample size provides a 10% chance of detection. Even if we conservatively assume that all remaining unidentified variants influencing height each explained as much as 0.05% of the variation, 1500 such variants would be required to explain the missing heritability. These calculations also assume that the effects of “height SNPs” are additive. If variants show meaningful interactions, a somewhat stronger genetic effect could emerge among variants with small individual effect sizes. But only dramatic departures from these assumptions would allow a manageable number of common SNPs to account for a sizable fraction of the heritability of height.