Friday, April 17, 2009

Notes on the Common Disease-Common Variant debate: two years later
posted by p-ter @ 4/17/2009 08:07:00 PM

Just over two years ago, I wrote a brief post explaining why I find the "debate" about common variants versus rare variants in human medical genetics to be largely unhelpful. I concluded thusly, after explaining some of the rationale for looking for common variants that affect disease susceptibility:
So am I then arguing in favor of the CDCV [Common Disease-Common Variant] hypothesis? Of course not-- rare variants, aside from being predictive for disease in some individuals, also give important insight into the biology of the disease. But it is possible right now, using genome-wide SNP arrays and databases like the HapMap, to search the entire genome for common variants that contribute to disease. This is an essential step--finding the alleles that contribute disproportionately to the population-level risk for a disease. Eventually, the cost of sequencing will drop to a point where rare variants can also be assayed on a genome-wide, high-throughput scale, but that's not the case yet. Once it is, expect the CDRV [Common Disease-Rare Variant] hypothesis to be trumpeted as right all along.
Well, two years later, the price of sequencing has dropped precipitously. And in this week's New England Journal of Medicine, David Goldstein makes the argument that association studies using common variants have been disappointing and what people really need to be doing is--would you believe it?--searching for rare variants using sequencing.

Your opinion about the current crop of genome-wide association studies depends, of course, on what you were expecting to begin with: if you thought that a few common variants would be discovered for each common disease and fully explain its prevalence, you're likely to think the whole enterprise has been a bust (with a few exceptions, of course--Goldstein mentions exfoliation glaucoma and macular degeneration). If, on the other hand, you thought that genome-wide association studies would have about as much success as the linkage and candidate gene studies that preceded them (Daniel MacArthur characterized the field as a "scientific wasteland" prior to 2005, and that's only mild hyperbole), you're probably astounded by their success.

In any case, the objections to large association studies have been numerous, but Goldstein has come up with the most bizarre one yet--that large association studies using common variants might find too many things! The premise is this (and let's take a non-disease trait like height as an example): current association studies have identified many loci of small effect that influence human height. Together, these loci account for ~3% of the population variation in height. If you assume that these are the largest effects out there to find, and that effect sizes follow an exponential distribution (both probably approximately fair assumptions), then a massive number of loci must influence height, potentially involving genes across the entire genome. Thus, "[i]f common variants are responsible for most genetic components of type 2 diabetes, height, and similar traits, then genetics will provide relatively little guidance about the biology of these conditions, because most genes are 'height genes' or 'type 2 diabetes genes.'"

The solution to this problem, Goldstein claims, is to look for rare variants that (he presumes) have larger effects. This claim, though it appears reasonable, is a non sequitur. The reason is that Goldstein is conflating two definitions of effect size. In definition one, effect size is the proportion of variance in a trait explained by a polymorphism. In definition two, effect size is the difference in mean trait value between two genotype classes. Why is this a problem? Because the proportion of variance in a trait explained by a polymorphism is a function of both its frequency and the impact it has on the trait [1]. To re-use a previous example, imagine smoking cigarettes gives you a 5% chance of developing lung cancer, while working in an asbestos factory gives you a 70% chance. Under definition one, smoking has the larger effect size--since so many more people smoke than work in asbestos factories, the number of lung cancer cases due to smoking is much higher than the number due to asbestos. Under definition two, however, working in an asbestos factory has the larger effect size--the probability of developing the disease is much higher. Thus, though a rare polymorphism might have a large effect (under definition two), it will explain a tiny amount of the variance in the trait simply because it is rare [2].
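To make the distinction concrete, here is a minimal Python sketch that plugs two variants into the formula from footnote [1]. The allele frequencies and per-allele effects are hypothetical numbers chosen for illustration, not estimates from any study, and the trait is assumed to be standardized to unit variance so that effects are in standard deviation units and the result is a proportion of variance.

```python
def variance_explained(p, a):
    """Additive variance contributed by one biallelic locus: 2p(1-p)a^2.

    p: allele frequency; a: per-allele effect in standard deviation units.
    Assumes additivity and a trait standardized to unit variance.
    """
    return 2.0 * p * (1.0 - p) * a ** 2


# Hypothetical common variant: 30% frequency, small effect (0.05 SD per allele).
common = variance_explained(p=0.30, a=0.05)

# Hypothetical rare variant: 0.01% frequency, an effect 20 times larger (1 SD per allele).
rare = variance_explained(p=0.0001, a=1.0)

print(f"common variant explains {common:.4%} of the variance")
print(f"rare variant explains   {rare:.4%} of the variance")
```

With these made-up inputs, the rare variant has by far the larger effect under definition two but explains roughly a fifth as much variance as the common one under definition one.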

The contention that the number of loci needed to explain the heritability of a trait will somehow be smaller if one looks at rare variation is simply false.

[1] Assuming additivity, the variance explained by a locus is 2p(1-p)a^2, where p is the allele frequency and a is half the difference between the mean trait values of the two homozygotes. See Figure 4.8 of Lynch and Walsh.
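For readers who want to see where that expression comes from, here is a short derivation sketch. It assumes Hardy-Weinberg proportions (an assumption added here, not stated above), and the symbols X and G are introduced only for this illustration: X is the number of copies of the allele an individual carries, and G is the genotypic value.

```latex
% Under Hardy-Weinberg proportions, the allele count X is Binomial(2, p):
\[
  X \sim \mathrm{Binomial}(2, p), \qquad \mathrm{Var}(X) = 2p(1-p).
\]
% With additivity, the genotypic values are -a, 0, +a for X = 0, 1, 2:
\[
  G = a\,(X - 1).
\]
% The variance explained by the locus is therefore
\[
  \mathrm{Var}(G) = a^{2}\,\mathrm{Var}(X) = 2p(1-p)\,a^{2}.
\]
```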

[2] For example, let's use the equation in [1] and assume a polymorphism has a frequency of 0.001%. Then, in order for this polymorphism to account for 0.05% of the variation in height (on the small end of the proportions accounted for by common polymorphisms identified to date), a single allele would have to increase height by a whopping 5 standard deviations.
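The arithmetic in [2] can be checked with a few lines of Python. This is just the equation in [1] solved for a, assuming the trait is measured in standard deviation units (so the total variance is 1); the function name is made up for this sketch.

```python
import math


def required_effect(p, var_explained):
    """Solve 2p(1-p)a^2 = var_explained for a, the per-allele effect in SD units."""
    return math.sqrt(var_explained / (2.0 * p * (1.0 - p)))


p = 0.001 / 100  # allele frequency of 0.001%
v = 0.05 / 100   # 0.05% of the trait variance

a = required_effect(p, v)
print(f"required per-allele effect: {a:.3f} standard deviations")
```

The result comes out at almost exactly 5 standard deviations, matching the figure quoted above.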