## Why are most genetic associations found through candidate gene studies wrong?

In a recent post, I made a blanket statement that the vast majority of candidate gene association studies published in psychiatric genetics (and indeed, in nearly all fields of genetics) are wrong. I'm not just being offhandedly dismissive; below, I outline the statistical argument behind that claim. This discussion is cribbed almost verbatim from a treatment of the issue by statisticians at the Wellcome Trust.

Let's assume that there are a finite number of loci in the genome, and we test some number of them (in a genome-wide association study, on the order of 500K-1M; in a candidate gene study, more likely in the tens, but the actual marker density is irrelevant for what follows) for association with some phenotype of interest. In general, the criterion used to decide whether one has discovered a true association is the p-value: the probability of seeing data at least as extreme as yours given that there is no association. But that's not really the quantity you're interested in. The real quantity of interest is the probability that there's a true association given the data you see: the inverse of what's being reported.

By Bayes’ Law, this probability depends on the prior probability of an association at that marker, the p-value threshold you’ve chosen to call a finding “significant”, and crucially, the power you had to detect the association [1][2]. **Thus, the interpretation of a given p-value depends on the power to detect an association, such that the lower your power, the lower the probability that a “significant” association is true** [3].

That’s where recent evidence from large genome-wide association studies comes into play. For nearly all diseases, reproducible associations have small effect size and are only detectable when one has sample sizes in the thousands or tens of thousands (for many psychiatric phenotypes, even studies with these sample sizes don’t seem to find much). The vast majority of candidate gene association studies had sample sizes in the low hundreds, and thus had essentially zero power to detect the true associations. By the argument above, in this situation the probability that a “significant” association is real approaches zero. **The problem with candidate gene association studies is not that they were only targeting candidate genes, per se, but rather that they tended to have small sample sizes and were woefully underpowered to detect true associations.**

[1] Let D be the data, T be the event that an association is true, t be the event that an association is not true, and P(T) be the prior probability that an association is true.

P(T|D) = P(D|T)P(T) / [ P(D|T)P(T) + P(D|t)(1 - P(T)) ]

P(D|T) is the power, and P(D|t) is the p-value. Clearly, both are relevant here.
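The formula in footnote [1] can be turned into a quick sketch. The prior and power values below are hypothetical, chosen only to illustrate the argument, not taken from any particular study:

```python
def prob_association_real(prior, power, alpha=0.05):
    """P(T|D) from footnote [1], with P(D|T) = power and P(D|t) = alpha."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# Underpowered candidate gene study: even with a generous 1-in-100 prior,
# a "significant" hit has well under a 1% chance of being real.
print(prob_association_real(prior=0.01, power=0.02))  # ~0.004

# Well-powered GWAS at the same marker and the same 0.05 threshold.
print(prob_association_real(prior=0.01, power=0.80))  # ~0.14
```

Note how the same p-value threshold and the same prior give posteriors differing by more than thirtyfold purely because of power, which is the point of the bolded claim above.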

[2] http://jnci.oxfordjournals.org/cgi/content/full/96/6/434#FD1

[3] As the authors note,

A key point from both perspectives is that interpreting the strength of evidence in an association study depends on the likely number of true associations, and the power to detect them, which, in turn, depends on effect sizes and sample size. In a less-well-powered study it would be necessary to adopt more stringent thresholds to control the false-positive rate. Thus, when comparing two studies for a particular disease, with a hit with the same MAF and P value for association, the likelihood that this is a true positive will in general be greater for the study that is better powered, typically the larger study. In practice, smaller studies often employ less stringent P-value thresholds, which is precisely the opposite of what should occur.

Labels: Genetics

**ben g** (June 21st, 2009 at 4:51 pm):

many of the candidate gene studies use a technique like Monte Carlo permutation tests or Bonferroni correction to address this issue. should we take those ones seriously?

**p-ter** (June 21st, 2009 at 6:39 pm):

That doesn't really address the issue; that addresses the significance threshold you choose. but the question is: given that a marker passes that significance threshold, what is the probability that it's a real association?

I guess the issue is: people worry a lot about reporting proper p-values, but the interpretation of a p-value is not the same in every study–for these phenotypes where effect sizes are small, a p-value (even a corrected p-value) of 0.05 doesn’t mean the same thing in a small study as in a large study.

**TGGP** (June 21st, 2009 at 7:16 pm):

Right now Andrew Gelman has a post titled "The sample size is huge, so a p-value of 0.007 is not that impressive".

**p-ter** (June 21st, 2009 at 7:53 pm):

yes, he's making a good point about models almost always being wrong in some way. that said, genotype-phenotype associations are like the birth ratio example he gives: the null hypothesis of no correlation between genotype and phenotype is almost always completely correct, and we're looking for really tiny correlations that denote real effects. (this is why things like population structure, which cause the null to be modeled slightly incorrectly in some cases, are such a problem)

**ben g** (June 21st, 2009 at 9:27 pm):

hmm, i thought that those tests make it so that the prior odds of a non-real association popping up in its place are taken into account. is that wrong?

**statsquatch** (June 22nd, 2009 at 1:33 pm):

Regarding Gelman's comment, if you have data sets that are "too big" you can find very low p-values (e.g., it is unlikely that you observe data that is more extreme if the null hypothesis is true) for effects that are meaningless.

I heard a grumpy statistician once say "p-values should only be shared privately between consenting statisticians." He thought the little people should stick with estimates of effect size, standard errors, and confidence intervals.

**howser** (June 22nd, 2009 at 4:18 pm):

ben g:

To echo p-ter’s point: permutation testing can account for the multiple testing issue (thereby ensuring strict control over the false positive rate, at least in the long run), but it does not circumvent the drawbacks of simplistic frequentist hypothesis testing. I’ll try to illustrate with a pathological scenario:

Imagine a world in which a bunch of labs are performing candidate gene association studies with small sample sizes and essentially no power to detect SNP effects on the phenotype of interest. Permuted p-values can ensure that, across all of these studies, only a fixed percentage of "statistically significant" associations will be false, but there is no guarantee that *any* of them will be real.

**howser** (June 22nd, 2009 at 4:42 pm):

Oops, looks like I cut off my own comment due to my lack of proficiency at html tags. Continuing…

…Hence, the scientific community can march along, with 5% of studies producing SNPs with P<0.05, and not a single one of the reported associations will hold up in a replication sample.

Conversely, imagine another world in which large samples are easy to obtain, greatly increasing the power of association studies. If we tested exactly the same SNPs as before (thereby introducing no new multiple testing issues), we would expect to see a larger proportion of SNPs with P<0.05 (assuming that there actually were true susceptibility loci in the data). And, contrary to the previous scenario, many of the reported associations would survive replication.

So, the same significance threshold can yield meaningfully different results, and this implies that a p-value alone doesn’t tell you the whole story…which makes sense! The p-value only claims to control the false positive rate, but that doesn’t tell you anything about power, which is crucial to assessing the probability that a putative association is “real”. This is why most people who have thought hard about the problem insist on using Bayes Factors or measuring the false discovery rate, both of which force you to grapple in some way with the “alternative” hypothesis/distribution.
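The two-worlds scenario above can be sketched in a few lines of simulation. All numbers are hypothetical: 50 truly associated SNPs among 10,000 tested, with per-SNP power controlled by a noncentrality parameter that grows with sample size:

```python
import random

def count_hits(n_true=50, n_null=9950, effect_z=0.5, z_thresh=1.96, seed=1):
    """Count 'significant' true and false associations.

    Each test statistic is a z-score: N(0, 1) for null SNPs and
    N(effect_z, 1) for truly associated SNPs, where effect_z scales
    roughly with effect size times the square root of sample size.
    """
    rng = random.Random(seed)
    sig_true = sum(abs(rng.gauss(effect_z, 1)) > z_thresh for _ in range(n_true))
    sig_null = sum(abs(rng.gauss(0, 1)) > z_thresh for _ in range(n_null))
    return sig_true, sig_null

# Small-sample world: almost none of the ~500 "significant" hits are real.
print(count_hits(effect_z=0.5))
# Large-sample world: same SNPs, same threshold, but nearly all 50 true
# SNPs now appear among the hits, and those are the ones that replicate.
print(count_hits(effect_z=4.0))
```

The false-positive count is roughly the same in both worlds (the threshold guarantees that); only the number of true positives mixed in with them changes, which is why the same P<0.05 means something very different in the two settings.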

**ben g** (June 22nd, 2009 at 4:51 pm):

> Permuted p-values can ensure that, across all of these studies, only a fixed percentage of "statistically significant" associations will be false, but there is no guarantee that any of them will be real.

Sorry, I don't understand this statement. If only a fixed percentage of the associations are false, how could none of them be real?

**p-ter** (June 22nd, 2009 at 4:56 pm):

> If only a fixed percentage of the associations are false, how could none of them be real?

i think that must be a typo: only a fixed percentage of the false tests will pass your threshold, but there's no guarantee that *any* of those that do pass the threshold are real.

howser, greater-than/less-than signs get interpreted as html tags…

**ben g** (June 22nd, 2009 at 5:08 pm):

ok. so why is it that the multiple-testing corrected threshold lets false associations through? isn't such a threshold created so as to avoid the problems described here, as far as non-real associations being reported as significant?

i'm a newb when it comes to these statistics, so i don't mean to belabor the question… if there's something i can read that explains why the multiple testing corrections don't address this then i'll just read that…

**howser** (June 22nd, 2009 at 5:26 pm):

ben,

Sorry about that — I was trying to provide an example that would explain why multiple testing is actually peripheral to the question at hand, and it might have actually worked if I hadn’t chopped my post in half. Perhaps the full post, now shown above in two parts, will clear things up?

Here’s a slight re-wording of the point: Multiple testing corrections prevent rampant false positives, bringing you back to the nominal false positive rate that you specify by choosing an overall significance threshold. However, some false positives will still get through, by design — you’ve simply specified the false positive rate (specifically, the probability of calling at least one false positive association “significant”) that you’re willing to tolerate by setting a significance threshold. And, if your study power is low, the false positives might be the only things that get through.

p-ter: Thanks for the correction. Typo acknowledged, and I’ll also point out that the “5% of studies” remark is wrong too.

**statsquatch** (June 22nd, 2009 at 6:11 pm):

Bonferroni and other multiple comparison adjustments will "account" for false discovery rates by making sure the threshold for significance is high. I think of a single p-value as the probability, under the null hypothesis (no association in this case), of observing results at least as extreme evidence against the null as the results actually observed. So when you calculate a p-value you assume that the effect is not real and look at the probability that you observe what you did; this is p-value = P(D|t) in the original post's notation, assuming the effect is not real. If you calculate lots of p-values in a given study, the chance that you get something extreme by chance is large, and multiple comparison corrections try to account for this by, in effect, requiring more extreme p-values. but it is always possible that the null hypothesis is true.

**p-ter** (June 22nd, 2009 at 7:17 pm):

imagine you're getting a test for a fatal rare disease, and that the test is 99% accurate. the test comes back positive. how worried should you be? not very.

why? The p-value, P(positive result | no disease), is not the same as what you want to know: P(no disease | positive result).

this same principle is at work in the association study context.
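p-ter's example can be made concrete with hypothetical numbers (a prevalence of 1 in 10,000 and a test that is 99% sensitive and 99% specific; none of these figures come from the thread itself):

```python
prevalence = 1e-4          # P(disease)
sensitivity = 0.99         # P(positive | disease) -- the "power" analogue
false_pos_rate = 0.01      # P(positive | no disease) -- the "p-value" analogue

# Bayes' Law, exactly as in footnote [1] of the post
p_disease_given_positive = (sensitivity * prevalence) / (
    sensitivity * prevalence + false_pos_rate * (1 - prevalence))
print(p_disease_given_positive)  # ~0.0098: under 1%, despite a "99% accurate" test
```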

**ben g** (June 22nd, 2009 at 8:22 pm):

> And, if your study power is low, the false positives might be the only things that get through.

Why don't they alter the correction statistics according to the power of the study, then?

p-ter,

thanks for the clarifying metaphor. is there a way to expand that metaphor so as to explain the role of the correction statistics?

**p-ter** (June 22nd, 2009 at 9:22 pm):

> thanks for the clarifying metaphor. is there a way to expand that metaphor so as to explain the role of the correction statistics?

well, maybe… imagine you want to be really, really stringent about false positives because following up a positive test costs thousands of dollars, so you come up with a test that only comes up positive in 0.00001% of people who don't have the disease. still, you can do the calculations in that link; depending on the other parameters (ie. what if it only comes up positive in 0.1% of people that *do* have the disease?) it's still likely that you don't have anything to worry about despite a positive result. i think the main point is just that multiple test corrections are intended only to allow you to have properly-calibrated p-values (ie. you want to get p<0.05 no more than 5% of the time when the null is true).

**Steve Sailer** (June 23rd, 2009 at 1:43 am):

You can insist that researchers randomly divide their samples in half, and do their own replication studies. That would reduce the false positive rate from 5% to 5% of 5%.

**bgc** (June 23rd, 2009 at 2:16 am):

Steve Sailer has the right idea: the answer is not to mess about with quibbles over the statistical analysis of single studies, but to replicate. There is no substitute.

**p-ter** (June 23rd, 2009 at 6:49 am):

well, yes, replication is important; the question is what is worth trying to replicate (the motivation for this post was an association that had conflicting reports of replication in the literature for 8 years).

and throwing out half your data (ie. reducing your power even more) to toy around with “statistical significance” is a bit silly.

**gabriel** (June 23rd, 2009 at 9:49 am):

I have a simulation of publication bias that may be of interest: http://codeandculture.wordpress.com/2009/03/17/publication-bias/

**howser** (June 23rd, 2009 at 9:58 am):

ben g,

> And, if your study power is low, the false positives might be the only things that get through.

> Why don't they alter the correction statistics according to the power of the study, then?

This is essentially what the Wellcome Trust statisticians recommended in the box that p-ter linked to in this post. The last sentence reads:

“In practice, smaller studies often employ less stringent P-value thresholds, which is precisely the opposite of what should occur.”

Here's a bit more explanation of that statement: The smaller (candidate gene) studies historically employed less stringent p-value thresholds because they assayed many fewer SNPs than modern genome-wide studies, which led to less severe multiple testing corrections. However, the argument is that the number of *individuals* in the study, not just the number of SNPs/tests, should influence the choice of significance threshold, since the number of individuals is a major determinant of power.

So, the claim is that a low-powered study (say, with hundreds rather than thousands of individuals) should require more extreme p-values than implied by the multiple testing correction alone, simply because we require more extreme evidence to believe that an association is real (and not one of those unlucky false positives) when our study's power is low.

**Derek** (June 23rd, 2009 at 10:39 am):

p-ter, I don't agree completely; sometimes you need to throw out data. It depends on constructing the original biological hypothesis correctly.

For example, imagine a GWAS of a certain type of cancer. You might have large statistical power, but your heterogeneous sample would only allow you to detect common SNPs of high penetrance. If you instead focused on a certain tumor type in individuals with the same demographics (ethnicity, other environmental risk factors for that particular tumor, etc.), you would have reduced your statistical power dramatically, but wouldn't you be much better placed to find specific genetic risk factors for that tumor type? And wouldn't that finding be much more likely to replicate?

Lots of the non-replications you talk about are related to variance in study design and differences in measurement of risk factors and outcomes. Not all studies that claim to measure the same thing actually do.

**AMac** (June 23rd, 2009 at 10:41 am):

howser (10:03am),

Does the unwarranted belief in the meaning of small P values get back to the notion that genes of strong effect are likely to be found for most common diseases, if we but look?

This concept has been slammed early and often at GNXP, on evolutionary grounds. Still, it seemed to be the common wisdom at the time that Affy and Illumina were first rolling out their high-capacity chips and lowering the per-sample prices, thus making GWASs progressively more affordable and practical.

If I *know* (or "strongly suspect") that my groundbreaking GWAS of Type II diabetes is going to identify a couple of genes that contribute to its development or severity, I might not think hard enough about the null hypotheses that P's refer to.

Re: Steve Sailer's 1:48am comment, it doesn't seem that the suggestion of halving sample size would lead to that much of an increase in spurious findings that achieve a given level of significance, compared to the numbers that would be dropped because they weren't replicated in-study.

**howser** (June 23rd, 2009 at 1:52 pm):

AMac,

I’ll start with a disclaimer, which is that I wasn’t involved in association studies when candidate genes were all the rage, so I can’t say for sure why people did the things they did. You should therefore view the rest of my comment as informed speculation.

One possible explanation for why investigators were too lenient with their significance thresholds is that they just didn’t have the statistical savvy to understand arguments like those outlined above. If so, they might have performed a multiple testing correction, which they understood to be the “correct” statistical practice, and left it at that. That’s the “naive statistics” scenario.

As you’ve pointed out, another way to justify lenient significance thresholds is to claim that common susceptibility variants of high penetrance exist, and are perhaps even widespread. These kinds of signals are the easiest ones to find in an association study, so there might be decent power to detect them (if they existed) even in quite a small study.

A third possible rationale for lenient thresholds is to claim that the prior odds of there being a real disease variant in a candidate gene region are much higher than in the rest of the genome. For example, the WTCCC argued that the prior odds against any given SNP being associated in a GWAS are about 100,000:1, and this ratio motivated their stringent significance criterion. However, if the odds were greatly improved in a candidate gene (say, to 100:1 against), you would need much less evidence to be convinced that a putative association was real.

The first of these possibilities is simply a mechanism by which things could have gone wrong, and the next two are coherent, if dubious, rationales for applying less stringent thresholds. Can either of these assumptions about genotype/disease associations be justified? In retrospect, looking at the largely unsuccessful field of candidate gene associations and what we’ve learned in GWAS, I would say that they can’t. Of course, another big point is that clear elucidation of these assumptions might have clarified what the early association studies could and could not have hoped to find.

**howser** (June 23rd, 2009 at 2:19 pm):

> If I know (or "strongly suspect") that my groundbreaking GWAS of Type II diabetes is going to identify a couple of genes that contribute to its development or severity, I might not think hard enough about the null hypotheses that P's refer to.

I agree that certain GWAS findings rise above any statistical quibbles about significance thresholds and so forth, but I think there's still a strong incentive to think hard about the more borderline associations.

For example, consider that many of the major GWAS (circa 2007) were performed in the face of strong competing studies of the same diseases. In that context it would look pretty bad to miss compelling, if not completely convincing, associations due to a sloppy statistical analysis, especially after devoting piles of money to putting the study together.

Also, I would just like to point out that significance thresholds, in the classical hypothesis testing framework, are poorly suited to modern GWAS. First, consider that these thresholds are designed to minimize the chances of calling even a single false signal "significant", whereas most GWAS don't mind a few false positives as long as there are some true positives in the mix. Second, genotyping cost structures often favor bulk rather than targeted experiments, which means that it may cost the same amount to follow up your top *N* hits (regardless of significance threshold) as to focus on a smaller number of hits that pass some nominal threshold.

Which is all just to say that, in a well-powered study, SNP *rankings* can be much more important than deciding on the "correct" p-value threshold.

**ben g** (June 23rd, 2009 at 5:19 pm):

tell me if i have this right:

multiple-testing corrections address the issue of p-values being biased by the study itself, but don't address the prior odds issue.

perhaps this is a proper extension of your metaphor, p-ter: there are 10 people who are planning on going on a boating trip together, and they need to see if any of them have a very rare (1 in a billion) contagious disease first. they're all tested for the disease, with a test that gives an accurate result 99% of the time, and the 10 hypothesis tests are corrected for multiple testing. even after that correction, it's unlikely that a person who gets a positive result is truly positive.

**p-ter** (June 23rd, 2009 at 6:21 pm):

> If you instead focused on a certain tumor type in individuals with the same demographics (ethnicity, other environmental risk factors for that particular tumor, etc.), you have reduced your statistical power dramatically but wouldn't you be much better placed to find specific genetic risk factors for that tumor type?

well, if you're interested in a certain tumor type, throwing out people who don't have that type will *increase* your power. i didn't think that's what was being suggested.

**p-ter** (June 23rd, 2009 at 6:26 pm):

> perhaps this is a proper extension of your metaphor, p-ter: there are 10 people who are planning on going on a boating trip together, and they need to see if any of them have a very rare (1 in a billion) contagious disease first. they're all tested for the disease, with a test that gives an accurate result 99% of the time, and the 10 hypothesis tests are corrected for multiple testing. even after that correction, it's unlikely that a person who gets a positive result is truly positive.

yes, that's right. the multiple testing correction ensures that if you took random groups of 10 people without the disease and tested them all, only 5% (or whatever) of those groups would show up as having the disease (whereas if you didn't do the correction, it would be more like 50%). but intuition (and statistics) suggests that even if you get a positive result, you probably don't have to worry.
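ben g's boating-trip numbers can be checked directly. A Bonferroni-style division by 10 stands in here for whatever multiple-testing adjustment is applied; the 1-in-a-billion prior and 99% accuracy come from the metaphor, everything else is illustrative:

```python
prior = 1e-9            # 1-in-a-billion disease
per_test_fpr = 0.01     # a 99% accurate test
n_people = 10
corrected_fpr = per_test_fpr / n_people   # Bonferroni-adjusted per-test rate

# group-wise false positive rate: chance that at least one of 10 healthy
# people tests positive after the correction
group_fpr = 1 - (1 - corrected_fpr) ** n_people
print(group_fpr)        # ~0.01: the correction keeps the family-wise rate low...

# ...but the posterior for any single positive is still vanishingly small
posterior = 0.99 * prior / (0.99 * prior + corrected_fpr * (1 - prior))
print(posterior)        # ~1e-6
```

This is the thread's point in miniature: the correction controls how often a group of healthy people yields a positive, but it cannot rescue the posterior when the prior is tiny.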

**nooffensebut** (June 23rd, 2009 at 8:26 pm):

If one agrees that replication is the solution, then one typically turns to meta-analyses to resolve conflicting studies. Rather than disprove an association, studies with non-significant results will combine in the meta-analysis to narrow the confidence interval. Thus, they may actually support the association. This is illustrated by the Kim-Cohen meta-analysis for the gene interaction involving MAOA. This study included the largest of the three studies that unequivocally found non-significant results out of the 13 or so addressing the 3-repeat and 4-repeat alleles in white males.

Another important point for studies of gene and environment interactions is that a narrow definition of the environmental exposure can create a bias towards the null hypothesis. After all, the subjective assessment that really counts is that of each subject because their subjective experience determines their internal milieu. That is why I would tend to place greater faith in the Sjoberg et al study, despite a relatively small sample size, because it traded the environmental exposure for an objective measurement of testosterone and found a significant interaction effect with MAOA.

I am also concerned about political bias determining what research is funded and by how much. I suspect this would affect sample size. Recall the plight of the Human Genome Diversity Project. I have sensed something similar as I researched the MAOA gene. For example, Widom identified her subjects' racial background in 1989. She added them to her 2002 study on MAOA, but did not identify any of the subjects by race this time other than to call them white or "non-white." Likewise, Ellis and Nyborg felt they needed to defend their study of racial differences in testosterone levels, even as they advised scientists to be "on guard against even the hint of any misuse of research findings." Are these studies responding to political pressure?

**Derek** (June 24th, 2009 at 7:27 am):

> well, if you're interested in a certain tumor type, throwing out people who don't have that type will increase your power. i didn't think that's what was being suggested.

P-ter, sorry, I was being facetious, but I think this is an important point that a lot of people overlook; while no sane person would advocate combining heterogeneous groups to yield larger sample sizes (and assume they increase their 'power'), it is common practice in certain fields. Take the WTCCC GWAS of bipolar disorder. Bipolar is a heterogeneous disorder, and individuals with a bipolar diagnosis can present with very different symptoms. This is a fundamental biological problem that needs to be addressed before you can accurately determine power and thresholds for statistical significance. It really wasn't surprising that nothing exciting came of it.

The same is true of candidate gene studies. Inconclusive replications are in large part due to incompatible study design: researchers think they are looking at the same disorder because they have an ad hoc measure of, for example, depression, but depressed individuals can be as different as those presenting with squamous cell carcinoma and small cell lung carcinoma; their symptoms, prognosis, risk factors, age of onset and drug response all vary greatly. Combine that with the implementation of a wide range of measurement tools (that will all define individuals a little differently), and one has a big problem on one's hands.

My point is that you wouldn’t lump generic cancer diagnoses together as we know that different tumors have different biological signatures, but researchers do that with psychiatric diagnoses all the time, then look solely for statistical reasons as to why they don’t yield consistent results.

**AMac** (June 24th, 2009 at 7:30 am):

Thinking about the 99%-specific test in the boating trip analogy:

Suppose that I’m the PI on a couple-of-hundred-patient GWAS for a common disease, using a 200,000-feature chip. rs1234 is the SNP with the lowest P-value, let’s say 10^-4. Say that 5% of controls are heterozygous for the minor allele, compared to 8% of the diseased subjects.

The P-value answers this question: "What is the probability that the observed difference in rs1234 minor allele frequency between non-sufferers and sufferers (5% cf. 8%) was due to chance?"

What this misses is that I tested 200,000 hypotheses. So a more meaningful test would answer, “When 200,000 features are queried, what is the probability that the most prominent difference in allele frequency that was observed (5% cf. 8%) was due to chance?”

* Is this restatement of ben g’s boating-trip analogy correct?

* Is the interpretation of P-values for GWASs fraught with problems for additional, independent reasons, as well?
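AMac's restatement can be quantified under an independence assumption. Treating the 200,000 tests as independent ignores LD between SNPs, so this is only a rough sketch, but it makes the multiplicity problem vivid:

```python
n_tests = 200_000
best_p = 1e-4   # the most significant single-SNP p-value in AMac's scenario

# probability that at least one of 200,000 truly null tests reaches p <= 1e-4
p_min_this_small = 1 - (1 - best_p) ** n_tests
print(p_min_this_small)  # ~1.0: a top hit at P = 1e-4 is expected by chance alone
```

In other words, under the global null the expected number of hits at this threshold is 200,000 × 1e-4 = 20, so a single P = 1e-4 top hit carries essentially no evidence.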

**p-ter** (June 24th, 2009 at 9:30 am):

> Take the WTCCC GWAS of bipolar disorder. Bipolar is a heterogeneous disorder, and individuals with a bipolar diagnosis can present with very different symptoms. This is a fundamental biological problem that needs to be addressed before you can accurately determine power and thresholds for statistical significance. It really wasn't surprising nothing exciting came of it.

oh ok, yes, in this case i totally agree. I'm not an expert in bipolar disorder, but I do think a lot of these psychiatric diseases are not well defined, and people are likely lumping together things that have entirely distinct genetic etiologies. the problem is how to know this ahead of time (perhaps better phenotyping, ie. performing association studies on a number of symptoms rather than the disease diagnosis itself, would be helpful)

**Derek** (June 24th, 2009 at 10:17 am):

P-ter, I totally agree with you; the problem is that it becomes more difficult to convince researchers of the validity of findings in studies of a specific, well-defined measured phenotype when subsequent reports claim results are inconclusive. Of course they are: statistically significant replications shouldn't arise when the phenotype is not standardized in the first place. This is fundamental, even before worrying about all the problems of statistical interpretation that are being discussed here. When such studies don't address the correct question from the outset, it just makes it more difficult to gain support for carefully designed studies which would hopefully help clear all the mess up. By not taking this seriously and dismissing what might be important breakthroughs, psychiatric geneticists are shooting themselves in the foot when they should be embracing a whole new opportunity to get to the bottom of what are very complex biological concepts. We still do not know nearly enough about psychiatric endophenotypes, and they are not as researched as they undoubtedly should be.

**howser** (June 24th, 2009 at 2:35 pm):

Derek and p-ter,

I agree with your general line of thought — I’m all for rigorous subphenotyping in GWAS, especially for psychiatric disorders — but I’m not convinced that lumping together different disorders is crazy, at least in a first-pass analysis. For example, do we really know that these disorders have “entirely distinct genetic etiologies”? I’m no expert on psychiatric genetics, but it seems plausible that distinct disorders could share some genetic risk factors; this is certainly true of autoimmune diseases, to make a loose analogy. If this were the case, an approach like the WTCCC’s would make some sense: they would have been well-powered to find such effects, if they existed, and a negative finding in this setting is still informative.

To address your specific comment, Derek: It seems to me that the problem isn’t lumping together phenotypes, per se, but comparing “lumped” studies to those with clearer phenotype definitions. Conflating the two, or looking for mutual agreement, is simply an interpretive fallacy.

**ben g** (June 24th, 2009 at 2:42 pm):

> distinct disorders could share some genetic risk factors

multivariate genetic analyses suggest that they do. depression and anxiety disorders, for example.

**howser** (June 24th, 2009 at 2:55 pm):

AMac,

> Is the interpretation of P-values for GWASs fraught with problems for additional, independent reasons, as well?

Here's one more issue with using p-values in GWAS: they are influenced by an implicit effect size distribution that varies with minor allele frequency. Specifically, p-values (under the standard trend test for SNPs in GWAS) follow the same ranking that would be obtained by performing a Bayesian analysis with a prior distribution that allows larger effects at rarer SNPs.

This may not strike you as a bad thing at first glance: simple popgen arguments about purifying selection suggest that rare alleles should be able to exist with large effect sizes that wouldn’t be sustainable at higher frequency. The problem is that most people don’t even know about this implicit dependency of effect size on minor allele frequency, and even those who do can’t change the form of the dependency. By contrast, an explicit Bayesian approach has the advantage of forcing people to be forthright about the assumptions underlying their analysis, and it also gives them the freedom to change the form of the allele frequency vs. effect size relationship (for example, we might simply want to know that our results are robust to a range of plausible assumptions that we might make about this relationship).

On a practical note, most of this doesn’t really matter at common SNPs, but it could become increasingly important as studies start to probe more of the rare variation in the genome.
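The implicit-prior point above can be made concrete with Wakefield-style approximate Bayes factors. The sketch below is not from the thread: the sample size, z-score, allele frequencies, and prior scales are all invented for illustration, and the trait is assumed quantitative with unit residual variance.

```python
import math

def approx_bayes_factor(z, V, W):
    """Wakefield-style approximate Bayes factor for H1: beta ~ N(0, W)
    against H0: beta = 0, where z = betahat / sqrt(V) and V is the
    sampling variance of the effect estimate."""
    return math.sqrt(V / (V + W)) * math.exp(z**2 / 2 * W / (V + W))

def sampling_variance(n, maf):
    """Rough sampling variance of the trend-test effect estimate for a
    quantitative trait with unit residual variance (an assumption)."""
    return 1.0 / (2 * n * maf * (1 - maf))

n, z = 5000, 6.0                       # same z-score, hence same p-value
V_common = sampling_variance(n, 0.40)  # common SNP
V_rare = sampling_variance(n, 0.02)    # rare SNP

# With a fixed prior on the effect size, two identical p-values no longer
# carry identical evidence: here the common SNP gets the larger Bayes factor.
W_fixed = 0.2**2
bf_common = approx_bayes_factor(z, V_common, W_fixed)
bf_rare = approx_bayes_factor(z, V_rare, W_fixed)

# With prior variance proportional to V (i.e. larger effects allowed at
# rarer SNPs), the Bayes factors are identical -- this is the implicit
# prior under which the p-value ranking and the Bayesian ranking agree.
k = 100.0
bf_common_prop = approx_bayes_factor(z, V_common, k * V_common)
bf_rare_prop = approx_bayes_factor(z, V_rare, k * V_rare)
```

The point of the explicit version is that the constant `k` (or any other allele frequency vs. effect size relationship) is a visible, changeable assumption rather than an accident of the test statistic.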

**p-ter**, June 24th, 2009 at 4:26 pm

howser,

yes, i agree, the impact of lumping phenotypes together depends on how shared the actual genetic factors are, which of course isn't known ahead of time (and for the record, I don't think the WTCCC approach to bipolar was unreasonable; it's only after the fact that we get to speculate about why they didn't find much). however, in general, I think a lot of effort should be put into better phenotyping.

**howser**, June 24th, 2009 at 5:41 pm

> I think a lot of effort should be put into better phenotyping.

Certainly. I would add that a lot of effort should also be devoted to *modeling* the richer phenotype information that we hope to collect; we aren't stuck with a simple choice between lumping phenotypes together or treating them all separately. As we begin to reach up past the low-hanging fruit in GWAS, it will be essential to account for the correlation structure between subphenotypes, and possibly to model directional relationships among them as well.

**Neuroskeptic**, June 25th, 2009 at 2:35 am

On a related note, PLoS One have just run a paper arguing that the more popular a field is, the less reliable are the results in it. The example they use is from molecular biology but it is clearly relevant to genetics as well. Every field has a "gene-de-jour" – in psychiatry it was 5HTTLPR at least until a few days ago.

**Derek**, June 25th, 2009 at 7:25 am

Certainly agree with the comments; if it were easy to sub-classify psychiatric disorders, it would have been done a long time ago. And yes, there will certainly be common risk factors among psychiatric, and indeed somatic, phenotypes (those affected by the HPA axis for example). I'm guessing this is where 'favorite' candidate genes came from historically. And yes, Howser, my feeling is exactly what you state, that broader-defined phenotypes should not be combined with more specific phenotype dimensions in an analytical sense. It just further muddies the already pretty muddy water. I don't work with GWAS data (so please forgive my ignorance), but it looks as though the original data is now being applied to answer more specific phenotype-related questions, which seems pretty sensible to me.

Maybe the reason there are 'genes-de-jour' is that there is actually some evidence behind them. What this leads to, though, is every research group with a sample set and a cobbled-together phenotype measure attempting replication, and then boldly claiming that there's no association when they don't find anything. This isn't helped when retrospective assessments of the literature, such as meta-analyses, use these data and give these types of findings more weight than they should. In the case of the 5HTTLPR, there's such good evidence from other fields that it is linked to stress and depression (I'm thinking of the work in primates, for example) that I don't think it should be dismissed on the back of a few non-analogous studies that don't replicate. On the plus side, maybe it will stem the tide of poorly-conceived replication attempts.

**p-ter**, June 25th, 2009 at 8:41 am

> And yes, Howser, my feeling is exactly what you state, that broader-defined phenotypes should not be combined with more specific phenotype dimensions in an analytical sense

wait, i thought everyone agreed that more phenotypes *should* be included. right now the proper methods for analysing these correlated phenotypes don't exist, but I'm sure they will soon, and some methods (eg. bayesian networks) might actually allow one to come up with causal relationships between them.

**howser**, June 25th, 2009 at 1:46 pm

> And yes, Howser, my feeling is exactly what you state, that broader-defined phenotypes should not be combined with more specific phenotype dimensions in an analytical sense
>
> wait, i thought everyone agreed that more phenotypes should be included.

I think Derek is referring to the practice of doing one study on a well-defined phenotype and another on a broad, murky phenotype, then claiming a lack of replication when the results don't line up. We all agree that additional phenotype information in any single study can only help, assuming that we're able to model it in a sensible way.
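One simple way to exploit correlated subphenotypes, short of a full Bayesian network, is a joint Wald-style test that accounts for the null correlation between them. The sketch below is illustrative only: the z-scores and the phenotype correlation are invented, and the "surprising" case is a SNP with opposite-direction effects on two positively correlated subphenotypes, where neither single-phenotype test looks impressive but the joint test does.

```python
import math

def joint_chi2(z1, z2, r):
    """2-df chi-square statistic combining per-phenotype association
    z-scores for one SNP, given correlation r between the two phenotypes
    under the null: the quadratic form z' Sigma^{-1} z with
    Sigma = [[1, r], [r, 1]]."""
    return (z1**2 - 2 * r * z1 * z2 + z2**2) / (1 - r**2)

def p_two_sided(z):
    """Two-sided normal p-value for a single z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_chi2_2df(stat):
    """Survival function of the chi-square distribution with 2 df,
    which has the closed form exp(-x/2)."""
    return math.exp(-stat / 2)

# Hypothetical SNP: modest opposite-direction effects on two
# positively correlated subphenotypes.
z1, z2, r = 2.0, -2.0, 0.6
stat = joint_chi2(z1, z2, r)   # = 20.0
p_joint = p_chi2_2df(stat)     # ~ 4.5e-5
p_single = p_two_sided(z1)     # ~ 0.0455
```

Directional or causal structure (the Bayesian networks mentioned above) is a further step beyond this; the sketch just shows that even the simplest joint treatment of correlated phenotypes can change which SNPs look interesting.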