Reanalysing gene expression differences between populations


Early this year, I commented on a paper showing large differences in gene expression between Europeans and Asians. A letter to the editor in this week’s Nature Genetics points out a major flaw in part of their analyses.

Expression arrays are tricky tools– they don’t provide a measure of absolute mRNA levels, but rather an output that corresponds to the binding affinities of the mRNA, the ambient conditions, the way the mRNA was handled, and absolute mRNA levels (and a billion and one other things). Study design is extremely important in isolating the effect of the variable you’re really interested in (mRNA levels), and it’s very difficult, if not impossible, to really compare the raw data from one array experiment with that from another.

The error the authors made is an unfortunate (and pretty elementary) one– they did the array experiments on the European population in 2003-2004, and the array experiments on the Asian population in 2005-2006 (they actually erroneously claimed in the paper that the samples were randomized with regard to year, which would explain why it got past peer review). This means that any variation between the European and Asian populations is perfectly confounded with variation between those two batches. There’s no way to correct for this; any difference in mean expression between the two populations is due to a mixture of the “real” effects and the bias from the batch effect. That’s a bitch.
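To make the confounding concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the function name `simulate`, the size of the batch shift, the noise level, and the sample sizes are all assumptions, not values from the paper. It simulates one gene with no real population difference, and estimates the difference under a confounded design (each population run in its own batch) and under a randomized one.

```python
import random
import statistics


def simulate(design, n_per_pop=200, true_effect=0.0, batch_shift=1.0, seed=0):
    """Estimate the population difference (B minus A) for one hypothetical gene.

    design      -- "confounded": all of A in batch 0, all of B in batch 1;
                   anything else: each sample assigned to a batch at random.
    true_effect -- the real expression difference between populations.
    batch_shift -- an assumed technical offset added to every batch-1 array.
    """
    rng = random.Random(seed)
    pops = ["A"] * n_per_pop + ["B"] * n_per_pop

    if design == "confounded":
        batches = [0 if pop == "A" else 1 for pop in pops]
    else:
        batches = [rng.randint(0, 1) for _ in pops]

    values = {"A": [], "B": []}
    for pop, batch in zip(pops, batches):
        signal = true_effect if pop == "B" else 0.0
        noise = rng.gauss(0.0, 0.5)  # array-to-array noise (assumed)
        values[pop].append(signal + batch_shift * batch + noise)

    return statistics.mean(values["B"]) - statistics.mean(values["A"])


if __name__ == "__main__":
    # The true effect is zero in both runs; only the design differs.
    print("confounded design:", round(simulate("confounded"), 2))
    print("randomized design:", round(simulate("randomized"), 2))
```

In the confounded run the estimate lands near the batch shift even though the true effect is zero, and nothing in the data can tell the two apart. In the randomized run the batch shift shows up as extra within-population spread, and the estimate stays near the truth.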

Luckily, the authors also did additional analyses (as they point out in their reply)– they looked at the correlation of expression levels with genotypes. In the figure, you see the population distributions of expression for a given gene on the left, and the within-genotype levels on the right. There doesn’t seem to be much of a difference between the two populations within each genotype class; instead, the population difference is explained almost entirely by the difference in allele frequency between the two populations.
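The arithmetic behind that last point can be sketched in a few lines. The numbers here are hypothetical (allele frequencies of 0.2 and 0.8, and mean expression of 1, 2 and 3 for the three genotype classes), and the sketch assumes Hardy-Weinberg genotype proportions, which the post does not claim. The point is only that identical within-genotype expression still produces different population means once allele frequencies differ.

```python
def genotype_freqs(p):
    """Genotype frequencies for 0, 1 and 2 copies of an allele at frequency p.

    Assumes Hardy-Weinberg proportions (an assumption made for this sketch).
    """
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}


def population_mean(p, genotype_means):
    """Mean expression in a population, averaging over its genotype classes."""
    freqs = genotype_freqs(p)
    return sum(freqs[g] * genotype_means[g] for g in freqs)


# Hypothetical within-genotype mean expression, the SAME in both populations.
genotype_means = {0: 1.0, 1: 2.0, 2: 3.0}

# Hypothetical allele frequencies in the two populations.
mean_pop1 = population_mean(0.2, genotype_means)
mean_pop2 = population_mean(0.8, genotype_means)

if __name__ == "__main__":
    print("population 1 mean:", round(mean_pop1, 2))
    print("population 2 mean:", round(mean_pop2, 2))
```

Because the comparison is made within genotype classes, this kind of analysis is less exposed to a batch effect than a raw comparison of population means, which is why the allele frequency conclusion survives.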

So was their claim of finding nearly 25% of all genes differentially expressed between the two populations likely wrong? Yes. But their conclusion that allele frequency differences play a role in expression differences between populations stands– it will just take a better-designed study to quantify the effect.


4 Comments

  1. Could you expand a little on what introduces the batch variation? I’m assuming there’s some technical issue here I’m missing because I don’t know anything about expression chips. It just seems to me that if there’s a lot of batch variation that this should also cause problems for interpreting results gathered from any single batch.

  2. It just seems to me that if there’s a lot of batch variation that this should also cause problems for interpreting results gathered from any single batch. 
     
    this is true, which is why study design is so important (true for any experiment, of course, but people have learned the hard way about microarrays in particular).  
     
    the relevant variables are impossible to count– atmospheric pressure, differences in the chip coming from the manufacturer, time of day, person amplifying the RNA, machine used to read the intensity…I could go on. The key is to control the variables that are easy to control and randomize those you can’t (including those you haven’t thought of).  
     
    Then, when you compare condition A to condition B (note microarrays are *not* very useful as measures of absolute gene expression, only of *relative* gene expression), the unknown variables only contribute to the within-class variance (not the mean), and intensity(A)/intensity(B) is an unbiased estimate of the relative expression of the gene in the two conditions. any systematic study design issue (you did all the arrays for condition A at 9AM and all the arrays for condition B at 4PM, for example) ends up confounded with the condition effect, which is what you don’t want.  
     
    this isn’t an issue with technologies like SAGE, which can get at absolute levels of expression.

  3. “The key is to control the variables that are easy to control and randomize those you can’t”: that’s a reasonable working definition of Science.

  4. Thanks for the details, p-ter. Yeesh, anything that’s such a bitch to work with is probably (hopefully?) going to be a fairly short-lived technology.
