Another genetics of skin color review…


here. Like the study Razib linked to a couple days ago, this one looks for signatures of selection in a number of genes suspected to play a role in the generation of natural human skin color variation. And also like the previous study, they find that different genes are implicated in the derived light skin color of East Asians and northern Europeans (see the figure on the left for a crude representation of this).


6 Comments

  1. I don’t have access to the paper, but do they use Tajima’s D to find selection using the hapmap data? I have seen one of these authors present such an analysis. Using Tajima’s D on hapmap data means you don’t understand how site frequency spectra work.

  2. they use three statistics– Fst, tajima’s D, and the log of the ratio of heterozygosity.  
     
    to judge significance, they compare the statistics in their genes to a number of windows throughout the genome. this should probably be a decent control for demographics, but the word “ascertainment” never appears in the paper.  
     
    I didn’t really think this through– I imagine using tajima’s D on hapmap data is useless, but what about the other two statistics?

  3. Tajima’s D isn’t useless on the HapMap data-set – you just use an empirical distribution (i.e. the genome-wide data-set) to estimate significance rather than the standard theoretical distribution (which has always been of uncertain utility anyway, since the underlying demographic assumptions are clearly wrong for humans). Assuming that all loci are affected equally by demography and only a subset of loci have been subject to recent positive selection, this should be a valid approach for detecting recent sweeps. 
     
    A bigger problem, IMO, is that Tajima’s and other tests are only useful when the sweep has proceeded to close to fixation. It’s likely that many variants under recent positive selection aren’t anywhere near fixation – Tajima’s won’t see these at all. However, these should be detected by the linkage-based methods developed recently (see here and here).

  4. Never use Tajima’s D without doing a coalescent simulation to determine the distribution given your expected demographic model. I’ve had that hammered into my head enough over the past year that I can’t get it out. This applies no matter where your data come from (resequencing or previously identified SNPs). I’m not a big fan of using the empirical distribution of the data to find outliers. 
     
    But my criticism of using Tajima’s D on hapmap data is because of the nature of the SNPs — they are determined a priori and suffer from ascertainment bias. This means that rare polymorphisms tend to be excluded from the data. Tajima’s D detects selective sweeps by looking for an excess of rare polymorphisms, so you can see how this would introduce a problem. Also, I think you’ll end up finding more evidence for selection in the populations from which you found your SNPs. 
     
    I like the linkage based analysis that they’re using, as that seems to be the best way to approach the hapmap data.

  5. I’m not a huge fan of the outlier approach either – but in the absence of the information about the demographic history of humans required to generate appropriate theoretical distributions (via coalescent sims) it seems like a reasonable halfway house. In a few years we’ll have the requisite population genetic data to reconstruct human demographic history in some detail, and then we can start using theoretical distributions with confidence. Until then, outliers it is. 
     
    The linkage-based approaches seem solid, although I’ve run into some trouble using them in practice – local variation in recombination rate creates a great deal of noise (at least, more than I expected to see). The first-pass genome scans currently being published are cluttered with false positives, and they’re also missing at least a few important regions – I know this because I’ve found at least one clear signal using my own algorithms that’s been missed by all the published genome-wide scans. 
     
    But imagine the situation in ten to twenty years time – we should have genome-wide complete resequencing data for enough humans from enough ethnic groups to pull out virtually every region that’s been subject to recent local or global selection. By then, the algorithms will be good enough to dissect out selection with much higher accuracy, and genome annotation will be good enough to assign functions to nearly all of those loci. We’ll basically have a rough history of nearly every selective pressure that humans have faced – with estimates of age and selective strength – for the last few 100 KY. Exciting times…

  6. Tajima’s D isn’t useless on the HapMap data-set – you just use an empirical distribution (i.e. the genome-wide data-set) to estimate significance rather than the standard theoretical distribution 
     
    like RPM says, it’s not the demography that’s a problem, it’s ascertainment. there was no standard for including SNPs in the hapmap, so the frequency spectrum for SNPs in different regions might be variable for no reason (other than ascertainment).  
     
    local variation in recombination rate creates a great deal of noise 
     
    check out Voight et al. (2006) in PLoS Biology. the use of the two haplotypes as internal controls is a way around this problem.  
     
    I’m a big fan of outlier-based approaches–in disease genetics, the move of parameter-based to non-parametric statistics was a revolution; i predict it will be the same in popgen.
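
The two points argued in this thread, ascertainment bias and empirical outliers, can be made concrete with a small sketch. The code below computes Tajima's D from per-site derived-allele counts using the standard Tajima (1989) constants, then shows what happens when rare variants are dropped. The function names (`tajimas_d`, `drop_rare`, `empirical_lower_tail`), the toy allele counts, and the minor-allele-count cutoff are all assumptions made for illustration; the cutoff is only a crude stand-in for ascertainment and is not how HapMap SNPs were actually chosen, nor is any of this the paper's method.

```python
import math


def tajimas_d(derived_counts, n):
    """Tajima's D from per-site derived-allele counts in n sampled chromosomes."""
    if n < 4:
        raise ValueError("need at least 4 chromosomes for a usable variance")
    sites = [k for k in derived_counts if 0 < k < n]  # segregating sites only
    s = len(sites)
    if s == 0:
        raise ValueError("no segregating sites")
    # pi: mean number of pairwise differences, summed over sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in sites)
    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between pi and Watterson's estimate, scaled by its std. dev.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def drop_rare(derived_counts, n, min_minor_count):
    """Hypothetical ascertainment filter: keep sites whose minor allele
    count is at least min_minor_count."""
    return [k for k in derived_counts if min(k, n - k) >= min_minor_count]


def empirical_lower_tail(candidate, genome_wide):
    """Fraction of genome-wide window values at or below the candidate value."""
    return sum(1 for v in genome_wide if v <= candidate) / len(genome_wide)


# Toy data (made up): 10 chromosomes, a window with many singletons,
# the pattern expected after a recent sweep.
n = 10
full = [1] * 10 + [2] * 2 + [5]
ascertained = drop_rare(full, n, 2)

d_full = tajimas_d(full, n)
d_ascertained = tajimas_d(ascertained, n)
print("D with all sites:     ", round(d_full, 3))
print("D after dropping rare:", round(d_ascertained, 3))
```

In this toy window D is negative when every site is counted and shifts upward once singletons are filtered out, which is the problem raised in comment 4: the excess of rare variants that the test looks for is exactly what ascertainment removes. `empirical_lower_tail` is the outlier approach in its simplest form, ranking one window's statistic against all the others; it only helps if ascertainment affected the candidate window and the comparison windows in the same way, which is the caveat raised in comment 6.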
