An argument for searching for rare variants in human disease


Based on the comments on my previous post, I’m going to lay out an argument for sequencing studies in human disease that I find reasonable:

Let’s follow Goldstein’s back-of-the-envelope calculations: assume there are ~100K polymorphisms that contribute to human height (assuming Goldstein isn’t making the mistake I attribute to him, this count includes both common and rare polymorphisms), that we’ve already found the ones that account for the largest fractions of the variance, and that these fractions of variance follow an exponential distribution.
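
To make that back-of-the-envelope setup concrete, here is a minimal sketch (my own illustration, not Goldstein’s actual calculation; the ~100K count and the exponential form are just the assumptions above) of how much of the heritable variance the top-ranked variants account for under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_variants = 100_000                      # assumed number of causal polymorphisms
shares = rng.exponential(size=n_variants)
shares = np.sort(shares)[::-1]            # rank variants by the variance they explain
shares /= shares.sum()                    # normalize to fractions of the heritable variance

# Cumulative fraction of the heritable variance explained by the top k variants
for k in (100, 1_000, 10_000, 100_000):
    print(f"top {k:>7,}: {shares[:k].sum():.1%}")
```

The numbers printed are purely illustrative; the point is that with an exponential distribution spread over ~100K variants, the top of the ranking accounts for a small share of the heritable variance and the long tail for most of it.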

Now, assume you have assembled a cohort of 5000 individuals and done a genome-wide association study using common SNPs. You find some interesting things, but you want more. You now have two choices: sequence those 5000 individuals to look for rarer variation, or increase the sample size to 20,000 and perform another association study using the same set of common polymorphisms.

As Daniel Macarthur points out, you’ve not yet sucked every drop of marrow out of those 5000 individuals: there are presumably some (many?) rarer SNPs that have modest effect sizes (in sense 2 from this post), and thus account for measurable (though still small) fractions of the variance in your trait. Those are low-hanging fruit for you to find if you pony up the cash for some sequencing (the price of which keeps dropping). This is especially true if there are more rare variants than common ones that influence the trait, as is likely the case (there’s more rare variation than common variation overall). So instead of spending on scaling up your sample size, spend on sequencing, and have impact now.

Is this along the lines of the argument Goldstein is making? I don’t really think so, but I welcome comments. In any case, the choice above is somewhat arbitrary: if you want to look for very rare variation, you need a sample size larger than 5000 anyways, and if you’re sequencing, you’re obviously not just going to look at the rare variants, since the common ones come along for free.


7 Comments

  1. Seems to me that if you think that the causal alleles are new, you’d want to spend your money doing the same SNP panel on siblings. You’d be betting that you can bootstrap pedigree information to find rare haplotypes linked to your phenotypic variation.  
     
    You could spend the same money resequencing, in which case you’ll be betting that the rare causal alleles aren’t linked to rare haplotypes you could find with a SNP panel.  
     
    I’m assuming you’ll be springing for the resequencing once you find gene candidates anyway.

  2. what are the main reasons why whole-genome sequencing is not being employed already?

  3. it’s still way too expensive. I think by the end of this year Illumina should be able to do full resequencing of a human genome at acceptable coverage for $10K (maybe less?). a SNP chip costs $250.

  4. A lot of the cost of any human study is recruiting, getting consent forms signed, etc. It would make sense to write the consent forms in such a way to make sure that you could go back and do ever more intensive sequencing of the sample DNA every so often as it gets cheaper.

  5. we’ve found the ones that account for the largest fractions of the variance, and that these fractions of variance follow an exponential distribution 
     
     Not sure about the first part of that sentence – we haven’t necessarily found low-frequency SNPs that account for a substantial chunk of the variance. For instance, a 1% variant would be largely invisible to current GWAS, even if it had a large effect size (let’s say a per-allele increase of 1 SD, or 2 inches of height, which would mean it explained 2% of the total variance in height – very respectable compared to most common variants). No doubt we’ve already picked up some of these through tagging (in which case they would look like common SNPs with small effects), but there would be plenty that we’ve missed. (A worked version of that 2% calculation is sketched after the comments.)
     
    So we have a whole swathe of variants for which we don’t currently know the empirical distribution of effect sizes, but for which there are fairly good theoretical reasons (i.e. selection against common risk variants) to expect larger per-allele effects. Perhaps Goldstein believes that the per-allele effect sizes of these rare variants won’t follow the same exponential distribution, in which case fewer overall variants would be required and sequencing will have a dramatically better yield than GWAS. I don’t know enough quant genetics to know if this is at all plausible. 
     
     As for Goldstein’s overall argument – my impression is that he wants the money currently going into ever-larger GWAS (now approaching 100,000 individuals for some diseases/traits, with an overall cost probably exceeding $1000 per individual counting labour, infrastructure, etc.) to be diverted instead into both the development and the application of sequencing technology.
     
    I’m fairly skeptical about this argument; there isn’t necessarily a conflict between the two approaches. Performing GWAS while sequencing technology matures (which is happening incredibly rapidly anyway) seems a good way to go; it provides some yield in terms of risk variants, and it justifies to funding bodies the collection of the large, well-phenotyped sample sets that will be required for sequencing anyway. (And related to Steve’s point: most GWAS consent forms now include broad consent for other analyses including whole-genome sequencing.) 
     
    One last point: 
     
    if you want to look for very rare variation, you need a sample size larger than 5000 anyways 
     
     Not necessarily. This is definitely true if you want to obtain significance for a single variant, but assuming that rare disease-causing variants tend to cluster in disease-related genes you don’t necessarily need to do this; you can just treat each protein-coding gene as a unit, aggregate all of the low-frequency coding/regulatory variants within it, and treat them as a single common variant. This is insensitive for all sorts of reasons, but it might pick up some low-hanging fruit while you wait to add more patients to your cohort. (A minimal sketch of this collapsing approach appears after the comments.)
     
    The power of this type of approach (under various assumptions) is theoretically pretty good even with comparable sample sizes to current GWAS – see e.g. Figure 2 in this review.

  6. For instance, a 1% variant would be largely invisible to current GWAS, even if it had a large effect size (let’s say a per-allele increase of 1 SD, or 2 inches of height, which would mean it explained 2% of the total variance in height – very respectable compared to most common variants) 
     
     I see. Post-1K Genomes Project, the next generation of SNP chips will include things of this frequency, right? So to some extent, you don’t need to sequence your cohort to get these?

  7. Sure – chips are already being designed based on 1KG pilot data. Given Illumina’s speedy production cycle I’d guess these will be available in late 2009, but bear in mind that pilot data = low coverage sequencing of 60 individuals per HapMap population, so it will still be missing a pretty large proportion of the 0.1-1% variants. 
     
    To get a better handle on those we’ll have to wait for the final 1KG data (500 individuals per pop, including high coverage of all exons), which I guess would be converted into chips some time in 2010. Of course, if sequencing costs keep falling at current rates we may never end up using those chips…
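
A small aside on the arithmetic in comment 5 above: under Hardy–Weinberg proportions and an additive model (assumptions I’m adding for this sketch, not stated by the commenter), a biallelic variant with allele frequency p and a per-allele effect of β phenotypic standard deviations explains a fraction 2p(1−p)β² of the variance. A minimal check of the 1%-frequency, 1-SD example:

```python
def variance_explained(p, beta_sd):
    """Fraction of phenotypic variance explained by one biallelic variant.

    Assumes Hardy-Weinberg proportions and an additive model; beta_sd is the
    per-allele effect in phenotypic standard deviations.
    """
    return 2 * p * (1 - p) * beta_sd ** 2

# The example from comment 5: a 1% variant with a per-allele effect of 1 SD
print(variance_explained(0.01, 1.0))  # 0.0198, i.e. roughly 2% of the variance
```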

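And here is a minimal sketch of the gene-collapsing idea from comment 5 (my own illustration under simple assumptions, not a specific published method; the function names and toy data are hypothetical): pool the low-frequency variants in a gene into a single per-individual carrier indicator, then test that indicator against case/control status as if it were one common variant.

```python
from scipy.stats import fisher_exact

def collapse_gene(genotypes, freqs, max_freq=0.01):
    """Collapse the rare variants in one gene into a per-individual carrier flag.

    genotypes: one list per variant of 0/1/2 alternate-allele counts,
               all over the same individuals.
    freqs:     alternate-allele frequency of each variant (e.g. from a reference panel).
    max_freq:  only variants at or below this frequency are pooled.
    """
    carriers = [0] * len(genotypes[0])
    for geno, freq in zip(genotypes, freqs):
        if freq <= max_freq:
            for i, g in enumerate(geno):
                if g > 0:
                    carriers[i] = 1
    return carriers

def burden_test(carriers, is_case):
    """2x2 Fisher's exact test of carrier status against case/control status."""
    table = [[0, 0], [0, 0]]       # rows: non-carrier/carrier; columns: control/case
    for carrier, case in zip(carriers, is_case):
        table[carrier][1 if case else 0] += 1
    return fisher_exact(table)     # returns (odds ratio, p-value)

# Toy data: 3 rare variants in one gene, 8 individuals (4 cases, then 4 controls)
genos = [[0, 1, 0, 0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0]]
freqs = [0.005, 0.008, 0.002]      # hypothetical frequencies from a reference sample
cases = [True, True, True, True, False, False, False, False]

odds_ratio, p_value = burden_test(collapse_gene(genos, freqs), cases)
print(odds_ratio, p_value)
```

This is the crudest presence/absence version of the idea; the commenter’s caveat about insensitivity applies, since it ignores allele counts, effect direction, and variant annotation.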