Sibling design or bust for polygenic risk scores

Unless you have been sleeping under a rock, you are aware of the constant discussion about “polygenic risk scores” (PRS) and residual “population stratification.” The basic issue is that when a risk profile is generated by summing lots of small genetic effects, the inference of those effects can be skewed by correlations between the environment and genome-wide ancestry. To “correct” for this, researchers traditionally look at the relatedness between individuals and their population identity, and “account” for that in the model (an older and more primitive method is simply to remove outliers from homogeneous populations and assume there is no stratification).
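
To make the problem concrete, here is the usual additive way of writing it down. This is a sketch, not anything taken from the preprint: the symbols (g for genotype dosage, β for per-variant effect, e for the non-genetic environmental term, M for the number of variants) are my own labels, and the bias expression is the textbook omitted-variable result for a single-variant regression, setting aside correlations among variants.

```latex
% Phenotype of individual i as a sum of many small genetic effects plus environment
y_i = \sum_{j=1}^{M} \beta_j \, g_{ij} + e_i

% Polygenic score built from the estimated effects
\mathrm{PRS}_i = \sum_{j=1}^{M} \hat{\beta}_j \, g_{ij}

% If e is correlated with ancestry, and hence with genotype, each estimate picks up an extra term
\mathbb{E}\big[\hat{\beta}_j\big] \approx \beta_j + \frac{\mathrm{Cov}(g_j, e)}{\mathrm{Var}(g_j)}
```

Each extra term can be tiny, but the score sums over many variants, so the small biases can add up.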

A new preprint, “Demographic history impacts stratification in polygenic scores,” suggests limitations in current methods:

Large genome-wide association studies (GWAS) have identified many loci exhibiting small but statistically significant associations with complex traits and disease risk. However, control of population stratification continues to be a limiting factor, particularly when calculating polygenic scores where subtle biases can cumulatively lead to large errors. We simulated GWAS under realistic models of demographic history to study the effect of residual stratification in large GWAS. We show that when population structure is recent, it cannot be fully corrected using principal components based on common variants—the standard approach—because common variants are uninformative about recent demographic history. Consequently, polygenic scores calculated from such GWAS results are biased in that they recapitulate non-genetic environmental structure. Principal components calculated from rare variants or identity-by-descent segments largely correct for this structure if environmental effects are smooth. However, even these corrections are not effective for local or batch effects. While sibling-based association tests are immune to stratification, the hybrid approach of ascertaining variants in a standard GWAS and then re-estimating effect sizes in siblings reduces but does not eliminate bias. Finally, we show that rare variant burden tests are relatively robust to stratification. Our results demonstrate that the effect of population stratification on GWAS and polygenic scores depends not only on the frequencies of tested variants and the distribution of environmental effects but also on the demographic history of the population.

The authors simulated the level of population stratification found within Britain, whose population is very genetically homogeneous on a worldwide scale. But even here, looking at common variants of the kind you’d find on most SNP arrays, there are serious problems with uncorrected stratification. The logic totally makes sense: rare variants capture recent structure, so corrections built only on common variants fail to account for those recent demographic scenarios.

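Really the only way to fix this situation is a sibling design. Researchers who are curious about quantitative traits need to assemble as many sibling cohorts as they can, and estimate effects there. Even using SNPs identified elsewhere may cause issues: per the preprint, re-estimating effect sizes in siblings for variants ascertained in a standard GWAS reduces but does not eliminate bias.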
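
Why siblings help can be seen in a toy simulation. This is a minimal sketch under my own assumptions, not the preprint's simulation: two subpopulations with made-up allele frequencies (0.3 and 0.5), an environmental shift that tracks subpopulation, a single variant whose true effect is zero, and hypothetical function names (`simulate_sib_pairs`, `slope`). Siblings share both their parents' ancestry and the family environment, so regressing the sibling phenotype difference on the sibling genotype difference cancels the confounded part.

```python
import numpy as np

def simulate_sib_pairs(n_families=20000, beta=0.0, env_shift=1.0, seed=0):
    """Simulate one variant in sibling pairs from two stratified subpopulations."""
    rng = np.random.default_rng(seed)
    # Subpopulation label per family (0 or 1); assumed, illustrative structure.
    pop = rng.integers(0, 2, size=n_families)
    # Allele frequency differs by subpopulation (made-up values).
    freq = np.where(pop == 0, 0.3, 0.5)
    # Parental genotypes: 0, 1 or 2 copies of the allele.
    mother = rng.binomial(2, freq)
    father = rng.binomial(2, freq)

    def child():
        # Each parent transmits the allele with probability genotype / 2.
        return rng.binomial(1, mother / 2) + rng.binomial(1, father / 2)

    g1, g2 = child(), child()
    # Environment depends on subpopulation and is shared by both siblings.
    family_env = env_shift * pop
    y1 = beta * g1 + family_env + rng.normal(size=n_families)
    y2 = beta * g2 + family_env + rng.normal(size=n_families)
    return g1, g2, y1, y2

def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (x @ x))

g1, g2, y1, y2 = simulate_sib_pairs()
# Population-style estimate: one sibling per family, no stratification correction.
pop_est = slope(g1, y1)
# Sibling-difference estimate: the shared family environment cancels out.
sib_est = slope(g1 - g2, y1 - y2)
print(f"true effect 0.0 | population estimate {pop_est:.3f} | sibling estimate {sib_est:.3f}")
```

With the true effect set to zero, the population-style estimate comes out clearly positive because genotype and environment both track subpopulation, while the sibling-difference estimate sits near zero. The price is precision: only within-family genetic variation is used, which is why you need as many sibling pairs as you can get.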


One thought on “Sibling design or bust for polygenic risk scores”

  1. So say I train a polygenic model for cognitive ability on the UK Biobank dataset. This paper suggests information about UK population structure will leak in *if I apply the model to UK subjects*. But what if I apply the model to US or Canadian (European-descent) subjects? Yes, SNPs that reflect UK population structure will continue to contribute to the polygenic score, but this should introduce noise, not bias.

    Or should we assume that UK population structure is recapitulated in US/Canada population structure, so the polygenic score will be biased after all?
