Sibling design or bust for polygenic risk scores

Unless you have been living under a rock, you are aware that there is constant discussion about “polygenic risk scores” (PRS) and residual “population stratification.” The basic issue is that when a risk profile is generated by summing up lots of small genetic effects, the inference of those effects can be skewed and shaped by correlations between the environment and genome-wide ancestry. To “correct” for this, researchers traditionally look at the relatedness between individuals, and their population identity, and “account” for that in the model (an older and more primitive method is simply to remove outliers from homogeneous populations and assume there is no stratification).
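To make "summing up lots of small genetic effects" concrete, here is a minimal sketch of how a score is computed for one person. The effect sizes and genotypes are made-up illustrative numbers, not values from any study, and `polygenic_score` is a hypothetical helper name.

```python
def polygenic_score(dosages, effect_sizes):
    """Sum of per-variant effect sizes, each weighted by allele dosage (0, 1, or 2)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical GWAS effect sizes for five variants (illustrative only).
betas = [0.02, -0.01, 0.03, 0.005, -0.015]
# Hypothetical genotype: number of effect alleles carried at each variant.
person = [2, 0, 1, 1, 2]

score = polygenic_score(person, betas)
print(score)  # approximately 0.045
```

The stratification worry is about the `betas`: if they were estimated in a sample where ancestry tracks environment, every term in the sum can carry a small bias, and those biases add up.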

A new preprint, Demographic history impacts stratification in polygenic scores, suggests limitations in current methods:

Large genome-wide association studies (GWAS) have identified many loci exhibiting small but statistically significant associations with complex traits and disease risk. However, control of population stratification continues to be a limiting factor, particularly when calculating polygenic scores where subtle biases can cumulatively lead to large errors. We simulated GWAS under realistic models of demographic history to study the effect of residual stratification in large GWAS. We show that when population structure is recent, it cannot be fully corrected using principal components based on common variants—the standard approach—because common variants are uninformative about recent demographic history. Consequently, polygenic scores calculated from such GWAS results are biased in that they recapitulate non-genetic environmental structure. Principal components calculated from rare variants or identity-by-descent segments largely correct for this structure if environmental effects are smooth. However, even these corrections are not effective for local or batch effects. While sibling-based association tests are immune to stratification, the hybrid approach of ascertaining variants in a standard GWAS and then re-estimating effect sizes in siblings reduces but does not eliminate bias. Finally, we show that rare variant burden tests are relatively robust to stratification. Our results demonstrate that the effect of population stratification on GWAS and polygenic scores depends not only on the frequencies of tested variants and the distribution of environmental effects but also on the demographic history of the population.

The authors simulated the level of population stratification within the British…which is a very genetically homogeneous group on a worldwide scale. But even here, looking at common variants, as you’d find on most SNP-arrays, there are serious problems with uncorrected stratification. The logic totally makes sense: rare variants capture recent structure, so principal components built from common variants do not account for recent demographic scenarios.

Really the only way to fix this situation is a sibling design. Researchers who are curious about quantitative traits need to assemble as many sibling cohorts as they can, and look there. Even using SNPs identified elsewhere may cause issues.
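Why siblings help can be shown with a toy simulation. This is only a sketch under strong assumptions, not the preprint's method: two hypothetical subpopulations that differ in both allele frequency and environment, and a single variant with zero true effect. A population-wide regression picks up a spurious effect, while regressing sibling differences in phenotype on sibling differences in genotype does not, because siblings share the same family environment and ancestry.

```python
import random


def slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def parent_genotype(freq, rng):
    """Dosage (0, 1, or 2) drawn from a population with the given allele frequency."""
    return (rng.random() < freq) + (rng.random() < freq)


def child_genotype(mom, dad, rng):
    """Each parent transmits the allele with probability dosage / 2."""
    return (rng.random() < mom / 2) + (rng.random() < dad / 2)


def simulate(n_families=20000, seed=1):
    rng = random.Random(seed)
    pop_g, pop_y, diff_g, diff_y = [], [], [], []
    for i in range(n_families):
        # Assumed setup: subpopulation B has a higher allele frequency
        # and a higher environmental mean than subpopulation A.
        in_b = i % 2 == 1
        freq = 0.6 if in_b else 0.3
        env = 1.0 if in_b else 0.0
        mom, dad = parent_genotype(freq, rng), parent_genotype(freq, rng)
        g1 = child_genotype(mom, dad, rng)
        g2 = child_genotype(mom, dad, rng)
        # The variant has NO causal effect: phenotype is environment plus noise.
        y1 = env + rng.gauss(0, 1)
        y2 = env + rng.gauss(0, 1)
        pop_g.append(g1)
        pop_y.append(y1)
        diff_g.append(g1 - g2)
        diff_y.append(y1 - y2)
    return slope(pop_g, pop_y), slope(diff_g, diff_y)


population_estimate, sibling_estimate = simulate()
print("population estimate:", population_estimate)
print("sibling-difference estimate:", sibling_estimate)
```

In this setup the population estimate is clearly positive even though the true effect is zero, and the sibling-difference estimate sits near zero. The cost is statistical power: only families where the siblings differ in genotype contribute information.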

Let the genomic die fly!


A new “polygenic risk score” (PRS) paper is making some waves, Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Since it is open access I suggest you read it.

But basically, they took ~2 million common variants (there are about 100 million common variants in the world population) in ~300,000 individuals in 4 cohorts, and used them to predict weight with a genome-wide polygenic score. The correlation of the score with BMI is 0.29. This is pretty modest. But it seems to me that the biggest and most important finding is that the score captures a lot of the people at the tails of the distribution.

I’m becoming more and more convinced that the best thing these polygenic scores can do in the near-term is to identify people who are possibly at these tails. For diseases tied to complex traits, the tails are where you find a lot of the people who are going to have issues later in life. People with BMI in the range 25-30 may have a modest increase in risks, but someone who is very obese, with BMI above 35, is at much greater risk. Over 40% of the people in the top decile here were obese. Only 10% of people in the bottom decile were.
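A modest correlation can still separate the tails, and a quick simulation shows why. This is a sketch under assumptions the paper does not make for me: score and trait are treated as jointly normal with correlation 0.29, and "obese" is defined as the top 25% of the simulated trait (a hypothetical prevalence chosen for illustration). It is not an attempt to reproduce the paper's numbers.

```python
import random


def decile_rates(r=0.29, prevalence=0.25, n=100000, seed=0):
    """Fraction above the trait cutoff in the bottom and top score deciles."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        score = rng.gauss(0, 1)
        # Trait has unit variance and correlation r with the score.
        trait = r * score + (1 - r ** 2) ** 0.5 * rng.gauss(0, 1)
        people.append((score, trait))

    # Cutoff such that roughly `prevalence` of everyone is above it.
    traits = sorted(t for _, t in people)
    cutoff = traits[int((1 - prevalence) * n)]

    people.sort()  # sort by score
    tenth = n // 10
    bottom, top = people[:tenth], people[-tenth:]

    def rate(group):
        return sum(t >= cutoff for _, t in group) / len(group)

    return rate(bottom), rate(top)


bottom_rate, top_rate = decile_rates()
print("bottom decile:", bottom_rate)
print("top decile:", top_rate)
```

Under these assumptions the top score decile is several times more likely to sit above the cutoff than the bottom decile, even though the score explains under a tenth of the variance (0.29 squared is about 0.08).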

This research comes out of the context of earlier work on the heritability of BMI. It’s around 0.75 or so. That means it runs in families. Combined with the fact that in the recent past, or in other nations, there is a great variation in median size and distribution, one can intuit that genetic dispositions and environmental context both help explain the variation we see around us. The modern American environment is clearly obesogenic. When most of the American population were involved in physical jobs on farms the environmental context was very different.

Over the next few years, these risk scores for BMI will get better, and expand to other populations. One thing that some people are pointing out is that we know it’s heritable, so why not just look at your family? As many of you know, Mendelian segregation means that siblings may have quite different risk profiles on the genomic level. Polygenic risk score prediction is I think going to be extremely interesting and informative in the case of traits which are known to be found within families across generations (e.g., autism), but don’t seem to impact everyone. Perhaps we’ll find that for a given characteristic, expression is random, due to some life event or cofactor such as infection. Or perhaps we’ll find that differences among siblings have some genetic basis in variants inherited from parents?
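The point about Mendelian segregation can be illustrated with a small simulation. This is a sketch under simplifying assumptions (independent variants, random mating, made-up effect sizes and allele frequencies): full siblings draw independently from the same two parents, so their scores are correlated but far from identical.

```python
import random


def transmit(parent, rng):
    """A parent with dosage 0/1/2 passes the allele with probability 0, 1/2, or 1."""
    return rng.random() < parent / 2


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def sibling_scores(n_families=2000, n_variants=100, seed=2):
    rng = random.Random(seed)
    # Hypothetical per-variant effects and allele frequencies.
    betas = [rng.gauss(0, 1) for _ in range(n_variants)]
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_variants)]
    sib1, sib2 = [], []
    for _ in range(n_families):
        s1 = s2 = 0.0
        for beta, f in zip(betas, freqs):
            mom = (rng.random() < f) + (rng.random() < f)
            dad = (rng.random() < f) + (rng.random() < f)
            # Each sibling gets an independent draw from the same parents.
            s1 += beta * (transmit(mom, rng) + transmit(dad, rng))
            s2 += beta * (transmit(mom, rng) + transmit(dad, rng))
        sib1.append(s1)
        sib2.append(s2)
    return sib1, sib2


sib1, sib2 = sibling_scores()
print("sibling score correlation:", correlation(sib1, sib2))
```

With these assumptions the sibling correlation comes out around one half, which leaves plenty of room for two children of the same parents to land in different parts of the score distribution. Family history alone cannot tell them apart; a score computed on each sibling can.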

Addendum: One of the authors, Sek Kathiresan, has been answering questions on Twitter.

Not happening at genomic speed: diversification of GWAS panels

 
One of the things that is evident when you are interested in genetics and genomics is that things happen fast. There are some sciences which proceed at a normal and conventional pace. But, because genomics is fundamentally driven by the synergy of two technologies, modern automated sequencing and computation, the field has been moving at faster than the speed of light. A single whole genome sequence is now cheaper than $1,000, whereas the first whole genome 20 years ago cost $3,000,000,000!

People who point to a paper in 2010…well, in genomics that’s ancient history. Take a look at the initial HapMap papers from the mid-2000s if you want to have a laugh!

But there’s one area where “genomic speed” doesn’t seem to have applied: the attempts to increase the population diversity of GWAS panels, which is necessary to maximize insight. The figure to the right is from a new preprint, Current clinical use of polygenic scores will risk exacerbating health disparities. To my surprise, over the last few years, the proportion of people of European ancestry, which mostly means Northwest European ancestry, in genome-wide association studies has actually increased. The absolute number increases are still heartening, as a lot of the low-hanging fruit can probably be picked at sample sizes of thousands.
