Sibling design or bust for polygenic risk scores

Unless you have been sleeping under a rock, you are aware that there is constant discussion about “polygenic risk scores” (PRS) and residual “population stratification.” The basic issue is that when a risk profile is generated by summing up lots of small genetic effects, the inference of those effects can be skewed by correlations between the environment and genome-wide ancestry. To “correct” for this, researchers traditionally look at the relatedness between individuals, and their population identity, and “account” for that in the model (an older and more primitive method is simply to remove outliers from homogeneous populations and assume there is no stratification).

A new preprint, Demographic history impacts stratification in polygenic scores, suggests limitations in current methods:

Large genome-wide association studies (GWAS) have identified many loci exhibiting small but statistically significant associations with complex traits and disease risk. However, control of population stratification continues to be a limiting factor, particularly when calculating polygenic scores where subtle biases can cumulatively lead to large errors. We simulated GWAS under realistic models of demographic history to study the effect of residual stratification in large GWAS. We show that when population structure is recent, it cannot be fully corrected using principal components based on common variants—the standard approach—because common variants are uninformative about recent demographic history. Consequently, polygenic scores calculated from such GWAS results are biased in that they recapitulate non-genetic environmental structure. Principal components calculated from rare variants or identity-by-descent segments largely correct for this structure if environmental effects are smooth. However, even these corrections are not effective for local or batch effects. While sibling-based association tests are immune to stratification, the hybrid approach of ascertaining variants in a standard GWAS and then re-estimating effect sizes in siblings reduces but does not eliminate bias. Finally, we show that rare variant burden tests are relatively robust to stratification. Our results demonstrate that the effect of population stratification on GWAS and polygenic scores depends not only on the frequencies of tested variants and the distribution of environmental effects but also on the demographic history of the population.

The authors simulated the level of population stratification within the British…which is a very genetically homogeneous group on a worldwide scale. But even here, looking at common variants, as you’d find on most SNP-arrays, there are serious problems with uncorrected stratification. The logic totally makes sense: rare variants capture recent structure, so corrections built only on common variants do not account for those recent demographic scenarios.

Really the only way to fix this situation is a sibling design. Researchers who are curious about quantitative traits need to assemble as many sibling cohorts as they can, and look there. Even re-estimating effect sizes in siblings for SNPs that were identified in a standard GWAS may still cause issues.
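To make the sibling logic concrete, here is a toy simulation. It is only a sketch under assumptions I made up for illustration (two subpopulations, one SNP, allele frequencies of 0.3 and 0.5, a one-unit environmental difference), not the preprint's demographic model. The SNP has no effect on the trait at all, yet a naive regression across unrelated individuals finds one, while the within-family regression on sibling differences does not.

```python
import numpy as np

def simulate_sib_pairs(n_families=20000, seed=0):
    """Toy stratified population: one SNP with NO causal effect on the trait."""
    rng = np.random.default_rng(seed)
    # Hypothetical setup: two subpopulations that differ in allele frequency
    # AND in a purely environmental shift of the phenotype.
    pop = rng.integers(0, 2, size=n_families)
    freq = np.where(pop == 0, 0.3, 0.5)
    env = np.where(pop == 0, 0.0, 1.0)

    # Parental genotypes (0/1/2 copies) drawn within each subpopulation.
    mother = rng.binomial(2, freq)
    father = rng.binomial(2, freq)

    def child():
        # Each parent transmits one allele; P(transmit alt) = genotype / 2.
        return rng.binomial(1, mother / 2) + rng.binomial(1, father / 2)

    g1, g2 = child(), child()
    # Phenotype = shared environment + noise. The true genetic effect is zero.
    y1 = env + rng.normal(size=n_families)
    y2 = env + rng.normal(size=n_families)
    return g1, g2, y1, y2

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (x @ x))

g1, g2, y1, y2 = simulate_sib_pairs()
beta_population = slope(g1, y1)          # one sibling per family, treated as unrelated
beta_sibling = slope(g1 - g2, y1 - y2)   # within-family differences only

print(f"population estimate: {beta_population:.3f}")
print(f"sibling estimate:    {beta_sibling:.3f}")
```

Siblings share parents and (in this toy) the same environment, so differencing removes both the ancestry and the environmental confound. Only the random Mendelian segregation within each family is left to correlate with the trait.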

The arc of selection on polygenic traits

A very important new preprint, Polygenic adaptation after a sudden change in environment:

Polygenic adaptation in response to selection on quantitative traits is thought to be ubiquitous in humans and other species, yet this mode of adaptation remains poorly understood. We investigate the dynamics of this process, assuming that a sudden change in environment shifts the optimal value of a highly polygenic quantitative trait. We find that when the shift is not too large relative to the genetic variance in the trait and this variance arises from segregating loci with small to moderate effect sizes (defined in terms of the selection acting on them before the shift), the mean phenotype’s approach to the new optimum is well approximated by a rapid exponential process first described by Lande (1976). In contrast, when the shift is larger or large effect loci contribute substantially to genetic variance, the initially rapid approach is succeeded by a much slower one. In either case, the underlying changes to allele frequencies exhibit different behaviors short and long-term. Over the short term, strong directional selection on the trait introduces small differences between the frequencies of minor alleles whose effects are aligned with the shift in optimum versus those with effects in the opposite direction. The phenotypic effects of these differences are dominated by contributions from alleles with moderate and large effects, and cumulatively, these effects push the mean phenotype close to the new optimum. Over the longer term, weak directional selection on the trait can amplify the expected frequency differences between opposite alleles; however, since the mean phenotype is close to the new optimum, alleles are mainly affected by stabilizing selection on the trait. Consequently, the frequency differences between opposite alleles translate into small differences in their probabilities of fixation, and the short-term phenotypic contributions of large effect alleles are largely supplanted by contributions of fixed, moderate ones. 
This process takes on the order of ~4Ne generations (where Ne is the effective population size), after which the steady state architecture of genetic variation around the new optimum is restored.

There is a lot to take in here. If you jump to the discussion it frames the preprint's importance pretty well. A lot of selection is probably quantitative and polygenic, but a lot of the empirical investigation has been of sweeps of single-locus alleles that rise to fixation. It strikes me that some of the results here resemble R. A. Fisher’s geometric model of adaptation (The Genetical Theory is the first citation).
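The “rapid exponential process” from the abstract can be sketched in a few lines. This uses the textbook form of Lande's result, where each generation the mean phenotype closes a fraction V_A/V_S of its remaining distance to the optimum. The notation and parameter values here are my own assumptions for illustration (V_A is the additive genetic variance, held constant, and V_S measures the width of stabilizing selection); they are not taken from the preprint.

```python
import math

def lande_trajectory(shift, v_a, v_s, generations):
    """Mean phenotype over time after the optimum jumps from 0 to `shift`.

    Assumes constant additive variance v_a and Gaussian stabilizing
    selection of width v_s, so each generation:
        delta_mean = (v_a / v_s) * (optimum - mean)
    """
    mean = 0.0
    trajectory = [mean]
    for _ in range(generations):
        mean += (v_a / v_s) * (shift - mean)
        trajectory.append(mean)
    return trajectory

# Hypothetical values, chosen only to make the curve easy to see.
shift, v_a, v_s = 2.0, 1.0, 50.0
traj = lande_trajectory(shift, v_a, v_s, 200)

# Continuous-time approximation: distance to optimum decays as exp(-v_a * t / v_s).
approx = [shift * (1 - math.exp(-v_a * t / v_s)) for t in range(201)]
max_gap = max(abs(a - b) for a, b in zip(traj, approx))

print(f"mean after 200 generations: {traj[-1]:.3f} (optimum {shift})")
print(f"largest gap to exponential approximation: {max_gap:.4f}")
```

The point of the preprint, as I read it, is what happens when this simple picture breaks down: with larger shifts or large-effect loci the variance does not stay constant, and the fast phase is followed by a much slower one.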

I read the whole preprint, but I didn’t double-check the formulae. I have neither the ability nor the time. This is where I really wish there was a lot of visible post-publication review. I am very interested in the topic under discussion, and though it is outside the purview of my competency, I know enough that I would probably benefit from extensive comments by others.

This part of the discussion jumped out at me since it echoes my thoughts:

Another implication of our results pertains to the search for the genetic basis of human adaptation, as well as adaptation in other species. Efforts to uncover the identity of individual adaptive genetic changes on the human lineage were guided by the notion that their identity would offer insight into what “made us human”. Under the plausible assumption that many adaptive changes on the human lineage arose from selection on complex, quantitative traits, this approach may not be as informative as it appears (15, 19). Our results indicate that after a shift in the optimal trait value, the number of fixations of alleles whose effects are aligned to the shift are nearly equal to the number of alleles that are opposed (Fig. 6).

Not happening at genomic speed: diversification of GWAS panels

 
One thing that is evident when you are interested in genetics and genomics is that things happen fast. Some sciences proceed at a conventional pace. But, because genomics is fundamentally driven by the synergy of two technologies, modern automated sequencing and computation, the field has been moving at faster than the speed of light. A single whole genome sequence is now cheaper than $1,000, whereas the first whole genome 20 years ago cost $3,000,000,000!

People who point to a paper in 2010…well, in genomics that’s ancient history. Take a look at the initial HapMap papers from the mid-2000s if you want to have a laugh!

But there’s one area where “genomic speed” doesn’t seem to have applied: the attempts to increase the population diversity of GWAS panels, which is necessary to maximize insight. The figure to the right is from a new preprint, Current clinical use of polygenic scores will risk exacerbating health disparities. To my surprise, over the last few years, the proportion of people of European ancestry, which mostly means Northwest European ancestry, in genome-wide association studies has actually increased. The absolute number increases are still heartening, as a lot of the low-hanging fruit can probably be picked at sample sizes of thousands.


Soft & hard selection vs. soft & hard sweeps


When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics than to population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level, because I had read about them back when I was confusing them with hard and soft sweeps! But this was more than ten years ago now, and since then I obviously haven’t given the matter enough thought, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype was the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.
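One way to see the “local patch” versus “universal” distinction is with a toy two-patch model. I am assuming the standard formalization here, where soft selection corresponds to Levene's model (each patch contributes a fixed share of the next generation) and hard selection to Dempster's model (each patch contributes in proportion to how many individuals survive there). The haploid setup and all the fitness values are hypothetical, picked for illustration.

```python
def next_freq_soft(p, patch_share, w_A, w_a):
    """Soft selection (Levene-style): selection is relative WITHIN each patch,
    and every patch contributes its fixed share regardless of mean fitness."""
    total = 0.0
    for c, wa_big, wa_small in zip(patch_share, w_A, w_a):
        mean_w = p * wa_big + (1 - p) * wa_small
        total += c * p * wa_big / mean_w
    return total

def next_freq_hard(p, patch_share, w_A, w_a):
    """Hard selection (Dempster-style): survivors are pooled, so patches with
    low absolute fitness contribute fewer individuals."""
    survivors_A = sum(c * p * wA for c, wA in zip(patch_share, w_A))
    survivors_all = sum(
        c * (p * wA + (1 - p) * wa)
        for c, wA, wa in zip(patch_share, w_A, w_a)
    )
    return survivors_A / survivors_all

# Hypothetical numbers: allele A is favored in patch 0 and disfavored in patch 1.
patch_share = [0.5, 0.5]
w_A = [1.0, 0.2]   # absolute fitness of A in each patch
w_a = [0.5, 0.4]   # absolute fitness of a in each patch
p = 0.5

soft = next_freq_soft(p, patch_share, w_A, w_a)
hard = next_freq_hard(p, patch_share, w_A, w_a)
print(f"soft selection: p' = {soft:.4f}")
print(f"hard selection: p' = {hard:.4f}")
```

With these numbers the allele frequency does not move under soft selection, because the two patches cancel out, while under hard selection A increases, because the patch where A does well simply produces more survivors. The same absolute fitnesses give different evolutionary outcomes depending on how the population is regulated, which is the density-dependent/independent analogy.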

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And the region of the genome around it is also the same, because regions of the genome adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity, since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive most other competing variants to extinction.

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous allele and those without it are not so extreme, and are obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.
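The difference in genetic signature can be sketched with a toy haploid Wright-Fisher simulation. This is a deliberately crude illustration under assumptions I chose myself: no recombination, a fixed population size, and each starting copy of the favored allele sitting on its own distinct haplotype background. A hard sweep starts from one copy, a soft sweep from standing variation starts from several.

```python
import numpy as np

def backgrounds_at_fixation(n_start, pop_size=1000, s=0.1, seed=0):
    """Number of distinct haplotype backgrounds carrying the favored allele
    once it has fixed. Runs where the allele is lost by drift are retried."""
    rng = np.random.default_rng(seed)
    while True:
        counts = np.ones(n_start, dtype=np.int64)  # copies per background
        wild = pop_size - n_start                  # individuals without the allele
        while 0 < counts.sum() < pop_size:
            # Favored copies get relative fitness 1 + s, wild type gets 1.
            weights = np.append(counts * (1 + s), wild)
            draw = rng.multinomial(pop_size, weights / weights.sum())
            counts, wild = draw[:-1], draw[-1]
        if counts.sum() == pop_size:
            return int((counts > 0).sum())

hard = backgrounds_at_fixation(n_start=1)    # single new mutant
soft = backgrounds_at_fixation(n_start=20)   # 20 pre-existing copies
print(f"hard sweep: {hard} background(s) at fixation")
print(f"soft sweep: {soft} background(s) at fixation")
```

A hard sweep can only ever leave one background, which is why the flanking region ends up so uniform. A sweep from standing variation will often leave several, so diversity around the selected site is reduced much less. Real data also involve recombination, which this sketch ignores entirely.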

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt, the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.