Thursday, July 06, 2006

Linkage versus association: a mini-primer   posted by JP @ 7/06/2006 11:32:00 AM

So I recently tried to google my way to a decent summary, aimed at biologists who aren't statistical geneticists, of the difference between two approaches to gene mapping: linkage and association. I didn't really find anything that did a good job of expressing what, in my opinion, needs to be expressed. So here's my attempt to fill that void:

I. Linkage mapping

The principle of a linkage study is the following: if a disease runs in a family, one could look for genetic markers that run exactly the same way in the family (from grandma, to dad, to me, for example). If we find one, we assume the gene that causes the disease is somewhere in the same area of the genome as the marker.

That's all.

In theory, one could genotype generations and generations of a family, and follow the inheritance of the disease. That is, however, not practical, as people tend to do bothersome things like die, and digging up bodies to get DNA samples is unlikely to get past an ethical review (and even if it were ethical, it's tough to know the phenotype of a long-dead great-aunt).

In practice, a popular design is to genotype affected siblings, and use the following logic-- for a given bit of chromosome, each sibling gets two copies, one from mom and one from dad. If the two have inherited the same bits from each parent, the area is more likely to be involved in the disease than if each sibling inherits different bits.
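To make the sib-pair logic concrete: by chance alone, two siblings share 0, 1, or 2 parental copies of any given region with probabilities 1/4, 1/2, and 1/4, so on average they share half. Here is a minimal sketch of one simple way to test for excess sharing among affected pairs (often called the "mean test"). It assumes the sharing state of every pair is known exactly, which real data rarely allows; the function name and the counts in the usage lines are made up for illustration.

```python
import math


def sib_pair_mean_test(n_ibd0, n_ibd1, n_ibd2):
    """Test whether affected sib pairs share a region more often than chance.

    Arguments are counts of sib pairs sharing 0, 1, or 2 parental copies
    of the region (assumed to be known without error).
    Returns (mean sharing proportion, z statistic, one-sided p-value).
    """
    n = n_ibd0 + n_ibd1 + n_ibd2
    if n == 0:
        raise ValueError("need at least one sib pair")

    # Each pair contributes the proportion of copies shared: 0, 0.5, or 1.
    mean_sharing = (0.5 * n_ibd1 + 1.0 * n_ibd2) / n

    # Under no linkage the proportion has mean 0.5 and variance
    # 0.25*(0 - 0.5)^2 + 0.5*(0.5 - 0.5)^2 + 0.25*(1 - 0.5)^2 = 0.125.
    null_variance = 0.125
    z = (mean_sharing - 0.5) / math.sqrt(null_variance / n)

    # One-sided p-value from the normal approximation (excess sharing only).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return mean_sharing, z, p_value


# Hypothetical counts: 100 affected sib pairs at one region.
mean_sharing, z, p_value = sib_pair_mean_test(10, 50, 40)
print(f"mean sharing = {mean_sharing:.2f}, z = {z:.2f}, p = {p_value:.2g}")
```

A region where the mean sharing sits well above 0.5 is a candidate for harboring something involved in the disease; a region near 0.5 looks like chance.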

I purposely didn't use the word "gene" above, because we're not talking about testing specific alleles-- we're talking about chromosomal regions. And that brings me to the first limitation of linkage mapping-- the resolution is low. That is, the chunks of chromosome we're talking about here are millions and millions of base pairs long (recombination over a couple generations doesn't break chromosomes up that much). So even after getting a strong signal, there are generally a number of genes in the area that must be painstakingly tested. This could take years.

A couple of other limitations-- the strongest linkage signals tend to come from recessive and highly penetrant (and thus generally rare) diseases. Why is this? I noted above that the goal is to find regions where two affected siblings have received the same chromosomal segments from each parent, and these are the conditions that ensure that (for those in the know, these are the conditions that lead to the strongest distortion of the IBD vector).

So...linkage is the best approach to detect regions involved in recessive, highly penetrant diseases, and can narrow down the search for causal variants to a few million base pairs, in general.

II. Association studies

The principle of an association study is also simple-- gather some people with a disease and some people without it, and look to see if a certain allele (or genotype) is present more often in the cases than the controls.

If the allele plays a role in causing the disease, or is correlated with a causal allele, it will have a higher frequency in the case population than the control population.
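The comparison of allele frequencies between cases and controls can be sketched as a chi-square test on a 2x2 table of allele counts. This is only one of several ways to test association, and it assumes unrelated individuals and no population substructure; the function name and the counts in the usage lines are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Each argument is a count of chromosomes carrying the tested ("alt")
    or the other ("ref") allele, in cases and in controls.
    Returns (chi-square statistic, p-value).
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column of the table must be non-empty")

    # Standard shortcut formula for the Pearson chi-square on a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / margins

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical study: 1000 cases and 1000 controls (2000 chromosomes each).
chi2, p_value = allelic_chi_square(200, 1800, 40, 1960)
print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}")
```

Note that a small p-value only says the allele is more common in cases; it cannot tell a causal allele from one that is merely correlated with it.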

After a linkage study, one nominates "candidate genes" in the region under the linkage signal, and performs an association study on alleles in the genes. In this way, a specific gene, or even a specific allele, can be identified as playing a possible causal role in the disease. The resolution is much higher, but it was previously implausible to perform these sorts of studies on regions much larger than a couple genes.

However, with the HapMap and the technology to genotype hundreds of thousands of alleles in parallel, it's now possible to perform association studies on the level of the whole genome. This would essentially skip the step of a linkage scan.

What are the limitations of this approach? First, many different mutations in a gene might lead to a disease. In linkage studies, this doesn't pose a problem, since the different mutations are all still in the same region. But in population-level association studies, the effect of each mutation is diluted by the presence of the others.
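A toy calculation shows how that dilution plays out. Suppose the causal chromosomes in a gene are split evenly among k different mutations, and each mutation is tested on its own. The sketch below assumes equally frequent mutations, equal numbers of case and control chromosomes, and entirely made-up frequencies; all names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return n * (a * d - b * c) ** 2 / margins


def per_mutation_chi_square(n_chrom, case_freq, control_freq, k):
    """Chi-square for ONE of k equally frequent causal mutations.

    n_chrom      -- chromosomes sampled in cases (and, equally, in controls)
    case_freq    -- fraction of case chromosomes carrying any causal mutation
    control_freq -- the same fraction in controls
    k            -- number of distinct mutations sharing that total
    """
    case_carriers = round(n_chrom * case_freq / k)
    control_carriers = round(n_chrom * control_freq / k)
    return chi_square_2x2(
        case_carriers, n_chrom - case_carriers,
        control_carriers, n_chrom - control_carriers,
    )


# Made-up numbers: 10% of case chromosomes and 2% of control chromosomes
# carry some causal mutation in the gene.
for k in (1, 2, 5, 10):
    stat = per_mutation_chi_square(2000, 0.10, 0.02, k)
    print(f"{k:2d} mutation(s): chi-square per mutation = {stat:.1f}")
```

The total burden of causal mutations is the same in every row, but the signal at any single tested allele shrinks as the number of distinct mutations grows.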

Further, case-control studies are always subject to problems like population substructure that family-based studies don't have (for those who are really interested in these sorts of questions, see also family-based association studies).

But to detect low-penetrance alleles in complex disease (or any complex phenotype, really), genome-wide association studies will doubtless provide unprecedented views of the contributions of genetic factors (as they are already doing).