Gene Expression: 8th grade math for the rest of us

Monday, November 07, 2005

8th grade math for the rest of us posted by Razib @ 11/07/2005 03:26:00 PM

Occasionally I appeal to formalizations or equations on this weblog to illustrate a general verbal principle. I don't do it to obscure or needlessly technicalize a topic of interest, but rather, it is often a neat and dense way to package the information and express precisely the various relations that I am attempting to communicate. Most of the formulas are extremely light on mathematical subtlety, and there is little need for a level of fluency beyond what one should have attained in 8th grade algebra. The difficulty is almost always an unfamiliarity with the notation. But, the formulas are condensations of some of the general concepts that I want to communicate to readers of this blog. I know that a substantial proportion of the core 300 readers of this weblog come from backgrounds in the mathematical sciences, and the main hurdle for this subset is simply to map over knowledge from their own fields into biology (though I haven't thrown out diffusion equations here in relation to changes in allele frequency, so I'm not sure how much mapping even needs to be done, seeing as how it is almost always 8th grade algebra on display, with a little basic statistics and probability). But many do not come to this weblog from the mathematical sciences, so I am here to reassure you that you need not skip any equation or formalization, because they are mathematically quite trivial and easy to grasp. For me, formalization and mathematicization in the context of this weblog helps everyone trade in a common currency. It facilitates communication and cuts down on needless verbal confusions. Over the past 3 years I myself have slowly become more and more prone toward formalizing my genetical thinking in a "Wright-Fisher" model, so to speak. In the rough outlines little has changed, but I have gained a great deal in precision and predictivity. As I note above, this gain in precision and predictivity is attained via the most minimal acquisitions of mathematical tools. 
Many of the models can be easily illustrated with difference equations, which can be confirmed computationally (in MS Excel even!) because of their discrete nature (i.e., don't sweat not taking any courses in differential equations, linear algebra, probability and statistics, or, yes, even calculus! Though I think if you don't have calculus you probably will miss some of the logic and implications).

Godless & I have never gotten around to a "GNXP FAQ." The reasons are rooted in human psychology: this weblog is a hobby, a pastime, and writing an FAQ requires forethought and effort we simply never felt we could spare! Nevertheless, I do link to technical webpages whenever I can when I use an equation, and at this point I feel I should go over a few formulas that are particularly useful, and good currency to have "under your belt" so that one can get beyond vague impressions and intuitions. And I want to emphasize: there really isn't much beyond 8th grade algebra here!

For example, consider the Hardy-Weinberg Equilibrium (HWE)....

p² + 2pq + q² = 1

p = frequency of allele ("A") at a locus in the population
q = frequency of allele ("a") at a locus in the population (with only two alleles, p + q = 1)
p² = frequency of homozygote A genotype in a population (that is, frequency of p squared, p X p, because you have two copies, AA, at a locus)
q² = frequency of homozygote a genotype in a population
2pq = frequency of heterozygote genotype in a population

G.H. Hardy (most well known to the public because of his collaboration with Ramanujan) thought little of this formula, and didn't understand why it wasn't obvious to biologists! As most of you know from high school biology, if you have two parents who are heterozygotes for a "dominant" and "recessive" allele, that is, they are Aa and Aa, on a locus, and they have offspring, 1/4 of the progeny will exhibit the recessive trait and 3/4 will not (because only the 1/4 who are homozygotes, aa, express the trait). Biologists in the early 20th century actually debated why populations did not exhibit the 3:1 ratio that emerged out of the Punnett squares. Of course, the key is that the expression of the recessive trait is a function of the frequency of the recessive allele within the population: the recessive phenotype appears at frequency q², so the 3:1 ratio on a populational level is only valid where p and q are both at frequencies of about 50%.
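If you would rather check the 3:1 point numerically than take my word for it, here is a minimal sketch in Python. The function name `hardy_weinberg` and the particular values of q are my own choices for illustration, and the usual HWE assumptions (random mating, no selection, and so on) are baked in.

```python
def hardy_weinberg(p):
    """Return genotype frequencies (AA, Aa, aa) given p, the frequency of allele A.

    Assumes a single locus with two alleles, so q = 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


# How common is the recessive phenotype (aa) as the recessive allele gets rarer?
for q in (0.5, 0.3, 0.1, 0.01):
    aa = hardy_weinberg(1.0 - q)[2]
    print(f"q = {q:<5} recessive phenotype frequency = {aa:.4f}")
```

Only at q = 0.5 does the recessive phenotype show up in a quarter of the population; at q = 0.1 it is down to 1 in 100.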

This model has many assumptions (like all models, otherwise, they wouldn't be models!), but, it works as a good null hypothesis. You might think that it isn't worthwhile to even know this equation, but I disagree, for two reasons....

1) It gives more concreteness to your ability to think about genetic questions, and it is the basis for extrapolation of that thought. Believe it or not, HWE is the starting point for a lot of mathematical population genetic models, and for a reason.

2) A lot of the genetic issues that come up in everyday discourse are amenable to in-your-head-HWE calculations. For example, cystic fibrosis is a recessive disease. So is Rh- status (again, recessive). The payoff in terms of utility is not that great, but the ease of doing the calculation in your head and being able to brush aside verbal qualifications is, I think, useful. Remember 3 years ago when stories popped up that 'blondes were going extinct' (it seems like it was a hoax)? As our own David Burbridge noted, this seemed a peculiar assertion to make sans massive selection (some selection was implied of course, though little evidence was given as to the magnitude of differential fitness): the alleles which cause blondism likely wouldn't disappear from the population, assuming that one models it as a 'recessive' trait (I actually think the recessive-dominant talk confuses more than helps, but that's another post).
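Here is a rough sketch of the kind of in-your-head calculation I mean, written out in Python. The incidence of 1 in 2,500 is purely a hypothetical round number chosen to make the arithmetic easy, not a figure for any particular disease or population, and the function names are mine.

```python
import math


def carrier_frequency(incidence):
    """Given the frequency of affected (aa) individuals, return (q, 2pq) under HWE."""
    q = math.sqrt(incidence)  # aa frequency is q squared
    p = 1.0 - q
    return q, 2 * p * q


def next_generation_q(q):
    """Frequency of allele a after one round of random mating with no selection.

    Every aa individual passes on a; heterozygotes pass it on half the time.
    """
    p = 1.0 - q
    return q * q + (2 * p * q) / 2


q, carriers = carrier_frequency(1 / 2500)  # hypothetical incidence
print(f"q = {q:.3f}, carriers = {carriers:.4f} (about 1 in {1 / carriers:.0f})")
print(f"q after one generation of random mating: {next_generation_q(q):.3f}")
```

The second function is the 'blondes going extinct' point in miniature: absent selection, q² + pq = q(q + p) = q, so the recessive allele goes nowhere.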

So, with the preliminaries over, on to a few trivial issues which I hope to bring to your attention. I want to first excuse myself of any pretense at a high standard of rigor and formality, the threshold for success is low if we get beyond vague verbal platitudes. There are lots of different notations, and I'm mildly constrained by the formatting of HTML (no, I don't want to use MathML, this is not a math weblog, and I don't want to give people bizarre javascript warnings). So bear with me....

Variation

V_p = V_a + V_d + V_e + V_i

V_p = Variation of the phenotype (trait)
V_a = Additive genetic variation
V_d = Dominance genetic variation
V_e = Environmental variation
V_i = Interactional (epistatic) variation

Normally you should see some σ² (where I put the V's) to make it clear that you are talking in terms of statistical parameters of variance. Roughly speaking, all that matters is that the variation of a given phenotype attributable to a subcomponent is being illustrated here. If you want to get into the nitty-gritty of this equation, the last chapter of Principles of Population Genetics is nice, while Introduction to Quantitative Genetics really does more than "introduce" you to the nuances of this formalization. But, on this weblog we have talked about V_a a lot, because that is what is to a large extent responsible for the normal distribution you see on continuous quantitative traits. When we are delving into psychometrics, this really matters. But, to get less contentious, let me use height as an illustrative trait. Clearly there is a median and modal height, an "average," and variation around this average. Some of this is no doubt due to genes; we see rough relationships between parents and their offspring. But the key is that the relationship is rough: there are other components of the variation. As Francis Galton first noted, if you plot offspring height against parental height one notes a correlation, but an imperfect one. V_a, the additive genetic variance, is roughly proportional to the regression coefficient of the line of best fit. In plain English, the similarity in height between a population of parents and their offspring (corrected for sex) is due to V_a. V_a on the genetic level can be thought of as constituted by genes of small effect across the loci. Going back to the notation I used for the Punnett square, instead of just AA, Aa and aa on a locus, imagine multiple loci, each with its own cluster of alleles, and each contributing a small increment independently to the phenotype. 
A large number of random variables of small effect will tend to result in a modal, that is, most frequent, value that is also intermediate on the range of the distribution (see central limit theorem, I don't know how to put this in plain English well). But here is the key: V_a as a proportion of V_p, which is basically the "narrow sense heritability," is a populational proportion of variation; it doesn't mean much on the individual level. This is important because a recurrent problem on this weblog is that people have conflated the point and taken a stand on the "Nature vs. Nurture" issue by stating that "I believe that trait X is mostly genetic/environmental." Though the gist of the comment is well taken, the way the phrase is put suggests a misunderstanding of the details of what is being communicated, that is, these values explain populations, but are not necessarily relevant when speaking, or conceiving of, individuals.
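Since I couldn't put the central limit theorem in plain English, here is a sketch that puts it in Python instead. It assumes the most idealized case imaginable: every locus has a "+" allele at the same frequency p, every "+" allele adds the same increment, loci are independent, and there is no dominance, epistasis or environment. None of that is claimed for any real trait; the function names are mine.

```python
from math import comb


def additive_trait_distribution(n_loci, p=0.5):
    """Exact distribution of the number of '+' alleles over n_loci diploid loci.

    With independent loci and the same allele frequency p everywhere, the count
    of '+' alleles among the 2 * n_loci gene copies is binomial.
    """
    copies = 2 * n_loci
    return [comb(copies, k) * p**k * (1 - p) ** (copies - k)
            for k in range(copies + 1)]


def mean_and_variance(dist):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(k * f for k, f in enumerate(dist))
    var = sum((k - mean) ** 2 * f for k, f in enumerate(dist))
    return mean, var


for n_loci in (1, 5, 50):
    dist = additive_trait_distribution(n_loci)
    mean, var = mean_and_variance(dist)
    mode = max(range(len(dist)), key=dist.__getitem__)
    print(f"{n_loci:>2} loci: possible values 0..{2 * n_loci}, "
          f"mode = {mode}, mean = {mean:.1f}, variance = {var:.1f}")
```

With one locus you get the familiar 1:2:1 of AA, Aa and aa; as the number of loci goes up the same logic piles the population around an intermediate value, with the bell shape emerging on its own.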

Additionally, the calculation of these values is often problematic. I have also left off the important confounding parameters of gene-environment association and gene-environment interaction. The former is basically an increase of the dispersion because of the correlation of particular genotypes to particular environments (the classic example is the increase in feed to extremely productive dairy cows; this obviously increases the difference between the animals in terms of their milk productivity since the environmental variable is different across them). Gene-environment interaction is even more difficult, as it involves the unpredictable reactions of genotypes to varied environments. This is the classic norm of reaction problem. The epistatic gene interactions are also important confounders of the idealized genetic world, as gene-gene interactions break up the calm of the additive independent loci universe. In sum, this idealized model is a utopia which is never attained in the real world. Does that mean we should give up on this formalization?

Developmental psychologist Alison Gopnik noted earlier this year:

Because humans create and change their own environments we don't and can't, in principle, know what the range of possible environments will turn out to be. And we don't know how those possible environments might interact with genetics over the course of development to cause a particular distribution of adult traits. This means that we simply don't and can't know how much genes contribute to complex human traits in general — the question is incoherent. In a particular case, with a particular specified environment, and a complete developmental history of the causal interactions between the organism and the environment, we might be able to give a causal account of the path from gene to trait. But there is no general answer about how gene and trait are related across all environments.

Frankly, I wonder why Alison Gopnik is not a solid state physicist or a novelist, because these types of assertions make me wonder as to the validity of any attempt at a rigorous and predictive intellectuality any softer than molecular biology! Ultimately, such stringent standards for controlling of variables seem to render only the purest of humanities and the hardest of sciences immune from the taint of noise, the former because of its unalloyed reliance on intuition, and the latter because of its recourse toward deterministic models of extremely precise caliber. Why is Alison Gopnik a developmental psychologist? Does her "Theory Theory" stand up to the scrutiny of this standard? I doubt it.

The fact is anyone who is doing work in a noisy and statistical science is very well aware of confounding factors. It is part of the business. But even in the midst of the murkiness of the complex and messy world around us, we attempt to model it as best as we can. This is what Gopnik is doing in her corner of cognitive science. This is what public policy analysts do when they decompose and examine topics of human importance in a rational manner. If the critiques of the likes of Gopnik are taken to their logical conclusion, outside of the physical, and to some extent biophysical, sciences, all we will have recourse to are customs, traditions and a priori values. Is this what Gopnik would want? If you read Curious Minds, you will get a feel for the cultural milieu that Gopnik grew up in, and her personal biases in terms of politics and culture. I doubt that she would be excited about mapping her views expressed above to other fields. I have pointed out this hypocrisy before, that the norms and biases we hold tend to result in wildly different standards of rational integrity and model building depending on the domain we are addressing. As a cognitive scientist, Gopnik should know better!

There are two key problems here which are the real root of Gopnik's issue (at least on the objective big picture scale as opposed to tactical sniping within-field). First, verbal expositions about variation tend to underemphasize the various subcomponents of the variation. Amongst those "in the know," the problems are known, accounted for, and we keep moving on operationally with our mental process. But, amongst those not in the know, it seems we are ignoring the other components of variation. Obviously this is a function of clarity and reiteration. But, there is another problem: the tendency of cognitive models to canalize into typologies. Two independent variables contribute to this specifically: human bias toward slotting statistical-populational concepts and trends into idealized types, boxes, bins and categories, usually in relation to "theories" in a Gopnikesque manner, and, frankly, stupidity. A large proportion of the human race, perhaps the vast majority with sub-115 IQs, simply can not understand basic statistics, conditional probabilities, and so on. But even for those whose intelligence and motivation are high, it can be difficult to break out of preconceived boxes. The "Nature vs. Nurture" box is a very powerful one, and I believe it coopts a natural tendency to think in terms of black and white dichotomies. We have to climb up the hill of our biases with the rope and ladder of logical-abstraction, and allow ourselves to be guided by the system and the mathematical logic. Simply keep the faith alive!

OK, enough polemics, I want to leave you with something positive.

R = h² X S

R = response, S = selection differential (how far the mean of the selected parents sits from the mean of the whole population) and h² is the narrow sense heritability I mentioned above (additive genetic variance as a proportion of the phenotypic variance, V_a/V_p). You often hear talk of evolution, but what about quantifying how fast it occurs? This "prediction equation" comes out of animal breeding, but the basic principle is obvious: the response to selection is a function of the amount of selection you are engaging in (i.e., selecting a subset of the overall population via some truncation for minimum phenotype value) and the narrow sense heritability. Over time if the selection is strong enough the additive genetic variance should be exhausted, and "evolution" should stop. This sort of start-and-stop gradualism is common in microevolutionary processes (though even in breeding it takes a while, especially in a large population). When someone wonders if black Americans are somehow more robust because of slavery, remember, you need to have high heritability and strong selection for "robustness" to be shifted in just a few generations. You too can "do the math"!
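To "do the math" concretely, here is a small sketch of the prediction equation in Python. The numbers (a heritability of 0.5, parents selected 2 units above the population mean) are invented for illustration, and holding h² and the selection differential constant over several generations is a simplification of mine that only makes sense in the short run, since the additive variance gets used up.

```python
def response_to_selection(h2, selection_differential):
    """Breeder's equation: R = h2 * S, the expected shift in the trait mean."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("narrow sense heritability must be between 0 and 1")
    return h2 * selection_differential


def project_mean(start_mean, h2, selection_differential, generations):
    """Trait mean per generation under repeated selection.

    Holds h2 and S fixed every generation, which is a strong simplification.
    """
    means = [start_mean]
    for _ in range(generations):
        means.append(means[-1] + response_to_selection(h2, selection_differential))
    return means


# Hypothetical trait: population mean 100, selected parents average 102, h2 = 0.5.
print(response_to_selection(0.5, 2.0))
print(project_mean(100.0, 0.5, 2.0, 3))
```

Halve the heritability or the selection differential and the response halves with it, which is why a big shift in a few generations needs both to be large.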

Molecules and phylogenies

Back in the mid-1980s we started hearing a lot about "African Eve." As we note on this weblog, there were two parts of this:

1) A caution not to overread the results into imagining a single foremother.
2) The popular press basically ignoring this caution.

But here are a few "fun facts."

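1) The probability of fixation of a new neutral mutation is 1/(2N), where N is the population size.
2) The time until fixation, for a neutral mutation that does fix, is usually about 4N generations, where N is the population size.
3) 2Nμ X 1/(2N) = μ, where μ is the mutation rate, reflects the reality that the number of new mutations per generation (2Nμ) times the probability that any one of them fixes (1/(2N)) results in the rate of substitution, the turnover of alleles, on loci being a function just of mutation rate (μ).
4) Time until extinction of a new allele is usually about 2(N_e/N) X ln(2N) generations, where N_e is the effective population size.
5) The long term effective population N_e is the harmonic mean of the population sizes over t generations: 1/N_e = (1/t)(1/N₀ + ... + 1/N_{t - 1}).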
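The first four "fun facts" are easy to turn into a few lines of Python if you want to plug in numbers. This is only a sketch: it assumes neutral mutations in a diploid population, the function names are mine, and the population size and mutation rate in the example are arbitrary illustrative values.

```python
import math


def fixation_probability(n):
    """Chance that a brand new neutral mutation eventually fixes: 1/(2N)."""
    return 1.0 / (2 * n)


def mean_time_to_fixation(n):
    """Average generations to fixation for a neutral mutation that does fix: 4N."""
    return 4.0 * n


def substitution_rate(n, mu):
    """New mutations per generation (2N * mu) times the chance each one fixes."""
    return (2 * n * mu) * fixation_probability(n)


def mean_time_to_loss(n, n_e):
    """Average generations until a new neutral allele is lost: 2(N_e/N) ln(2N)."""
    return 2.0 * (n_e / n) * math.log(2 * n)


n, mu = 1000, 1e-8  # arbitrary illustrative values
print(f"P(fixation)        = {fixation_probability(n)}")
print(f"time to fixation   = {mean_time_to_fixation(n):.0f} generations")
print(f"substitution rate  = {substitution_rate(n, mu):.2e} (N cancels out)")
print(f"time to extinction = {mean_time_to_loss(n, n):.1f} generations")
```

Notice the asymmetry: the rare allele that fixes takes thousands of generations to do it, while the typical new allele is gone in a couple of dozen.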

You can do some "plug & chug" in Excel if you want to check out the long term effective population, but trust me when I tell you that it is far closer to the low bound as a function of time than the high bound as far as what your intuition would tell you. That is, population bottlenecks are extremely salient demographic events, and tend to be important even after the population bounces back. This also should make one cautious about assuming that larger populations have more genetic variation than smaller ones, that is only true if you correct for the long term effective population, as often large populations are simply "blow ups" of small populations. Probability of fixation, extinction and time until fixation should give you a clue as to an important fact: lineages come and go, the only thing that is inevitable in a world without selection is extinction. If you go back 100,000 years, about 5,000 generations, it stands to reason that most lineages will have gone extinct. Think of it like surnames that are passed through the male line, how likely is it going to be that a surname will be passed for 5,000 generations through an uninterrupted line of males? (remember, mtDNA is passed through females, flip the logic!) Using some of the equations above, and considering the wild fluctuations in population size that were likely in the past, the caution seems a lot more solid I think.
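If you don't feel like opening Excel, the same "plug & chug" fits in a few lines of Python. The long term effective population is computed here as a harmonic mean, that is, the number of generations divided by the sum of the reciprocals of the per-generation sizes. The census sizes below are made up: nine generations at 10,000 and one bottleneck generation at 100.

```python
def long_term_effective_size(sizes):
    """Harmonic mean of per-generation population sizes."""
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("need at least one generation, all sizes positive")
    return len(sizes) / sum(1.0 / n for n in sizes)


# Hypothetical history: nine generations at 10,000, one bottleneck at 100.
sizes = [10_000] * 9 + [100]
arithmetic_mean = sum(sizes) / len(sizes)
print(f"arithmetic mean size     = {arithmetic_mean:.0f}")
print(f"long term effective size = {long_term_effective_size(sizes):.0f}")
```

A single generation at 100 drags the long term figure below 1,000 even though the ordinary average is over 9,000, which is the sense in which bottlenecks stay salient after the population bounces back.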

Kin

r X B > C

The logic of kin selection. r = coefficient of relatedness, roughly, the chance that an allele picked from a locus will be identical by descent to an allele selected from another individual at the same locus (i.e., 1/2 between siblings, 1/2 between parents and offspring, 1/4 between grandparents & grandchildren, etc.). B = the fitness benefit and C = the cost. It's a simple way of thinking about the world: kin clearly matters. As J.B.S. Haldane was reputed to have said, he would "give up his life for 2 brothers or eight cousins" (r of 1/2 to a sibling, 1/8 to a cousin, makes sense).
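Haldane's quip can be checked against the inequality in a couple of lines. This is just a sketch with my own function name; the "cost" is counted as one life (Haldane's own) and the "benefit" as the number of relatives saved.

```python
def favors_altruism(r, benefit, cost):
    """Hamilton's rule: the altruistic act is favored when r * B > C."""
    return r * benefit > cost


# Cost = one life given up; benefit = number of relatives saved.
for label, r, saved in (("brothers", 0.5, 2), ("brothers", 0.5, 3),
                        ("cousins", 0.125, 8), ("cousins", 0.125, 9)):
    verdict = "favored" if favors_altruism(r, saved, 1) else "not favored (break-even at best)"
    print(f"{saved} {label}: r X B = {r * saved:.3f} -> {verdict}")
```

Strictly speaking two brothers or eight cousins is exactly the break-even point, so the inequality only tips once you save one more than that.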

But here is one thing to consider: recent literature reviews I have seen in Molecular Markers, Natural History and Evolution as well as Cooperation among Animals suggest that kin selection can not explain the full range of eusociality amongst haplodiploid hymenoptera across that taxon. These insects were William Hamilton's original model because sisters are more closely related to each other, r = 0.75, than they would be to their own offspring (r = 0.5), since males are haploid. In many species the coefficients of relatedness are empirically far lower than the idealized kin selection models assume, due to multiple queens and multiple inseminations of each queen (there weren't genetic tests when Hamilton published his original paper in The Journal of Theoretical Biology). And yet eusociality exists amongst them, just as it also flourishes in other species that do fit the ideal kin selective parameters. The explanatory bubble seems to have burst...or has it? One solution is that kin selection might have been the necessary condition for eusociality to evolve, that is, the ancestral state, but once eusociality was a feature of this class of organisms other mechanisms also appeared to reinforce it, allowing the relaxation of the necessity for a high coefficient of relation.
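To see how quickly multiple inseminations erode that 0.75, here is a sketch using the standard textbook expression for average relatedness among sisters in a haplodiploid colony, r = 1/4 + 1/(2n), where n is the number of males the queen mated with. The assumptions are mine to flag: a single queen, and every male fathering an equal share of daughters. Real colonies with several queens would come out lower still.

```python
def sister_relatedness(n_mates):
    """Average r among daughters of one haplodiploid queen mated to n_mates males.

    Sisters always share 1/4 through the mother; the paternal half is shared
    only between sisters with the same (haploid) father, hence 1/(2n).
    Assumes equal paternity across males.
    """
    if n_mates < 1:
        raise ValueError("the queen must have mated at least once")
    return 0.25 + 0.5 / n_mates


for n_mates in (1, 2, 5, 10):
    print(f"{n_mates:>2} mates: r between sisters = {sister_relatedness(n_mates):.2f}")
```

With just two fathers, sisters are no more related to each other (r = 0.5) than they would be to their own offspring, so the special haplodiploid advantage is already gone.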

This is not to deny Hamilton's point or his logic, but, like Trivers' reciprocal altruism, or the older ideas of group selection, I think it should make us cautious of "one formalization to rule them all!" Gopnik was right in that generalization is a difficult enterprise in higher order biology, and even more difficult in the human sciences (hard to do controlled lab tests!). But that does not mean that formalization is all for naught. In the sciences we should attempt to achieve the maximum level of predictive power with the minimum number of parameters. If a given species requires a synthesis of various models in a layered palimpsest, that's life. Model building must continue, because even if each model is only a brick, the house needs those bricks to ultimately be completed.

OK, I was going to hit a lot more equations and formulas...but I've already gone WAY too long because of my inability to shut up with the words. Perhaps I'll revisit this topic. But, here are some books you might find of some interest:

Principles of Population Genetics
Genetics of Populations
Introduction to Quantitative Genetics
Genetics and Analysis of Quantitative Traits
Narrow Roads of Gene Land
Evolutionary Quantitative Genetics
