Thursday, May 08, 2008
Continuing my series of notes on the work of Sewall Wright, this one deals with the subject of genetic drift. I had originally planned to call this note 'Inbreeding and the decline of genetic variance', but anyone interested in the matters covered here, and searching for them on the internet, is far more likely to search for 'genetic drift'. This is one of the subjects most closely associated with Wright, to the extent that genetic drift was formerly often known as the 'Sewall Wright Effect'. My main aim is to help people follow Wright's own derivation of his key results, and to clarify the relationship between genetic drift and inbreeding.
I will refer mainly to the papers reprinted in the collection Evolution: Selected Papers, (ESP) and especially the monumental 1931 paper on 'Evolution in Mendelian Populations', which is available online here. Anyone interested in Wright should also read William B. Provine's biography of him. If in these notes I occasionally make critical remarks on Provine, it should not detract from the general excellence of his book. See the References for details. In an infinitely large population, in the absence of selection and mutation, the proportions of different gene types (alleles) in the population will remain unchanged indefinitely. But real populations are never infinitely large, and gene frequencies will fluctuate to some extent by chance. As Wright put it in 1931, 'Merely by chance one or the other of the allelomorphs [alleles] may be expected to increase its frequency in a given generation and in time the proportions may drift a long way from the initial values' (ESP, p.107.) The general nature of drift can be illustrated by the hackneyed example of coin tossing. If we simultaneously toss a number of 'fair' coins, and repeat the trial a large number of times, then the average proportion of heads, by the definition of a fair coin, will be 1/2, and the average number of heads per trial will be N/2, where N is the number of coins in a trial. More generally, suppose the probability of heads for each coin is always p, where p is any fraction between 0 and 1. The long term average number of heads per trial will then be Np. But on any particular trial, purely by chance, the number of heads is likely to deviate from the average. It can be shown that the variance of the number of heads per trial is Npq, where q = 1 - p. [Note 1] If we are interested in the proportion of heads per trial (the number of heads divided by N), it can be shown that the variance of the proportion is pq/N. 
[Note 2] On each trial, the proportion of heads is therefore likely to deviate from the long term average by a quantity of the order of root(pq/N), the standard deviation of the proportion.

Departing now from the real behaviour of coins, let us suppose that the value of p on each trial is determined by the proportion of heads in the previous trial. The proportion of heads will then drift up and down in a 'random walk' pattern, with the size of the 'steps' being inversely related to the size of N. If N is very large, each step will be small, but if N is small the steps may be relatively large. If, by chance, the proportion of heads in a trial ever reaches 1 or 0, then p for all future trials will also be 1 or 0, and heads (or tails) will be permanently 'fixed'. This is very likely to happen sooner or later.

Genes are not coins, so the analogy is not perfect. In a population of genes, the replication of each gene is not a simple matter of 'heads or tails', as each gene may have 0, 1, 2 or more descendants. Also, while the number of coins is assumed to be fixed at N, a biological population is seldom absolutely fixed in size. Nevertheless, there are important similarities. In the absence of selection, it is a matter of chance whether or not a particular gene enters an egg or sperm and then survives to reproduce again in the next generation.

Suppose that there are two alleles, A and B, at each locus, with the frequencies p and q in the population. In the absence of selection and mutation, these will also be the expected frequencies in the next generation. In a population of N diploid individuals, there are 2N genes in the population at each locus. In a stable population there will still be 2N genes in the next generation. We can schematically represent reproduction as a 'trial' consisting of 2N events, each involving the random choice of a gene to enter the new generation, with probabilities of p and q for the 'outcomes' A and B at each choice.
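The schematic 'trial' of 2N random choices can be sketched in a few lines of Python. This is only an illustration under the assumptions just stated (fixed population size, no selection or mutation, two alleles); the function names, the seed and the population sizes are my own choices, not anything taken from Wright.

```python
import random

def next_generation_frequency(p, n_diploid, rng):
    """One 'trial': choose 2N genes, each being allele A with probability p."""
    copies = 2 * n_diploid
    count_a = sum(1 for _ in range(copies) if rng.random() < p)
    return count_a / copies

def simulate_drift(p0, n_diploid, generations, seed=0):
    """Frequency of A over time, where each generation's frequency
    becomes the sampling probability for the next generation."""
    rng = random.Random(seed)
    trajectory = [p0]
    for _ in range(generations):
        trajectory.append(
            next_generation_frequency(trajectory[-1], n_diploid, rng)
        )
    return trajectory

# Compare a small and a large population, both starting from p = 0.5.
for n in (10, 1000):
    path = simulate_drift(0.5, n, 50)
    print(n, [round(x, 2) for x in path[::10]])
```

With a small N the printed frequencies should usually wander much further from 0.5 than with a large N, and once a run reaches 0 or 1 it stays there, which is the 'fixation' described above.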
The probabilities of obtaining the various possible combinations of A's and B's are then given by the expansion of the binomial (p + q)^(2N). Wright himself uses this model of the process on several occasions, e.g. ESP p.289. While this may seem a very artificial way of viewing reproduction, it is not as unrealistic as it seems. Suppose that N diploid individuals each have the same number of offspring, the number being large, and certainly large enough to ensure that there are at least 2N copies of each allele among the population of offspring. Then select N of the offspring as 'survivors', completely at random, which is analogous to survival in a resource-limited population without natural selection. The probability of the various possible gene frequencies will then be approximately as in the schematic model (with the complication that in a finite population of offspring the probability of selecting an offspring with a given allele will be affected by the number already selected, e.g. if nearly all the alleles of a given type have, by chance, already been selected, the probability of selecting another one will be much reduced).

Nothing has so far been said about inbreeding. Moreover, the processes just described would apply not only to sexually reproducing organisms but also to asexually reproducing organisms and genetic elements, such as mitochondria and Y chromosomes, where the possibility of inbreeding does not arise. But in Wright's treatment of the subject, references to inbreeding are frequent, and the rate of genetic drift is derived by an argument which seems to depend on the existence of inbreeding. For example, on p.165 of ESP he says: 'If the population is not indefinitely large, another factor must be taken into account: the effects of accidents of sampling among those that survive and become parents in each generation and among the germ cells of these, in other words, the effects of inbreeding'.
Such statements are likely to give the impression that inbreeding is fundamental to the process of genetic drift. How can this be? The explanation is that in a sexually reproducing population a convenient measure of genetic drift is the changing proportion of homozygotes, and the existence of homozygotes is related to inbreeding. If a given allele has ultimately arisen from a single mutation, then homozygous copies of that allele can only occur in the same individual if that individual is descended from the same ancestor by at least two paths, which is by definition inbreeding. Even if the allele has more than one origin, the level of inbreeding in the population will affect the level of homozygosis. But as the example of asexual organisms shows, there is no necessary connection between genetic drift and inbreeding. R. A. Fisher, in his different approach to the subject, does not (I think) ever refer to inbreeding. Confusing the two things would be like confusing the study of heat with the study of thermometers. It may therefore be wondered why Sewall Wright took his particular approach. The answer may be partly that his mathematical training was less advanced than Fisher's, so that he was obliged to use less mathematically sophisticated methods. This has the advantage that his work on the subject is in principle accessible to a wider range of readers. Moreover, on one important point Wright's methods got the correct result where Fisher, through neglecting a quantity which turned out not to be negligible, got the wrong result by a factor of 2 (as Wright never tired of pointing out). But I think the main reason for Wright's approach was that he first investigated genetic drift in the context of agricultural breeding, where livestock are often closely inbred. In this context one of the main concerns is to quantify the loss of genetic variation in each particular inbred strain. 
It was therefore natural for Wright to approach the subject by measuring the loss of heterozygosis associated with inbreeding. When he later turned to consider genetic drift in natural populations, where mating is approximately random, he continued to use the methods he had already devised for the study of inbreeding in agriculture. (I will not now explore the precise meaning of Wright's coefficients of inbreeding (the famous F-statistics) which I hope to deal with in another note.) Wright's most important finding was that heterozygosis (the proportion of heterozygotes in the population) tends to decline at a rate of 1/2N per generation, where N is the diploid population size. (This assumes that males and females each have a population size of N/2.) Most textbooks give a simplified version of Wright's derivation of this result. Wright's own treatment, in EMP, is difficult to follow, and in view of its importance I have provided a guide in Note 3 below. Even the simplified textbook versions are not always very clear, and I do not know of any wholly satisfactory account. Key assumptions are often not clearly stated or justified. Two relatively good accounts are those of Falconer and Maynard Smith (see Refs.) I will outline a derivation based mainly on Falconer (with some modifications). Let us assume there is a population of N diploid individuals. Generations are separate. There is no mutation or natural selection in the period under consideration. The n'th generation is designated Gn, the previous generation by Gn-1, the following generation by Gn+1, and so on. The probability that the two genes at the same locus in an individual of Gn are identical is designated CIn, where CI stands for 'coefficient of inbreeding'. (For my approach here it is not necessary to specify whether the genes are identical 'by descent'.) 
The probability that two randomly selected genes at the same locus in two different individuals of Gn are identical is designated CKn, where CK stands for 'coefficient of kinship'.

For the simplest case, consider a population of hermaphrodites which are capable of self-fertilisation and mate completely at random, including with themselves. (This would be approximately true of some marine invertebrates which release gametes into the water.) From the assumptions of random mating and non-selection it follows that any individual in Gn is equally likely, with probability 2/N, to be a parent of any individual in Gn+1 (since in a stable population each individual will on average have 2 out of the N surviving offspring). It does not follow that, if we select at random an individual in Gn+1, and then select another, there is a probability of 2/N that the second individual will have the same father (or mother) as the first. For example, if each individual in Gn produced exactly 2 surviving offspring, the probability that a second randomly selected individual in Gn+1 had the same father (or mother) as the first would only be 1/(N-1). To get a probability of 2/N we require an additional assumption, which is technically satisfied by specifying that the number of offspring for individuals follows a Poisson distribution. (This assumption is mentioned by Maynard Smith but not by Falconer.)

With these assumptions, it follows that CIn equals CKn. In the case of CIn, we select a gene at random in Gn, and then inquire whether the other gene at the same locus in that individual is identical. In the case of CKn, we select a gene at random in Gn, and then inquire whether another randomly selected gene at the same locus in a different randomly selected individual is identical to the first gene. But in both cases each gene is a copy of a gene taken absolutely at random from all the genes in Gn-1. The probabilities of identity are therefore the same, and CIn therefore equals CKn.
By the same argument it follows that any two randomly selected distinct genes at a locus in Gn have the same probability of being identical, whether they are in the same or different individuals. If we call this probability CDn, we have CDn = CIn = CKn, for any value of n. But CIn can be broken down into two component probabilities. With probability 1/2N, the two genes at a locus in the same individual are copies of the very same gene in Gn-1, in which case they are certainly identical. In all other cases, therefore with probability 1-1/2N, they are copies of two distinct genes in Gn-1, in which case there is a probability CDn-1 that they are identical. But CDn-1 = CIn-1 (since the equality CDn = CIn applies for any value of n). The total probability CIn therefore comes to CIn = 1/2N + (1 - 1/2N)CIn-1. The coefficient of inbreeding in one generation is therefore derivable from the coefficient in the previous generation by a formula involving the addition of 1/2N. It can further be shown, with a little algebraic manipulation, that heterozygosis tends to decline by a factor of (1 - 1/2N) per generation (see Falconer p.64-5 for a proof). If self-fertilisation is excluded, two genes in the same individual cannot be copies of the very same gene in the previous generation, so the analysis needs to be pushed further back. If mating between different individuals is completely random, including siblings, then CIn = CKn-1. If mating between siblings is excluded, but otherwise random, CIn = CKn-2, and so on. But it is always possible to express the 'coefficient of inbreeding' in one generation in terms of the coefficients in previous generations, and heterozygosis always tends to decline by a factor of (1 - 1/2N) per generation (assuming equal numbers of males and females). The above argument, like Wright's own, measures the progress of genetic drift by the decline of heterozygosis and the associated increase in the coefficient of inbreeding. 
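The 'little algebraic manipulation' for the self-fertilising case can be checked numerically. The sketch below simply iterates the recursion CIn = 1/2N + (1 - 1/2N)CIn-1 and looks at the ratio of successive values of 1 - CIn, which I take here as the measure of heterozygosis. The function name, the starting value of zero and the population size of 50 are illustrative assumptions only.

```python
def inbreeding_series(n_diploid, generations, ci0=0.0):
    """Iterate CI_n = 1/(2N) + (1 - 1/(2N)) * CI_(n-1), starting from ci0."""
    step = 1.0 / (2 * n_diploid)
    series = [ci0]
    for _ in range(generations):
        series.append(step + (1.0 - step) * series[-1])
    return series

n_diploid = 50
series = inbreeding_series(n_diploid, 20)

# Probability that two genes at a locus are NOT identical, per generation.
heterozygosis = [1.0 - ci for ci in series]

# Ratio of each generation's value to the previous generation's value.
ratios = [
    heterozygosis[i + 1] / heterozygosis[i]
    for i in range(len(heterozygosis) - 1)
]
print(ratios[0], ratios[-1], 1.0 - 1.0 / (2 * n_diploid))
```

Every ratio should agree with 1 - 1/2N up to rounding error, because subtracting both sides of the recursion from 1 gives 1 - CIn = (1 - 1/2N)(1 - CIn-1).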
It should however be clear that measuring drift by the decline of heterozygosis is not essential. If we wanted to study genetic drift in asexual haploid replicators, such as Y chromosomes, it would be possible to modify the derivation to use only coefficients of kinship, rather than inbreeding. More fundamentally, the process of genetic drift depends not on inbreeding but on the existence of variance in reproductive success. Some genes have no descendants, some have only one, and some have more than one. Over the course of time, more and more lines of descent die out, and the surviving genes are collectively descended from fewer and fewer original ancestors. In a sexually reproducing population this also leads to increased levels of inbreeding, in a broad sense. If there were no such variance in reproductive success - if every gene had exactly the same number of surviving 'offspring' - there would be no genetic drift.

Among diploids, the variance in replication of individual genes is due to two factors: the variance in the number of surviving offspring, and the random allocation of genes to gametes in the process of meiosis. Even if every diploid individual had exactly the same number of surviving offspring, there would still be variance in the replication of individual genes for the second reason. As for the variance in the number of offspring, the assumption of a Poisson distribution is probably not unreasonable in many species, but there could be departures from it in both directions (i.e. either greater or smaller variance). There might also be different variance in the two sexes. For example, among animals like Elephant Seals, the variance among females might be rather small, because all females have a low but steady rate of reproduction, whereas among males the variance would be much higher, as many males have no offspring at all, while a few have a large number.
Wright takes account of some of these factors in his discussions of 'effective population size'.

This note has only dealt with a few aspects of Wright's work on genetic drift. I have tried to identify the underlying assumptions and (in Note 3) to clarify Wright's most important derivation. None of this says anything one way or the other about the actual importance of genetic drift in evolution. What should be clear is that genetic drift is a weak force except in very small populations, since its effect is inversely proportional to population size. In large populations it would be overpowered by modest rates of selection or migration. (The other factor to consider is mutation, but except in large populations this is an even weaker force than drift, as mutation rates are typically of the order of only 1/100,000 per generation.) I hope to deal with some of these issues in further notes.

Note 1: Suppose we toss a single coin K times, where K is a large number. If the probability of heads is p, the total number of heads will be Kp and the average number of heads per toss will be Kp/K = p. But on each particular trial (the toss of a single coin) there can only be 1 or 0 heads, so we will have Kp trials with the deviation value (1 - p), and K(1 - p) trials with the deviation value (0 - p) = - p. Using the abbreviation q for (1 - p), the variance of the number of heads for trials consisting of a single coin toss is therefore [Kpq^2 + Kqp^2]/K = pq^2 + qp^2 = pq(q + p) = pq. It may seem odd to speak of the variance of the number of heads in trials where there is only one coin per trial, but in principle it is legitimate, and it enables us easily to derive the variance of the number of heads where the trials involve N coins. Since the variance of the sum of a number of independent numerical values equals the sum of the variances of the values individually, the variance of the number of heads in N independent coin tosses, each with variance pq, is simply Npq.
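As a check on Note 1, the variance of the number of heads can also be computed by brute force from the binomial probabilities, without using the rule about sums of independent variances. This is a minimal sketch; the function names and the values N = 20 and p = 0.3 are arbitrary choices of mine.

```python
from math import comb

def binomial_pmf(n, k, p):
    """Probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def variance_of_heads(n, p):
    """Variance of the number of heads, summed directly over all outcomes."""
    mean = sum(k * binomial_pmf(n, k, p) for k in range(n + 1))
    return sum(
        (k - mean) ** 2 * binomial_pmf(n, k, p) for k in range(n + 1)
    )

n, p = 20, 0.3
q = 1 - p
var_count = variance_of_heads(n, p)

# Dividing every deviation by N divides the variance by N squared.
var_proportion = var_count / n**2

print(var_count, n * p * q)
print(var_proportion, p * q / n)
```

The two numbers on each printed line should agree up to rounding error: Npq for the number of heads, and pq/N for the proportion of heads.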
Note 2: The average proportion of heads per trial of N coin tosses, each with probability p, is in the long term p. If X is the number of heads in any particular trial of N coins (where X is a variable), the deviation values of the proportions will be of the form X/N - p = (X - Np)/N, and the variance of the proportions in K trials will be S[((X - Np)/N)^2]/K = (1/N^2)S[(X - Np)^2]/K, where S denotes summation over the K trials. But S[(X - Np)^2]/K is the variance of the number of heads, which has been proved equal to Npq, so the variance of the proportion is Npq/N^2 = pq/N.

Note 3: This is a commentary on pages 108-110 of ESP, which reprints pages 107-109 of the original paper EMP (the near identity of pagination is just a coincidence). I will mainly be concerned with page 109 of ESP, where Wright derives his fundamental results for the decline of heterozygosis. In following the derivation it is necessary to refer back frequently to the definitions at the bottom of page 108.

Wright assumes that the sexes are separate (so there is no self-fertilisation) but that mating is otherwise completely random, including between siblings. He assumes that there are Nm breeding males and Nf breeding females. With random mating, he states that the proportion of matings between full siblings is 1/NmNf. This evidently assumes that there is a probability of 1/Nm that two mates have the same father, and an independent probability of 1/Nf that they have the same mother (note that m and f stand for male and female, not mother and father). This is actually a strong assumption, which ought to be clearly stated. It assumes (a) that the number of offspring of individuals follows a Poisson distribution (or something similar) and (b) that parents have male and female offspring in the same proportions as in the population generally. This is not necessarily true: for example if some parents had a strong bias towards producing male or female offspring, the probability of mating between siblings would be reduced.
(Wright does discuss some of these considerations in the section on 'The Population Number' at pp.111-12 of ESP.) Wright then gives the proportion of matings between half siblings, and between all less closely related individuals. These depend on the same assumptions as for full siblings.

He then gives a formula for M, the correlation between mates in the current generation. Note that the formula is of the form a'^2b'^2[Z], where Z is a complicated expression in square brackets. From the definitions on p.108 we have a'^2b'^2 = [1/2(1 + F')][(1 + F'')/2], so we have M = [1/2(1 + F')][(1 + F'')/2][Z]. The expression Z can be derived by Wright's method of path analysis. The first component of Z deals with the case of mating between full siblings. If we label the siblings A and B, and their parents C and D, we have two 'direct' paths, ACB and ADB, and two 'indirect' paths, ACDB and ADCB, which involve the correlation M' between mates in the previous generation. Hence the coefficient (2 + 2M') for the first component. For half siblings A and B, there is one shared parent C and two non-shared parents D and E, so there is one direct path, ACB, and the three indirect paths ADCB, ADEB, and ACEB, giving the coefficient (1 + 3M'). For unrelated mates A and B, with the non-shared parents C, D, E and G (to avoid using F, which is already in use), we have no direct paths and four indirect paths, ACGB, ACEB, ADEB, and ADGB, giving the coefficient 4M'.

Next Wright derives an expression for F, the correlation between uniting gametes in the current generation. Here we must note from p.108 that F = b^2M, and b^2 = (1 + F')/2. Using the expression M = [1/2(1 + F')][(1 + F'')/2][Z], we therefore have F = [(1 + F')/2][1/2(1 + F')][(1 + F'')/2][Z] = [(1 + F'')/8][Z]. With a little manipulation, and using the full expression for Z, this can be put in the form F = (1 + F'')[Nm + Nf - M'Nm - M'Nf + 4M'NmNf]/8NmNf.
But now we should note that M' is the correlation between mates in the previous generation. We can therefore adapt the equation F = b^2M to get the corresponding equation for the previous generation, i.e. F' = b'^2M'. But b'^2 = (1 + F'')/2, so F' = [(1 + F'')/2]M', and therefore M' = 2F'/(1 + F''). Substituting 2F'/(1 + F'') for M' in the equation F = (1 + F'')[Nm + Nf - M'Nm - M'Nf + 4M'NmNf]/8NmNf, it follows by some grinding but essentially routine algebra that F = Q, where Q is the expression on the right of the second equation on page 109. Then using the definition of P, P', etc, in terms of F, F', etc, the third equation also follows by routine algebra.

This leaves the final death-defying leap to the fourth equation. This is not helped by the puzzling statement that we can equate P/P' to P/P''. This would imply that the proportional change per generation was not just constant but zero, and P/P'' must surely be a misprint for P'/P''. (The fact that this horrible error is not corrected or commented on in the ESP reprint leaves me wondering how closely Provine, as editor, has followed the details of Wright's text.) But even with this correction, it is far from obvious how Wright derives his fourth equation. I had given up hope of solving it until I was reading volume 2 of Evolution and the Genetics of Populations (EGP), and found a discussion of the simpler case of random mating hermaphrodites, which fills in a few gaps in the derivation (see EGP vol 2, p.194-5). First, it confirms the suspicion that P/P'' should be P'/P''. Second it shows (or at least hints) how the problem can be reduced to a quadratic equation.

Taking these hints, we can apply them to the fourth equation on p.109. First, rearrange and simplify the third equation to get P - P'[1 - (Nm + Nf)/4NmNf] - P''(Nm + Nf)/8NmNf = 0. Then divide through by P'' to get P/P'' - (P'/P'')[1 - (Nm + Nf)/4NmNf] - (Nm + Nf)/8NmNf = 0. But by assumption P/P' = P'/P'', so P/P'' = (P'/P'')^2 = (P/P')^2.
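The assumption that P/P' settles down to a constant can be tested numerically before any algebra is done. The sketch below iterates the rearranged third equation in the form P = P'[1 - (Nm + Nf)/4NmNf] + P''(Nm + Nf)/8NmNf and compares the ratio of successive values with the larger root of the corresponding quadratic x^2 - [1 - (Nm + Nf)/4NmNf]x - (Nm + Nf)/8NmNf = 0. The function names, the starting values P = P' = 1 and the numbers of males and females are illustrative assumptions of mine, not Wright's.

```python
from math import sqrt

def drift_rate(n_males, n_females):
    """The quantity (Nm + Nf) / (8 Nm Nf)."""
    return (n_males + n_females) / (8.0 * n_males * n_females)

def heterozygosis_series(n_males, n_females, generations):
    """Iterate P = (1 - 2r) P' + r P'', assuming P = P' = 1 at the start."""
    r = drift_rate(n_males, n_females)
    p = [1.0, 1.0]
    for _ in range(generations):
        p.append((1.0 - 2.0 * r) * p[-1] + r * p[-2])
    return p

def larger_root(n_males, n_females):
    """Larger root of x^2 - (1 - 2r) x - r = 0."""
    r = drift_rate(n_males, n_females)
    return 0.5 * ((1.0 - 2.0 * r) + sqrt(1.0 + 4.0 * r * r))

series = heterozygosis_series(5, 20, 200)
ratio = series[-1] / series[-2]
print(ratio, larger_root(5, 20))
```

After a few generations the ratio of successive values should be indistinguishable from the larger root, which supports treating P/P' as constant. With equal numbers of males and females, 1 minus the root should come out very close to 1/2N.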
With this substitution, we can treat the equation as a quadratic of the form ax^2 + bx + c = 0, with x = P/P' (here a = 1, b = -[1 - (Nm + Nf)/4NmNf] and c = -(Nm + Nf)/8NmNf). This can be solved by the standard method to get (as the larger of the two roots) P/P' = (1/2)[1 - (Nm + Nf)/4NmNf] + (1/2)[root(1 + [(Nm + Nf)/4NmNf]^2)]. This is nearly Wright's fourth equation. For the final step, we take deltaP to mean P - P', so that - deltaP/P' = - (P/P' - 1). We therefore need only subtract 1 from the expression (1/2)[1 - (Nm + Nf)/4NmNf] + (1/2)[root(1 + [(Nm + Nf)/4NmNf]^2)], and then reverse the sign, to get Wright's fourth equation.

After this tortuous derivation, the discussion on page 110 of ESP is relatively plain sailing. The only slight puzzle is how Wright gets the approximation at the top of the page. I deduce that he uses the fact that when a is a small fraction, root(1 + a) is approximately equal to 1 + a/2. Taking [(Nm + Nf)/4NmNf]^2 as a, and grinding through the algebra, Wright's approximation can then be verified. Overall, as often with Wright's work, I am torn between admiration for his ingenuity and frustration at his obscurity.

References:
D. S. Falconer: Introduction to Quantitative Genetics, 3rd edn., 1989. (The 4th edn., by Falconer and Mackay (1995) appears to be the same so far as its treatment of genetic drift is concerned.)
John Maynard Smith: Evolutionary Genetics, 1989.
William B. Provine: Sewall Wright and Evolutionary Biology, 1986.
Sewall Wright: Evolution: Selected Papers, edited and with Introductory Materials by William B. Provine, 1986.
Sewall Wright: 'Evolution in Mendelian Populations', Genetics, 16, 1931, pp.97-159. (Reprinted at pp.98-160 of ESP.)
Sewall Wright: Evolution and the Genetics of Populations, 4 vols., 1968-1978.