Front page Thursday, January 11, 2007 Haldane's Selection Theorem Â  posted by DavidB @ 1/11/2007 02:08:00 AM  Razib has several times mentioned an important theorem of population genetics, credited to J. B. S. Haldane, that the probability of a single favourable mutation surviving and spreading throughout a population is approximately 2s, where s is its selective advantage relative to other alleles. So for example if a mutation has a selective advantage of 1 per cent, it will have approximately a 2 per cent chance of ultimately sweeping through the population. This may seem a low probability, but if the same mutation recurs more than a few dozen times in the species, it is virtually certain to be successful. A recent paper by John Hawks and Greg Cochran makes a central use of Haldane's theorem. As a matter of historical curiosity I wanted to see Haldane's original statement of this principle, and how he derived it. Hawks and Cochran give the citation 'Haldane, J. B. S. (1927): A mathematical theory of natural and artificial selection, part v: Selection and mutation. Proceedings of the Cambridge Philosophical Society, 28, 838-44.' I find that the correct volume number is 23, not 28. The same incorrect citation is given by Crow and Kimura's An Introduction to Population Genetics Theory (1970).The paper is also reprinted in Selected Genetics Papers of J. B. S. Haldane, edited with an introduction by Krishna R. Dronamraju, Garland Publishing, NY, 1990, at pp. 90-96. I have transcribed the key passage from the paper below. Incidentally, Haldane uses k, rather than s, for the selective advantage, but this is a trivial matter of notation. More important, Haldane's theorem only applies where the population is sufficiently large. If the breeding population n is very small (less than about 50) there is a significant chance that a single mutation will be fixed by genetic drift before selection has had much effect. For a single neutral or favourable mutation there is always a probability of at least 1/2n that it will ultimately be fixed, since in the absence of recurrent mutation one of the 2n genes in the population will ultimately be the ancestor of all surviving genes. It is interesting that Haldane himself does not claim great novelty for his 1927 theorem, but presents it as an application of the methods set out by R. A. Fisher in his 1922 paper 'On the dominance ratio'. Fisher himself later gave a more elaborate treatment of the same problem in a 1930 paper, 'The Distribution of Gene Ratios for Rare Mutations', Proceedings of the Royal Society of Edinburgh, 1930, 50, 205-20. (Like most of Fisher's papers, this is available online from the Fisher archives at Adelaide.) This includes the conclusion that:The probability of a mutant, enjoying a small selective advantage a, spreading until it establishes itself throughout the entire population is thus found to be 2a/(1 - e^-4an); it is easy to see that with an indefinitely large population, or in any case if 4an [population size x selective advantage] is large, this expression reduces to 2a. Thus a mutation conferring a selective advantage of 1 per cent will have practically a 2 per cent chance of establishing itself.In 1930 Fisher, rather characteristically, did not cite Haldane's 1927 treatment of the selection problem. Sewall Wright, in his 1931 paper on 'Evolution in Mendelian populations', states a similar proposition and credits Fisher, but not Haldane, with prior publication. This might raise a question whether Haldane or Fisher should be given the main credit for the theorem. As far as I can judge, Haldane does make a heavy use of Fisher's 1922 methods, but Fisher himself did not (in 1922) explicitly calculate the probability that an advantageous mutation would survive, and Haldane's application of Fisher's methods in 1927 was by no means as easy as falling off a log. I think therefore that Haldane does deserve the credit for the theorem itself.In any case, as often turns out in issues of priority, there are neglected earlier candidates. As early as 1873, the Rev H. W. Watson, in response to a request from his friend Francis Galton, had published a method for calculating the 'extinction of surnames' which contains much of the technique later used by Fisher for calculating the random extinction of genes. And even earlier, in 1845, a French mathematician called Bienayme had published a brief paper with results suggesting that he may have discovered some of the same methods, though he did not give details. (See Michael Bulmer, Francis Galton: Pioneer of Heredity and Biometry (2003), pp.156-60.) There is nothing to suggest that Fisher or Haldane knew of these earlier efforts. The following extract from Haldane's 1927 paper contains his proof of the theorem. I cannot reproduce all of the mathematical notation, so I will use S to indicate summation, x^n to indicate the n'th power of x, and x_a (etc) to indicate subscripts. I have added some comments (in square brackets) and notes in an attempt to explain the derivation, as far as I can follow it myself. No doubt much of this will be self-evident to expert mathematicians, but expert mathematicians have a tendency to underestimate the difficulty of following mathematical proofs for the rest of us. EXTRACT...the treatment of Fisher  is followed. In a large population let p_r be the chance that a factor [allele] present in a zygote at a given stage in the life-cycle will appear in r of its children in the next generation. If the individual considered is homozygous, this is the chance of leaving r children, if mutation is neglected. Let S p_r(x^r) = f(x) [note 1]. Therefore f(1) = 1, f(0) = p_0 [note 2], the probability of the factor disappearing, while f ' (1) = S rp_r, [note 3] i.e. the probable number of individuals possessing the factor in the next generation. The probability of m individuals bearing one each of the factors considered leaving r descendants is clearly the coefficient of x^r in [f(x)]^m, if we neglect the possibility of a mating between two such individuals, which we may legitimately do if m is small compared with the total number of the population. If then the probability of the factor being present in r zygotes of the nth generation be the coefficient of x^r in F(x), the corresponding probability in the (n + 1)th generation is the same coefficient in F[f(x)]. Hence if a single factor appears in one zygote, the probability of its presence in r zygotes after n generations is the coefficient of x in S(x) [S has subscript f and superscript n], i.e. f(f(f...f(x)...)), the operation being repeated n times [note 4]. The probability of its disappearance is therefore LtS(0) [note 5]. By Koenigs' theorem this is the root of x = f(x) in the neighbourhood of zero [note 6]. Now in the case of a dominant factor appearing in a population in equilibrium, and conferring an advantage measured by k, as in Part I(5) [of Haldane's series of papers] f ' (1) = 1 + k. Since f ' (x) and f '' (x) are positive, x = f(x) has two and only two real positive roots, one equal to unity, the other lying between 0 and 1, but near the latter value if k be small [note 7]. Hence any advantageous dominant factor which has once appeared has a finite chance of survival, however large the total population may be.If a large number of offspring is possible, as in most organisms, the series p_n approximates to a Poisson series, provided that adult organisms be counted, and sincef'(1) = 1 + k, f(x) = e^(1 + k)(x - 1). Hence the probability of extinction 1 - y is given by 1 - y = e^-(1 + k)y [note 8]. Hence (1 + k)y = -log(1 - y) [note 9]and k = y/2 + (y^2)/3 + (y^3)/4 + ... [note 10]and if k be small, y = 2k approximately [since the terms after y/2 are orders of magnitude smaller]. Hence an advantageous dominant gene has a probability 2k of survival after only a single appearance in an adult zygote, and if in the whole history of the species it appears more than log_e2/2k times [the natural logarithm of 2, divided by 2k] it will probably spread through the species. But, however large k may be, the factor may be extinguished after a single appearance. Thus, if k = 1, so that the new type probably leaves twice as many offspring as the normal, the probability of its extinction is still .203. If in any generation there are m dominant individuals the probability of extinction is reduced to y^m, where y is the smaller positive root of x = f(x). When k is small this reduces to (1 - 2k)^m. Hence if in any generation more than log_e2/2k adult dominants exist, the factor will probably spread through the whole population. NOTES [Note 1] Summation is over values of the subscript r from zero to infinity. Since r is the number of an individual's offspring, it can only take integer values. [Note 2] The sum is p_0(x^0) + p_1(x^1) + p_2(x^2) + p_3(x^3)... For the value x = 1 this equals p_0 + p_1 + p_2 + p_3... , which sums to 1, since the probabilities are mutually exclusive and cover all possible outcomes. For the value x = 0 the terms from p_1(x^1) onward are all equal to zero, so the series reduces to p_0(0^0), which = p_0 if we accept that 0^0 = 1. I am not sure that there is any agreed convention on this point, as the usual proofs that x^0 = 1 (e.g. 1 = (x^a)/(x^a) = x^(a - a) = x^0) break down for the case x = 0. Fisher, in his formulation of the series, presents it as p_0 + p_1(x^1) + p_2(x^2) + p_3(x^3), which avoids any doubt on this point.[Note 3] Summation is over values of r from zero to infinity. The expression f ' (1) evidently means the first derivative of the function f(x) with respect to x, for the value x = 1. Each term in the sum f(x) has the form p_r(x^r), so by elementary calculus its first derivative is rp_r(x^(r - 1)), and the first derivative of the sum f(x) as a whole is the sum of the first derivatives of its component terms. But for the value x = 1 these each reduce to rp_r, so the first derivative of f(x) is equal to S rp_r, as in Haldane's formula. [Note 4] Haldane's treatment here is closely based on Fisher, but uses different notation.[Note 5] Lt is the limit as n goes to infinity, in other words as the function is iterated indefinitely with its own previous value as argument. [Note 6] This part of the derivation evidently requires advanced knowledge of the theory of functions, and I can only take it on trust. Haldane gives the cryptic reference 'Koenigs. Darb. Bull. (2) 7, p.340, 1883.' I wondered if 'Koenigs' might be an error for 'Koenig', possibly Julius Koenig, but on consulting the Dictionary of Scientific Biography I found an entry for a French mathematician, Gabriel Koenigs (1858-1931). Among his many accomplishments, according to the DSB, Koenigs was 'one of the first to take an interest in iteration theory'. Koenigs was also a pupil of Gaston Darboux, founder and editor of the Bulletin des Sciences Mathematiques et Astronomiques, sometimes known as 'Darboux's Bulletin'. So Haldane's cryptic reference to 'Darb. Bull.' is probably to an article by Gabriel Koenigs in the Bulletin des Sciences Mathematiques et Astronomiques, 2nd series, volume 7, 1883, at p.340. I find that there is indeed an article by Koenigs, 'Recherches sur les substitutions uniformes', at the cited place, but its mathematics is far beyond me.[Note 7] Haldane does not consider the trivial cases p_0 = 1, where the gene is always extinguished immediately, or p_1 = 1, where it always produces one and only one offspring per generation. In the first case f(x) would have the value 1 for all values of x, while in the second case it would have the value x for all values of x. For non-trivial cases the value of the function f(x) increases continually and at an increasing rate with increasing x, so the first and second derivatives of f(x) are positive. Haldane also states that the equation x = f(x) has two and only two real positive roots, one of which is 1, and the other between 0 and 1. I am not sure I could give a rigorous proof of this, but I suggest the following approach. If we plot a graph of the function y = f(x), for real positive values of x, then f(x) = y = x wherever the curve representing the function touches or intersects a line through the origin at 45 degrees to the axes. As f ' and f '' (the first and second derivatives) are both positive, the curve is concave upwards, i.e. it has just one 'bend', with the inner part of the bend on its upper side. The curve must start from the y axis at the value p_0, since we know that f(0) = p_0, which is somewhere between 0 and 1. We also know that f(1) = 1, so the curve must touch or intersect the 45 degree line at x = y = 1, which is accordingly a real positive root of x = f(x). The curve must therefore pass in some way from the point (x = 0, y = p_0), which is above the 45-degree line, to the point (x = 1, y = 1), which is on that line. As the curve is concave upward, it cannot touch or intersect a straight line in more than two points. Since it touches or intersects the 45-degree line at the point (x = 1, y = 1), there are three possible ways of doing so: (a) it is a tangent to the line at that point; (b) it intersects the line at that point from above the line (in its approach from its starting point on the y axis); or (c) it intersects the line from below the line. But it cannot be a tangent to the line at that point, because the first derivative of the function would then be equal to the slope of the line, which is 1, whereas by assumption the first derivative at that point is 1 + k, with non-zero k. Possibility (a) is therefore excluded. Possibility (b) can be excluded for a similar reason. If the curve intersects the 45-degree line from above, a tangent to the curve at that point must make an angle of less than 45 degrees with the x axis. But this means that the first derivative of the function at the point of intersection must be less than 1, whereas by assumption it is greater than 1. This leaves only possibility (c), that the curve intersects the line from below, which is consistent with the fact that the first derivative is greater than 1, so that a tangent to the curve would have an angle greater than 45 degrees. But this implies that in passing from (x = 0, y = p_0), which is above the line, to (x = 1, y = 1), it has first intersected the line at some other value of x between 0 and 1, which is another real positive root of x = f(x), as required by Haldane's proof. [Note 8] Here Haldane follows Fisher closely, but with a significant modification. Fisher (1922) had given the formula f(x) = e^m(x - 1), where m:1 is the ratio of the total population in generation (n + 1) to the population in generation n; in other words he allows for the growth of a neutral allele in line with any expansion of the population as a whole. Haldane's significant innovation is to allow for genes which increase in numbers because of a selective advantage, k, relative to other genes in a static population. Fisher's f(x) = e^m(x - 1) then becomes Haldane's f(x) = e^(1 + k)(x - 1), and Haldane is able to find an explicit expression for the chance of survival in terms of the selective advantage k (see Note 9). [Note 9] Haldane denotes the probability of the gene surviving as y, so the probability of extinction is 1 - y. But the probability of extinction is the appropriate root of the equation x = f(x) = e^(1 + k)(x - 1). So substituting 1 - y for x we get 1 - y = e^-(1 + k)y. In the right hand side of this equation the expression -(1 + k)y is the (natural) logarithm of the left hand side, which is 1 - y, so, taking the negative of both sides, (1 + k)y = -log(1 - y).[Note 10] Here Haldane assumes knowledge of a theorem equivalent to -log(1 - y) = y + (y^2)/2 + (y^3)/3 + (y^4)/4: see this Wiki article on logarithms. Given this theorem, we have(1 + k)y = -log(1 - y) = y + (y^2)/2 + (y^3)/3 + (y^4)/4 ...therefore, dividing both sides by y and then subtracting 1 from both sides, we get k = (y^2)/2y + (y^3)/3y + (y^4)/4y ... = y/2 + (y^2)/3 + (y^3)/4 + ...as stated by Haldane. input{width:180px; border: thin #000000 inset;} .sub{width:180px; border: thin #000000 inset;}  Razib's Home Page GNXP Archives Interviews Blogroll Principles of Population Genetics Genetics of Populations Molecular Evolution Quantitative Genetics Evolutionary Quantitative Genetics Evolutionary Genetics Evolution Molecular Markers, Natural History, and Evolution The Genetics of Human Populations Genetics and Analysis of Quantitative Traits Epistasis and Evolutionary Process Evolutionary Human Genetics Biometry Mathematical Models in Biology Speciation Evolutionary Genetics: Case Studies and Concepts Narrow Roads of Gene Land 1 Narrow Roads of Gene Land 2 Narrow Roads of Gene Land 3 Statistical Methods in Molecular Evolution The History and Geography of Human Genes Population Genetics and Microevolutionary Theory Population Genetics, Molecular Evolution, and the Neutral Theory Genetical Theory of Natural Selection Evolution and the Genetics of Populations Genetics and Origins of Species Tempo and Mode in Evolution Causes of Evolution Evolution The Great Human Diasporas Bones, Stones and Molecules Natural Selection and Social Theory Journey of Man Mapping Human History The Seven Daughters of Eve Evolution for Everyone Why Sex Matters Mother Nature Grooming, Gossip, and the Evolution of Language Genome R.A. Fisher, the Life of a Scientist Sewall Wright and Evolutionary Biology Origins of Theoretical Population Genetics A Reason for Everything The Ancestor's Tale Dragon Bone Hill Endless Forms Most Beautiful The Selfish Gene Adaptation and Natural Selection Nature via Nurture The Symbolic Species The Imitation Factor The Red Queen Out of Thin Air Mutants Evolutionary Dynamics The Origin of Species The Descent of Man Age of Abundance The Darwin Wars The Evolutionists The Creationists Of Moths and Men The Language Instinct How We Decide Predictably Irrational The Black Swan Fooled By Randomness Descartes' Baby Religion Explained In Gods We Trust Darwin's Cathedral A Theory of Religion The Meme Machine Synaptic Self The Mating Mind A Separate Creation The Number Sense The 10,000 Year Explosion The Math Gene Explaining Culture Origin and Evolution of Cultures Dawn of Human Culture The Origins of Virtue Prehistory of the Mind The Nurture Assumption The Moral Animal Born That Way No Two Alike Sociobiology Survival of the Prettiest The Blank Slate The g Factor The Origin Of The Mind Unto Others Defenders of the Truth The Cultural Origins of Human Cognition Before the Dawn Behavioral Genetics in the Postgenomic Era The Essential Difference Geography of Thought The Classical World The Fall of the Roman Empire The Fall of Rome History of Rome How Rome Fell The Making of a Christian Aristoracy The Rise of Western Christendom Keepers of the Keys of Heaven A History of the Byzantine State and Society Europe After Rome The Germanization of Early Medieval Christianity The Barbarian Conversion A History of Christianity God's War Infidels Fourth Crusade and the Sack of Constantinople The Sacred Chain Divided by the Faith Europe The Reformation Pursuit of Glory Albion's Seed 1848 Postwar From Plato to Nato China: A New History China in World History Genghis Khan and the Making of the Modern World Children of the Revolution When Baghdad Ruled the Muslim World The Great Arab Conquests After Tamerlane A History of Iran The Horse, the Wheel, and Language A World History Guns, Germs, and Steel The Human Web Plagues and Peoples 1491 A Concise Economic History of the World Power and Plenty A Splendid Exchange Contours of the World Economy 1-2030 AD Knowledge and the Wealth of Nations A Farewell to Alms The Ascent of Money The Great Divergence Clash of Extremes War and Peace and War Historical Dynamics The Age of Lincoln The Great Upheaval What Hath God Wrought Freedom Just Around the Corner Throes of Democracy Grand New Party A Beautiful Math When Genius Failed Catholicism and Freedom American Judaism Search this site: (2005 - present, Blogger Archives) Hello Movable Type archives August 11,2002 August 18,2002 August 25,2002 September 01,2002 September 15,2002 October 20,2002 December 08,2002 December 22,2002 December 29,2002 January 05,2003 January 12,2003 January 19,2003 January 26,2003 February 02,2003 February 09,2003 February 16,2003 February 23,2003 March 02,2003 March 09,2003 March 16,2003 March 23,2003 March 30,2003 April 06,2003 April 13,2003 April 20,2003 April 27,2003 May 04,2003 May 11,2003 May 18,2003 May 25,2003 June 01,2003 June 08,2003 June 15,2003 June 22,2003 June 29,2003 July 06,2003 July 13,2003 July 20,2003 July 27,2003 August 03,2003 August 10,2003 August 17,2003 August 24,2003 August 31,2003 September 07,2003 September 14,2003 September 21,2003 September 28,2003 October 05,2003 October 12,2003 October 19,2003 October 26,2003 November 02,2003 November 09,2003 November 16,2003 November 23,2003 November 30,2003 December 07,2003 December 14,2003 December 21,2003 December 28,2003 January 04,2004 January 11,2004 January 18,2004 January 25,2004 February 01,2004 February 08,2004 February 15,2004 February 22,2004 February 29,2004 March 07,2004 March 14,2004 March 21,2004 March 28,2004 April 04,2004 April 11,2004 April 18,2004 April 25,2004 May 02,2004 May 09,2004 May 16,2004 May 23,2004 May 30,2004 June 06,2004 June 13,2004 June 20,2004 June 27,2004 July 04,2004 July 11,2004 July 18,2004 July 25,2004 August 01,2004 August 08,2004 August 15,2004 August 22,2004 August 29,2004 September 05,2004 September 12,2004 September 19,2004 September 26,2004 October 03,2004 October 10,2004 October 17,2004 October 24,2004 October 31,2004 November 07,2004 November 14,2004 November 21,2004 November 28,2004 December 05,2004 December 12,2004 December 19,2004 December 26,2004 January 02,2005 January 09,2005 January 16,2005 January 23,2005 January 30,2005 February 06,2005 February 13,2005 February 20,2005 February 27,2005 March 06,2005 March 13,2005 March 20,2005 March 27,2005 April 03,2005 April 10,2005 April 17,2005 April 24,2005 May 01,2005 May 08,2005 May 15,2005 May 22,2005 May 29,2005 June 05,2005 June 12,2005 June 19,2005 June 26,2005 July 03,2005 July 17,2005 August 07,2005 Blogspot archives June 2002 July 2002 August 2002 September 2002 October 2002 November 2002 December 2002