Friday, April 11, 2008

Notes on Sewall Wright: the Measurement of Kinship   posted by DavidB @ 4/11/2008 03:27:00 AM

Most people with an interest in genetics will be aware that Sewall Wright made major contributions to the theory of kinship or relatedness. Fewer people will have any direct knowledge of his work on the subject, and those who do consult his writings may find them difficult. The present note is intended to help those who want to tackle Wright at first hand. See also this evaluation by the geneticist W. G. Hill.


Most of Wright's key ideas on the subject were first presented in a 5-part paper on 'Systems of Mating' (SM) in 1921. All 5 parts can be found on the internet with a little searching. SM1, which is the most fundamental, is here, and SM5, which contains a relatively un-technical summary, is here.


Rather than go straight to Wright's own approach, I will begin by comparing and contrasting it with that of the French geneticist Gustave Malecot, based on the concept of Identity by Descent. Malecot first introduced his methods around 1940, and since then they have supplanted Wright's approach, to the extent that Wright's own methods have been almost forgotten. What is presented in textbooks as due to Wright is often in reality due to Malecot. The two approaches do have some similarities, and in simple cases they lead to the same quantitative results, but there are also some important differences.



Malecot and Identity by Descent

In Malecot's system two genes at the same locus, in the same or different individuals, are defined as Identical by Descent (IBD) if they are both descended from the very same individual ancestral gene, without either of them undergoing mutation in the interim. The relatedness between two individuals can be measured, roughly speaking, by calculating the probability that two genes at the same locus in the two individuals are IBD. To do this it is necessary first to identify all the distinct paths of descent connecting the two individuals through a common ancestor, and then to calculate the probability that the same gene will have descended to both individuals from that ancestor along any given path. Since all such paths of descent are mutually exclusive (though portions of them may overlap), the resulting probabilities can be added together to give the total probability that a given gene in the two individuals is IBD. To take a simple case, consider two individuals (full siblings) who have both parents in common. I assume that the parents are not related to each other or inbred. If we select a (diploid autosomal) gene at random from one sibling, there is a probability of one-half that it comes from the mother, and, if it does, a probability of one-half that the same gene has descended from the mother to the other sibling. This gives a compound probability of one-quarter that the second sibling has received a gene from the mother that is IBD to the selected gene in the first sibling. There is likewise a probability of one-quarter that the second sibling has received an IBD copy from the father. The total probability is therefore one-half, which is often called the Coefficient of Relationship or Relatedness between full siblings. If the parents are themselves related or inbred (i.e. descended from one of their own ancestors by more than one possible path), additional paths of descent need to be taken into account. 
Since there are two genes at the relevant locus in the second sibling, there is a probability of one-quarter (one-half times one-half) that a particular one of these genes, chosen at random, is IBD to the selected gene in the first sibling. This is usually known as their Coefficient of Kinship. If a male and female with a non-zero Coefficient of Kinship mate together, there is a non-zero probability that any offspring will inherit two genes that are IBD to each other. This is usually known as the offspring's Coefficient of Inbreeding, and a little consideration shows that it is equal to the Coefficient of Kinship of the parents.
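
As a rough illustration, the full-sibling calculation above can be sketched in Python. The pedigree, the names in it and the functions are hypothetical, and the founders are assumed to be unrelated and not inbred; this is one standard recursive way of computing a Coefficient of Kinship, not a reconstruction of Malecot's own procedure.

```python
from functools import lru_cache

# Hypothetical pedigree: each individual maps to (mother, father).
# Founders map to (None, None) and are assumed unrelated and not inbred.
PEDIGREE = {
    "mother": (None, None),
    "father": (None, None),
    "sib1": ("mother", "father"),
    "sib2": ("mother", "father"),
    "child": ("sib1", "sib2"),  # offspring of a brother-sister mating
}

# Parents are listed before their offspring, so position gives birth order.
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(x, y):
    """Probability that a random gene from x and a random gene from y are IBD."""
    if x is None or y is None:
        return 0.0
    if x == y:
        mother, father = PEDIGREE[x]
        # Either the same gene is drawn twice (probability 1/2), or the two
        # different genes are drawn, which are IBD only if x is inbred.
        return 0.5 * (1.0 + kinship(mother, father))
    # Recurse through the parents of the younger individual.
    if ORDER[x] < ORDER[y]:
        x, y = y, x
    mother, father = PEDIGREE[x]
    return 0.5 * (kinship(mother, y) + kinship(father, y))


def inbreeding(x):
    """Coefficient of Inbreeding: the Coefficient of Kinship of x's parents."""
    mother, father = PEDIGREE[x]
    return kinship(mother, father)


print(kinship("sib1", "sib2"))      # Coefficient of Kinship of full siblings
print(2 * kinship("sib1", "sib2"))  # Coefficient of Relationship (non-inbred case)
print(inbreeding("child"))          # offspring of the two siblings
```

Doubling the Coefficient of Kinship to obtain the Coefficient of Relationship is valid here only because neither sibling is inbred.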

A point left vague in some accounts is how far back the paths of ancestry can or should be traced. There would be little point in tracing them back so far that the gene would probably have mutated along the way to one or both descendants, but with a mutation rate of only about 1 in 100,000 per generation this is not a major constraint. In practice, ancestry is seldom traced back beyond five or six generations, as the probability of Identity by Descent along any given path going beyond this is very small (less than 1 in 1,000), and the aggregate probability along all such paths will usually be much the same for all individuals in the same population.
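
The arithmetic behind these figures can be checked with a short Python sketch. It assumes, for illustration only, a path that runs up to a common ancestor the same number of generations back on each side and down again, with the probability halving at each parent-offspring link; the function names are hypothetical.

```python
MUTATION_RATE = 1e-5  # per gene per generation, the rough figure quoted above


def path_ibd_probability(links):
    """Probability that the same gene copy is passed along every link of a path."""
    return 0.5 ** links


def no_mutation_probability(links, rate=MUTATION_RATE):
    """Probability that no mutation occurs anywhere along the path."""
    return (1.0 - rate) ** links


for generations_back in (1, 2, 3, 5, 6):
    links = 2 * generations_back  # up to the common ancestor and down again
    print(generations_back, links,
          path_ibd_probability(links),
          no_mutation_probability(links))
```

With the common ancestor five generations back the path has ten links and a probability of 1 in 1,024, while the chance of a mutation along the same path is still only about 1 in 10,000.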

Wright and the Correlation between Relatives

None of this is directly due to Sewall Wright. He does use path diagrams similar to those of Malecot (who was inspired by Wright's work), but the quantities measured along the paths are not probabilities of Identity by Descent but path coefficients. As discussed in my note on Wright's method of Path Analysis, the correlation between two variables can be derived from the path coefficients along the paths connecting them. The measures of relationship between two individuals in Wright's system are always in principle correlation coefficients. In simple cases (no inbreeding, no dominance, no assortative mating, and so on) they are quantitatively the same as Malecot's measures, but in principle they are quite different. Three important differences should be emphasised:

a) like all correlation coefficients, Wright's measures of relationship are valid only relative to a specified statistical population. The coefficient of relationship between two individuals may well vary according to the specified population; e.g. it may be different if the specified population is an ethnic group to which the individuals belong as compared with a population comprising several ethnic groups.

b) unlike probabilities, which are never negative, a correlation coefficient can be either positive or negative. In fact, although Wright seldom discusses negative relationships, within any specified population they are in principle as common as positive relationships.

c) relative to any specified population, the correlation between two randomly selected individuals from that population is zero (apart from sampling error). This point has sometimes been overlooked, for example in discussions of Hamilton's Rule. The 'r' in Hamilton's Rule should be a regression coefficient rather than a correlation coefficient (as Hamilton realised around 1970 - see Narrow Roads of Gene Land, vol. 1, p.179), but the same principle applies: the regression of one randomly selected individual on another randomly selected individual, relative to the population from which they are randomly selected, is approximately zero. Hamilton's Rule therefore predicts that altruistic behaviour will not be directed randomly towards all members of the relevant population, though it may be difficult to decide which population is 'relevant' for the purpose.

I emphasise these points partly because Wright himself does not. They are implicit in the use of correlation coefficients, but Wright seldom explicitly mentions them. An exception is in SM5, where Wright points out that the correlation between relatives within an inbred line will be small although relative to the wider population it is large. Some more general statements are made in Wright's late work on Evolution and the Genetics of Populations (EGP). In volume 2 of that work (1969) he says that 'In a panmictic [randomly mating] population, there is no correlation between homologous genes of uniting gametes relative to the gene frequencies in the whole population. On splitting up into small lines which breed within themselves, a correlation between uniting gametes is to be expected.... The relativity referred to above has sometimes been overlooked or misinterpreted. A correlation coefficient is, of course, always relative. It is a property of the population as well as the two variables....' (pp.175-77.) Wright goes on to discuss Malecot's method of Identity by Descent. He accepts that it is a useful technique and often leads to the same results as his own, but argues that his own approach is more general and in particular that his own concept of relationship allows for negative values.
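
The relativity Wright describes can be illustrated with a small Python sketch. It assumes two equally sized lines, each mating at random within itself, with hypothetical frequencies of allele A, and gives the gametes the notional values 1 (A) and 0 (a). The correlation between uniting gametes is then computed relative to one line and relative to the pooled population.

```python
def random_mating_pairs(q):
    """Frequencies of uniting gamete pairs (A = 1, a = 0) in one randomly mating line."""
    return {(1, 1): q * q, (1, 0): q * (1 - q),
            (0, 1): (1 - q) * q, (0, 0): (1 - q) ** 2}


def pool(lines):
    """Pair frequencies in a population made up of equally sized lines."""
    pooled = {}
    for q in lines:
        for pair, freq in random_mating_pairs(q).items():
            pooled[pair] = pooled.get(pair, 0.0) + freq / len(lines)
    return pooled


def gamete_correlation(pairs):
    """Pearson correlation between the two uniting gametes."""
    mean_x = sum(f * x for (x, y), f in pairs.items())
    mean_y = sum(f * y for (x, y), f in pairs.items())
    var_x = sum(f * (x - mean_x) ** 2 for (x, y), f in pairs.items())
    var_y = sum(f * (y - mean_y) ** 2 for (x, y), f in pairs.items())
    cov = sum(f * (x - mean_x) * (y - mean_y) for (x, y), f in pairs.items())
    return cov / (var_x * var_y) ** 0.5


LINES = [0.2, 0.8]  # hypothetical frequencies of A in two isolated lines

within = gamete_correlation(random_mating_pairs(LINES[0]))  # relative to one line
overall = gamete_correlation(pool(LINES))                   # relative to the whole
print(within, overall)
```

The same matings give a correlation of zero (apart from rounding) relative to the line in which they occur, and a positive correlation relative to the pooled population.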

Wright is often vague about the population in which the correlations are to be measured, leaving this to be inferred from the context. Sometimes the relevant population is the entire generation to which the correlated individuals belong, sometimes it is a defined sub-population, but sometimes it seems to be a 'foundation stock' from which they are descended. This is problematic, as it seems to require a correlation between individuals relative to the means and standard deviations in a population to which they do not themselves belong. I will discuss this further in dealing with Wright's work on inbreeding and genetic diversity.

Correlations between notional values

Wright was not the first person to work on the correlation between relatives. Unknown to Wright, R. A. Fisher had already treated the subject at length, by different methods, in 1918. In fact, the subject goes back at least to 1904, when Karl Pearson considered the correlations to be expected on the hypothesis of Mendelian dominant inheritance. He found that (on certain simplified assumptions) the correlation between parent and offspring would be only one-third, rather than the correlation of about one-half usually found in empirical data on human traits. Pearson considered this a serious objection to the generality of the Mendelian theory. One of the aims of Fisher's 1918 paper was to show that, when complications such as assortative mating were taken into account, the data were consistent with widespread Mendelian dominance.

The idea of a correlation between relatives is intelligible enough when the correlation involves continuous phenotypic traits such as height, but it is more obscure when the traits are purely qualitative, or when the correlation is not between phenotypes but between gametes or genotypes. If there are varying types of gametes or genotypes (e.g. different alleles at a locus) in the population, they may be said to be positively associated if the same types tend to occur together, more often than would be expected by chance, in the same individual or in certain pairs of individuals. There are several useful measures of the 'association' of qualitative variables (see any edition of G. U. Yule's Introduction to the Theory of Statistics). However, Wright (like his predecessors) preferred to use the Pearson product-moment correlation coefficient. To obtain a Pearson correlation coefficient in the case of purely qualitative variables, such as differences between alleles, it is necessary to give the correlated items notional algebraic or numerical values. Since these are to some extent arbitrary, it might be feared that this would introduce an arbitrary element into the results, but in the cases of interest the arbitrary values cancel out and leave the correlation coefficient itself unaffected.

The procedure can be illustrated by the problem of dominance, which is treated by Wright in SM1, pages 117-8. If we assign the homozygotes AA and aa the arbitrary values 1 and 0 respectively, then in the case of complete dominance of A the heterozygote Aa will have the value 1, while in the case of zero dominance it will have the value 1/2. Each individual in the population will therefore have a pair of numerical values, under the assumptions of dominance and non-dominance respectively. For homozygotes the two values will be the same but for heterozygotes they will be different. If the frequencies of the various genotypes in the population are specified, the means and standard deviations of the numerical values can be calculated, and the covariance and the correlation coefficient between the pairs of values can then be derived in the usual way. The correlation coefficient will be unaffected if one or both variables are systematically multiplied by a (positive) constant or have a constant added to them (see Notes on Correlation, Part 2). This entails that we would get the same correlation if we chose any other set of arbitrary values as alternatives to 0 and 1, provided the value of the heterozygote in the absence of dominance is half-way between that of the homozygotes. We can therefore obtain a quite general result for the correlation between the values of genotypes with and without dominance. (Of course, correlations could be calculated in a similar way on different assumptions about dominance, e.g. for partial dominance.) It can be shown by this method that Wright's results at the bottom of page 117 are correct, though I do not see how Wright derived his particular formulae, which are far from obvious. [As I have mentioned elsewhere, the equation p = root-uv appears to be a printing error or slip of the pen, as under Hardy-Weinberg equilibrium it should be p = 2root-uv. In fact, I now find that this error was listed in the printed Corrigenda to the relevant volume of Genetics but has not been corrected in the pdf copy.]
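
The procedure just described can be sketched in Python. The function names and the parameters low and high (the notional values given to aa and AA) are hypothetical. The sketch computes the correlation between genotypic values with and without dominance directly from the genotype frequencies, and the checks after it cover only the symmetric case in which the two homozygotes are equally frequent, which is an assumption made here for simplicity.

```python
from math import sqrt


def correlation(xs, ys, freqs):
    """Pearson correlation of two sets of values weighted by genotype frequencies."""
    mean_x = sum(f * x for x, f in zip(xs, freqs))
    mean_y = sum(f * y for y, f in zip(ys, freqs))
    var_x = sum(f * (x - mean_x) ** 2 for x, f in zip(xs, freqs))
    var_y = sum(f * (y - mean_y) ** 2 for y, f in zip(ys, freqs))
    cov = sum(f * (x - mean_x) * (y - mean_y) for x, y, f in zip(xs, ys, freqs))
    return cov / sqrt(var_x * var_y)


def dominance_correlation(u, p, v, low=0.0, high=1.0):
    """Correlation between genotypic values without and with dominance.

    u, p, v are the frequencies of AA, Aa and aa; low and high are the
    arbitrary notional values given to aa and AA.
    """
    freqs = (u, p, v)
    without_dominance = (high, (low + high) / 2, low)  # heterozygote half-way
    with_dominance = (high, high, low)                 # A completely dominant
    return correlation(without_dominance, with_dominance, freqs)


# Random mating with equal allele frequencies: u = v = 1/4, p = 1/2.
print(dominance_correlation(0.25, 0.5, 0.25))
# Same frequencies, different arbitrary values: the correlation is unchanged.
print(dominance_correlation(0.25, 0.5, 0.25, low=-3.0, high=7.0))
# The value root-1/(1+p) for comparison.
print(sqrt(1 / (1 + 0.5)))
```

In this symmetric case all three printed values agree, which illustrates both the cancelling out of the arbitrary values and the figure root-1/(1+p) with p the proportion of heterozygotes.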

Systems of Mating I

I will conclude this note with some comments on Wright's most important paper on the subject: the first in the series on Systems of Mating (SM1).

Here Wright uses his method of path analysis to derive the correlation between relatives. In principle the ultimate result is a correlation between phenotypes, which should take account of all environmental and genetic influences, including dominance, epistasis, assortative mating, and shared environment (if any).

While the method of path analysis has some advantages for this purpose, which Wright emphasised, it also has some disadvantages. The variability among individuals is partly due to the chance effects of genetic recombination and segregation. It is therefore necessary for the path diagrams to contain an independent variable designated as 'chance' (see the diagram in SM1, p.116), which may be formally justified but still looks odd. More importantly, the method of path analysis assumes that the effects of causal influences can be simply added together. In genetics this is not always the case, as the effects of epistasis and dominance are not purely additive. Wright therefore excludes epistasis from his model 'for the present' (p.117). He does attempt to incorporate an adjustment for the effects of dominance, but this is not entirely successful. For the time being I will assume that the method is confined to the additive effects of genes.

It is not always clear what the relevant population is for the purposes of the correlations, especially as more than one generation of individuals is often involved. Wright seems to assume (see the beginning of SM4) that in the absence of selection the proportions of different alleles in the total population will be constant, but in a finite population this cannot be strictly true, as there will be fluctuation due to genetic drift. Perhaps Wright is assuming for the purpose that the population can be regarded as indefinitely large. In this case it is legitimate to assume that gene frequencies in the absence of selection are constant. More seriously, it is not clear whether the intended reference population is the current population of each generation, the 'foundation stock' from which they are descended, or some combination of the two. Wright's reference to 'random mating' at the top of page 119 of SM1 would not make much sense if the intended reference population is the current one (of the parents), since f' would then always be zero.

Each path of descent is built up from the links between parent and offspring, so this relationship is especially important. In Wright's analysis (pages 118-20) the direct relationship between parent and offspring can be analysed as a path with the following steps: parent's phenotype - parent's genotype - gamete (egg or sperm) - offspring's genotype - offspring's phenotype. (If the offspring's two parents have a non-zero correlation, an indirect path via the other parent also needs to be considered.) The path coefficients along the direct path from parent to offspring can be represented in the form hbah, where each h represents the correlation between phenotype and genotype, in the parent and in the offspring respectively (the two values may be different). This correlation coefficient can be considered a measure of broad heritability, that is, the extent to which the individual's phenotype is determined by the genotype. Its square, h^2, measures the proportion of phenotypic variance accounted for by genetic variance. This is historically the origin of the familiar use of h^2 to represent heritability. It should however be noted that Wright's usage is not quite the same as the modern one. In modern usage h^2 usually stands for narrow or additive heritability, measured by the extent to which the offspring predictably resemble the parents. Wright's h^2 is closer to the modern concept of broad heritability, as it measures the extent to which the phenotype of an individual is determined by its genotype. The key equation (p.116) is h^2 + d^2 + e^2 = 1, where h stands for all aspects of genetic heredity, and e and d stand for predictable effects of the environment and random fluctuations in development.

The coefficients a and b are the path coefficients representing, respectively, the contribution of the gamete (egg or sperm) to the variance in the genotype of the offspring, and the contribution of the parental genotype to the variance in the gametes. As none of these entities have a measurable phenotypic value, it is necessary to assume that they have arbitrary algebraic or numerical values, in the way discussed above. Wright's derivation of the values of a and b (SM1, pp.118-19) is particularly important, and needs to be carefully studied. Unfortunately it is not easy to follow. I would offer two tips. First, it is essential to refer frequently to the path diagram on page 116, without which the derivation would be unintelligible. Second, Wright does not explain why pG.H'' = rG.H'', which is crucial to the validity of the proof. I think it follows from the fact that the only causal path from the parental genotype to the gamete is the direct path pG.H''. [Added: having written this, I am pleased to find that Wright gives this explanation in another article.]

It should be noted that if the parents are unrelated and not inbred, a and b are both equal to root-1/2, so the product ab along the path from parent to offspring in this case equals one-half, as in Malecot's method.
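
As a small worked example, the product of path coefficients along the direct path from parent to offspring can be written out in Python. The function name and the heritability figure of 0.8 are hypothetical, and the sketch assumes that the parents are unrelated and not inbred, so that there is no indirect path through the other parent.

```python
from math import sqrt, isclose


def parent_offspring_correlation(h_parent, h_offspring, a=sqrt(0.5), b=sqrt(0.5)):
    """Product h * b * a * h' of the path coefficients along the direct path.

    a: path coefficient from gamete to offspring genotype.
    b: path coefficient from parental genotype to gamete.
    The defaults are the values for unrelated, non-inbred parents.
    """
    return h_parent * b * a * h_offspring


# Phenotype fully determined by genotype (h = 1 in both generations).
print(parent_offspring_correlation(1.0, 1.0))

# Hypothetical case with h^2 = 0.8 in both parent and offspring.
h = sqrt(0.8)
print(parent_offspring_correlation(h, h))

# The two results equal 0.5 and 0.4 up to floating-point rounding.
print(isclose(parent_offspring_correlation(1.0, 1.0), 0.5),
      isclose(parent_offspring_correlation(h, h), 0.4))
```

With h = 1 the product reduces to ab, i.e. one-half, and with h^2 = 0.8 in both generations it is one-half of 0.8.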

It may perhaps be felt that Wright's derivation of the path coefficient b is a trick with smoke and mirrors. It is mathematically valid, but Wright's claim that 'in a sense, it is legitimate to reverse the arrows....' invites the response that in another sense it is not legitimate, since there is no causal influence from the gametes back to the gametocyte. This part of the proof therefore goes against the spirit if not the mathematical letter of path analysis.

At the top of page 120 Wright explains, very tersely, how correlations between relatives can be derived from the path coefficients. Again, it should be noted that in simple cases, and with perfect additive heritability, the results are the same as Malecot's. Wright then attempts to take account of dominance. As noted above, on pages 117-8 of SM1 Wright gives formulae for the correlation between genotypic values with and without dominance. In the standard case of random mating the correlation comes out at root-[1/(1+p)], where p is the proportion of heterozygotes in the population. To adjust the correlations between relatives to allow for dominance, Wright multiplies them by 1/(1+p). He does not explain the logic behind this, but I think it is that each of the two correlated relatives has a genotypic value without dominance, which is the basis for the original correlation, and that these values can each be multiplied by root-[1/(1+p)] to give a typical adjusted correlation between the values with dominance. The effect is to reduce the correlation between the individuals by the factor 1/(1+p). It may perhaps be wondered why only the two individuals at each end of the chain, and not the intermediate individuals, have their values adjusted. I think the explanation is that dominance is essentially an effect on phenotypes rather than genotypes, and in calculating the correlation between the individuals at the ends of the chain we need not take account of dominance effects on intermediate phenotypes any more than we need take account of environmental effects on them, since these do not affect the path coefficients along the chain.

Unfortunately Wright discovered, after reading Fisher's 1918 paper, that except in the case of half-siblings his own treatment of dominance effects was invalid, and in a footnote to his famous 1931 paper on 'Evolution in Mendelian Populations' he withdrew it. His original method therefore never satisfactorily covered epistasis and dominance. He later attempted to incorporate a revised treatment of dominance in his method of path analysis, but the result was very complicated. [See EGP vol. 2, pp.435-6.] In this area Fisher's Analysis of Variance has been more generally used. The method of path diagrams remains very useful for the analysis of relationships, but the paths are now usually interpreted in Malecot's fashion as probabilities of Identity by Descent, and not as correlations.

The Problem of Negative and Zero Correlations

I emphasised earlier that in Wright's system the correlations between relatives, and therefore the measures of relatedness, can be zero or even negative. Yet it seems that Wright's actual procedures for measuring relatedness, by tracing path coefficients back through common ancestors, can only produce positive figures. For example, suppose that on average two randomly chosen members of a population have a degree of relatedness, measured by Identity by Descent within, say, the last thousand years, equivalent to that of full first cousins, i.e. a Malecot Coefficient of Relationship of one-eighth. On the face of it, if we trace back the paths of descent using Wright's methods, and work out the path coefficients, assuming complete additive heritability, the result will be a correlation of one-eighth, numerically equivalent to the Malecot coefficient. But the correlation coefficient between randomly selected members of a population, relative to that population as a whole, must be approximately zero. We therefore seem to have a contradiction.

It took me a while to see how this paradox can be resolved. I think the main explanation [see Note] is that in the usual applications of Wright's methods there is a tacit assumption that only the paths leading through common ancestors need be taken into account. All other paths can be regarded merely as background noise. For example, if we trace the paths between two full first cousins, we need only take into account the paths leading through the two grandparents they have in common, and not the other four grandparents, unless some of these lead back to other common ancestors in the fairly recent past. Ordinarily this is a reasonable approach, but it breaks down if it is applied to the kind of case referred to in the last paragraph. If we trace back the entire ancestry of two randomly chosen individuals, for some large number of generations, the ancestors will have a mixture of positive and negative correlations between them. The positive and negative correlations will (approximately) cancel out. In a complete path analysis all these correlations would need to be taken into account, even if they do not involve a direct path through a common ancestor. When properly interpreted, Wright's methods therefore do not lead to a contradiction.
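
The claim that positive and negative relationships must roughly cancel, relative to the population itself, can be checked with a short Python sketch. It rests on a standard algebraic identity rather than on anything in Wright: in a finite population of n individuals, the correlation averaged over all pairs of distinct individuals, measured relative to the mean and variance of that same population, is exactly -1/(n-1), whatever the values are. The genotypic values below are hypothetical.

```python
from itertools import permutations


def mean_pairwise_correlation(values):
    """Average correlation over all ordered pairs of distinct individuals,
    measured relative to the mean and variance of the same population."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    # permutations pairs up positions, so equal values in different
    # individuals still count as distinct pairs.
    products = [(x - mean) * (y - mean) for x, y in permutations(values, 2)]
    return sum(products) / len(products) / variance


def sign_counts(values):
    """Number of pairs whose deviations have the same sign and opposite signs."""
    mean = sum(values) / len(values)
    products = [(x - mean) * (y - mean) for x, y in permutations(values, 2)]
    return sum(p > 0 for p in products), sum(p < 0 for p in products)


# Hypothetical additive genotypic values for a population of eight individuals.
VALUES = [0.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.0, 1.0]

print(mean_pairwise_correlation(VALUES), -1 / (len(VALUES) - 1))
print(sign_counts(VALUES))
```

Any pairs that resemble each other more than average must therefore be offset by pairs that resemble each other less than average, and for a large population the average is close to zero.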

I had originally planned to go on to consider the extension of Wright's measures of kinship to the relations between populations, such as his well-known FST statistic. But the post is already long, so I will reserve the subject for another time.


Note: I say the main explanation, because the effect of common ancestry itself may also be reduced when we take account of negative correlations. For example, in the case of cousins with two common grandparents, these two grandparents may be negatively correlated, in which case the indirect path running through both of them would have a negative value. Or a common ancestor might have a negative coefficient of inbreeding (i.e. be less inbred than average for the population), which would reduce the path coefficient from parental genotype to gamete. But as far as I can see, these factors would never be sufficient to offset the positive correlations due to common ancestry entirely. It is therefore also necessary to take account of negative correlations between non-common ancestors.
