Friday, September 07, 2007

Notes on Correlation: Part 1   posted by DavidB @ 9/07/2007 02:28:00 AM

Several months ago I promised a series of posts on the work of Sewall Wright. This has been delayed for various reasons, including some time spent on this.

Another problem is that understanding Wright's work requires some knowledge of the theory of correlation and regression. Wright was accustomed to express his theories in terms of correlation, even in cases where other methods would now be preferred. For example, Malecot's interpretation of kinship in terms of identity by descent has completely replaced Wright's correlational approach.

The problem is that I don't want to presuppose any great knowledge of correlation and regression by the reader. Statisticians may take these things for granted, but it is salutary to note that George Price, when he began his work on altruism, 'didn't know a covariance from a coconut'. Unfortunately I don't know of a good online source containing everything the reader needs to know as background to Wright. The various Wiki articles on correlation seem too mathematically sophisticated for the general reader. Nor do I know of any modern statistical textbook that I can recommend, as I find that in general they are either too advanced or too elementary.

For my own purposes I prefer some classic older works. My 'bible' for statistics is George Udny Yule's Introduction to the Theory of Statistics. (From the 11th edition onwards this was co-authored by Maurice Kendall. I have the 14th edition, 1950.) Quinn McNemar's Psychological Statistics (2nd edition, 1955) is also very good. J. P. Guilford's Fundamental Statistics in Psychology and Education (2nd edn., 1950) has a good text, but seldom gives proofs of formulae.

For the historical aspects of correlation and regression I particularly recommend Stephen Stigler's The History of Statistics: The Measurement of Uncertainty before 1900 (1986) and Theodore Porter's The Rise of Statistical Thinking 1820-1900 (1986). It is also still worthwhile to look at some of Francis Galton's original papers on correlation and regression, which are all available here. For general readers the best is probably his 1890 article on 'Kinship and Correlation'.

From time to time I have made notes of my own on various aspects of correlation and regression, largely to ensure that I do not forget things I have worked out for myself. It occurred to me that if I strung these notes together with some linking commentary, they might provide the necessary background for my posts on Wright. This has proved harder than I expected. It is really difficult to treat these subjects in a way that is clear, concise, self-contained, unambiguous, and not too mathematically complicated. It is all too likely that some errors have crept in, so please let me know if you spot any.

I have divided the notes into three parts. The first part contains preliminaries about notation, etc, and introduces the concepts and main properties of bivariate (2-variable) correlation and regression. Part 2 will prove some important theorems and discuss some problems of interpretation. Part 3 will cover the basics of correlation and regression for more than two variables.

Even divided into these chunks the notes are long, and I do not imagine that anyone will want to read them all in one go. Hopefully they will be useful to people searching the web on these subjects. I will link back to them if and when I get round to the notes on Wright.

So here is Part 1:

[Added on Sunday 9 September: I have gone to some trouble to represent standard mathematical symbols in this blog publishing system, in such a way that they are readable by all common browsers. I thought I had succeeded, and it all looked OK when I posted it. But as of now (9.00 GMT) some nonsense symbols are appearing in my own browser! This may be a temporary problem (the publishing system sometimes goes haywire) but if not, I will try and fix it.]

[Added again: I don't know what the problem is, but the symbols are still unreadable in my browser, even though a short test post worked OK. So in case the problem continues, I have added a 'plain text' version which does not use math symbols. If you can't read the original version properly, scroll down to 'Version 2 (plain text)'. I have also taken the opportunity to add an explanatory remark about 'dependent' and 'independent' variables.]

[Added on Tuesday: I have now deleted the original version of the Notes, as it seems that most people could not read the math symbols, and it was just taking up space. I take this opportunity to emphasise that the Notes are not aimed at expert statisticians, and do not claim to deal with the most up-to-date issues in the theory of regression and correlation. On the other hand, I do believe that the issues discussed by such founders of statistics as Francis Galton, Karl Pearson, Udny Yule, R. A. Fisher, and Sewall Wright (whose statistical work has been somewhat neglected) are still interesting and important, and that modern students neglect the 'old masters' at their peril.]




Preliminaries

Unless otherwise stated, these notes deal only with the linear regression and correlation of two variables. Questions of sampling and measurement error will not be covered.

I assume that there are two sets of observations or measurements, represented by the variables x and y, with N items in each set, paired with each other in some way. In dealing with the correlation or regression between two sets of data, we must have some particular pairing in mind: obviously different pairings could produce different results. There is no limit on the kind of relationships that can be taken into account, provided there is a one-to-one correspondence between the items in the two sets. In some cases the same observation may be counted more than once: e.g. if we want to correlate the height of fathers and sons, some fathers may have more than one son, in which case the father's height may be counted as a separate item in relation to each son. It is assumed that all items have a numerical value. Usually this will be the result of some process of counting, measurement, or rank-ordering, but numbers can also be assigned to qualitative characters according to some arbitrary rule. For example, in order to calculate a correlation between siblings with respect to eye colour, blue eyes might be given the value 1 and brown eyes the value 2, or vice versa.

In dealing with the theory of correlation and regression it is often convenient to express quantities in each set of values as deviations above or below the mean of the set. So, for example, if the x variable denotes measurements of human heights, and the mean height in the population is 68 inches, then a height of 65 inches can be represented by a deviation value of -3, and a height of 70 inches by a deviation value of 2. In reading any text on correlation or regression, it is important to note whether the author uses raw values or deviation values in his formulae. I will often use deviation values, as this results in simpler algebraic expressions.

Notation

This version of the notes uses only non-mathematical typography. Large S will represent the sum of a set of quantities. Small s will represent the standard deviation of a set of quantities. V will represent the variance (the square of the standard deviation). ^2 will represent the square of a quantity. # will represent the square root of a quantity. Subscripts will be represented by a low-level dash, e.g. x_1 would represent x with the subscript 1. The sum of each x value multiplied by the corresponding y value will be denoted by Sxy. The sum of each x value plus the corresponding y value will be denoted by S(x + y). The mean of the raw x values (their sum divided by N) will be denoted by M_x, and the mean of the raw y values by M_y. The deviation values of x and y are then (x - M_x) and (y - M_y) respectively. The following points follow from these definitions and the standard rules of algebra:

(a) S(x + y) = Sx + Sy

(b) Sxy is not in general the same as SxSy (the sum of all x values multiplied by the sum of all y values).

(c) Sx^2 (the sum of the x values individually squared) is not in general the same as (Sx)^2 (the square of the sum of the x values).

(d) Where 'a' is a constant, S(xa) = aSx.

(e) S(x + a) = Sx + Na (since the constant enters into the sum N times, once for each value of x.)

(f) S[(x + a)(y + b)] = Sxy + bSx + aSy + Nab. Here the scope of the summation sign is the entire expression in square brackets. The intended interpretation is that each term (x + a) is to be multiplied by the corresponding (y + b), where a and b are constants. Note that the resulting sum includes the product ab N times, once for each pair of x and y values.

(g) S(x + a)^2 = Sx^2 + 2aSx + Na^2

(h) M_x = Sx/N, and M_y = Sy/N, where x and y represent raw values.

(i) It follows from (h) that the deviation values of x can be expressed in the form (x - Sx/N). The sum of the deviations is therefore S(x - Sx/N). But by point (e) this sum is equivalent to Sx - NSx/N = 0. Likewise for y. Therefore if we use x and y to represent deviation values, instead of raw values, then Sx = Sy = 0.
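
For readers who like to check such identities numerically, here is a small Python sketch of my own (purely illustrative, with made-up whole numbers, and no part of the argument) which verifies points (a) to (i). Note that within the sketch the symbol # begins an ordinary Python comment; it is not the square-root sign of my notation.

x = [2.0, 5.0, 7.0, 10.0]
y = [1.0, 4.0, 8.0, 9.0]
N = len(x)
a, b = 3.0, -2.0  # arbitrary constants for points (d) to (g)

S = sum  # 'S' plays the role of the summation sign used in these notes

Sx, Sy = S(x), S(y)
Sxy = S(xi * yi for xi, yi in zip(x, y))

# the values are whole numbers, so exact equality tests are safe here
assert S(xi + yi for xi, yi in zip(x, y)) == Sx + Sy        # (a)
assert Sxy != Sx * Sy                                       # (b) unequal for these numbers
assert S(xi ** 2 for xi in x) != Sx ** 2                    # (c) likewise
assert S(xi * a for xi in x) == a * Sx                      # (d)
assert S(xi + a for xi in x) == Sx + N * a                  # (e)
assert S((xi + a) * (yi + b) for xi, yi in zip(x, y)) == Sxy + b * Sx + a * Sy + N * a * b  # (f)
assert S((xi + a) ** 2 for xi in x) == S(xi ** 2 for xi in x) + 2 * a * Sx + N * a ** 2     # (g)

M_x, M_y = Sx / N, Sy / N                                   # (h)
deviations = [xi - M_x for xi in x]
assert S(deviations) == 0                                   # (i) the deviations sum to zero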

The variance of the x values will be denoted by Vx, and of the y values by Vy. By the definition of variance Vx = [S(x - M_x)^2]/N, where x represents raw values. If x and y are already expressed as deviation values, Vx = (Sx^2)/N and Vy = (Sy^2)/N. It follows that with deviation values NVx = Sx^2 and NVy = Sy^2. The standard deviation of x will be denoted by sx, and of y by sy. By definition sx is #Vx, and sy is #Vy. (Strictly, though not all texts are explicit on this point, the standard deviation is the positive square root of the variance, otherwise there would be an ambiguity of sign in many formulae.)
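
Again purely as an illustration (with arbitrary numbers), a few lines of Python showing the variance and standard deviation as defined here. The check against the standard library's population variance is only there to emphasise that the definition divides by N, not by N - 1.

import statistics

x = [2.0, 5.0, 7.0, 10.0]                        # made-up raw values
N = len(x)
M_x = sum(x) / N
Vx = sum((xi - M_x) ** 2 for xi in x) / N        # variance as defined above: divide by N
sx = Vx ** 0.5                                   # standard deviation: the positive square root
assert abs(Vx - statistics.pvariance(x)) < 1e-9  # agrees with the population variance
# statistics.variance(x) would divide by N - 1 instead (the 'sample' variance),
# which is not the definition used in these notes.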

The meaning of regression and correlation

The usual explanation of regression and correlation is something like this:

Regression provides a method of estimating or predicting the value of one variable given the corresponding value of the other variable. The latter variable is multiplied by a coefficient of regression to contribute to the best estimate of the first variable. [Added: The variable whose value we wish to estimate is usually called the dependent variable, and the other the independent variable, but these terms do not imply a direct causal relationship between them, or that if there is a causal relationship the causation runs from the independent to the dependent variable. 'Dependent' is merely a conventional term to designate the variable whose value we want to estimate.] If we denote the coefficient of regression by b, the equation x = a + by (where a is a constant, which may be positive, negative, or zero) provides the best estimate of x given the corresponding value of y (the regression of x on y). The term 'regression' itself is an unfortunate historical accident arising from the specific biological context in which the concept was originally formulated by Francis Galton. Alternative terms have sometimes been suggested, but did not catch on.

Of course, we do not always literally want to estimate or predict the value of a variable. The value may already be known, in which case prediction would be unnecessary, or easily measured, in which case a mere estimate would be a poor substitute. The use of regression is more often in connection with hypotheses. We may want to formulate a hypothesis about the general relationship between two variables, or we may already have such a hypothesis and want to test it. Calculating a regression coefficient may suggest such a hypothesis, or put an existing one to the test.

The regression coefficient of x on y need not be the same as the regression coefficient of y on x. Regression is not a symmetrical relation. Correlation, on the other hand, is a measure of the closeness of the relationship between the x and y values. Correlation is symmetrical, since x is as closely related to y as y is to x. Another way of putting it is that regression gives the best estimate of one variable given the other, while correlation measures how good the estimate is. (This is not to be confused with the question how reliable it is, in the sense of how much it varies when different samples are taken from the same population. This is a matter for the theory of sampling, which will not be dealt with here.) The closer the relationship, as measured by the correlation coefficient, the better the estimate. A positive correlation implies that high values of one variable tend to go together with high values of the other, while a negative correlation implies that high values of one variable tend to go together with low values of the other. A zero or near-zero correlation implies that the relationship between the variables is no closer than would be expected by chance.

The Pearson Formulae

The standard formulae for linear regression and correlation were devised in the 1890s by Karl Pearson (partly anticipated by Francis Ysidro Edgeworth). They are often known as the Pearson product-moment coefficients. (Other formulae, such as the intraclass correlation coefficient, the tetrachoric correlation coefficient, or rank order correlations, may be used for certain special purposes. In general these are modifications of the Pearson formulae rather than fundamentally different approaches.) A simplified derivation of the formulae was introduced by George Udny Yule.

Using the notation explained above, the Pearson coefficient of regression of x on y can be expressed as Sxy/Sy^2, or equivalently Sxy/NVy, where the x's and y's are deviation values. The 'best estimate' for the value of x given the corresponding value of y is therefore x = a + (Sxy/Sy^2)y. (In fact, when deviation values are used, a = 0.)

The coefficient of regression of y on x can be expressed as Sxy/Sx^2, or equivalently as Sxy/NVx.

It will be noted that both regression coefficients contain the term Sxy/N. This is known as the covariance of x and y, or cov_xy. The regression coefficients can therefore also be expressed as cov_xy/Vy and cov_xy/Vx.

The coefficient of correlation between x and y is Sxy/Nsx.sy, or equivalently cov_xy/sx.sy, where x and y are deviation values. The correlation coefficient is traditionally designated by the letter r, and the correlation between x and y as r_xy. Note that Sxy = r_xyNsx.sy.
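
To make these formulae concrete, here is a short illustrative Python sketch (the numbers are invented) which computes the covariance, the two regression coefficients, and the correlation coefficient directly from deviation values.

x_raw = [160.0, 165.0, 170.0, 175.0, 180.0]      # made-up paired measurements
y_raw = [158.0, 166.0, 167.0, 178.0, 176.0]
N = len(x_raw)
M_x, M_y = sum(x_raw) / N, sum(y_raw) / N
x = [xi - M_x for xi in x_raw]                   # deviation values
y = [yi - M_y for yi in y_raw]

Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sxx = sum(xi ** 2 for xi in x)                   # Sx^2 in the notation of these notes
Syy = sum(yi ** 2 for yi in y)                   # Sy^2
sx, sy = (Sxx / N) ** 0.5, (Syy / N) ** 0.5

cov_xy = Sxy / N
b_xy = Sxy / Syy                                 # regression of x on y
b_yx = Sxy / Sxx                                 # regression of y on x
r_xy = Sxy / (N * sx * sy)                       # correlation coefficient
print(cov_xy, b_xy, b_yx, r_xy)
# with deviation values the constant a is zero, so the best estimate of x
# given y is simply b_xy * y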

The use of r (and not c) for the correlation coefficient is another historical accident. It originally stood for 'reversion' in an 1877 paper by Galton. In 1885 Galton adopted the term 'regression', and kept the abbreviation r, even when he later subsumed his concept of regression in the broader concept of correlation (see further below). Karl Pearson and other statisticians continued to use r for correlation, and the usage is too deeply rooted to change.

There does not seem to be any universal abbreviation for the coefficients of regression. Small r is pre-empted for the correlation coefficient, and large R is often used for another purpose (the multiple correlation coefficient). Some authors use Reg. or reg. to indicate regression, but this seems clumsy. The letters B, b, or beta are sometimes used for regression coefficients, and I will use b. Since regression is not in general symmetrical, it is necessary to distinguish between b_xy and b_yx, where b_xy is the coefficient of the regression of x on y, and b_yx the coefficient of the regression of y on x.

There is a close mathematical relationship among the coefficients of correlation and regression, which all contain the term cov_xy, and can be converted into each other by multiplying or dividing by sx and sy. For example, the coefficient of the regression of x on y, b_xy = Sxy/NVy, can be expressed as (r_xy)sx/sy, and the coefficient of the regression of y on x, b_yx = Sxy/NVx, as (r_xy)sy/sx. It follows that b_xy = b_yxVx/Vy. The correlation coefficient r_xy is the 'mean proportional' between the regression coefficients, #[(b_xy)(b_yx)].
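
These relationships are easy to confirm numerically. The following sketch (again with arbitrary numbers, and purely illustrative) checks each of them in turn.

x_raw = [12.0, 15.0, 9.0, 20.0, 14.0]
y_raw = [30.0, 41.0, 25.0, 49.0, 35.0]
N = len(x_raw)
M_x, M_y = sum(x_raw) / N, sum(y_raw) / N
x = [xi - M_x for xi in x_raw]                   # deviation values
y = [yi - M_y for yi in y_raw]
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Vx = sum(xi ** 2 for xi in x) / N
Vy = sum(yi ** 2 for yi in y) / N
sx, sy = Vx ** 0.5, Vy ** 0.5
r = Sxy / (N * sx * sy)
b_xy = Sxy / (N * Vy)
b_yx = Sxy / (N * Vx)
assert abs(b_xy - r * sx / sy) < 1e-9            # b_xy = (r_xy)sx/sy
assert abs(b_yx - r * sy / sx) < 1e-9            # b_yx = (r_xy)sy/sx
assert abs(b_xy - b_yx * Vx / Vy) < 1e-9         # b_xy = b_yxVx/Vy
assert abs(abs(r) - (b_xy * b_yx) ** 0.5) < 1e-9 # r is the mean proportional (up to sign)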

The regression and correlation coefficients can be expressed in a variety of other equivalent formulae. Some authors use expressions with raw values of x and y, rather than deviation values, in which case the correlation coefficient takes the form
S(x - M_x)(y - M_y)/Nsx.sy.
It can be shown that
S(x - M_x)(y - M_y)/Nsx.sy
= (Sxy - NM_x.M_y)/Nsx.sy
so this formula is also often used. Another equivalent formula for raw values is (NSxy - SxSy)/#[NSx^2 - (Sx)^2]#[NSy^2 - (Sy)^2].
Some authors (especially on psychometrics) also assume that all quantities are expressed with their own standard deviation as the unit of measurement, in which case sx = sy = 1, and the coefficients of correlation and regression all reduce to cov_xy. Formulae are also sometimes modified to allow for sampling error.

The choice of the formula to use depends on the purpose. For theoretical purposes it is generally simplest to use formulae with deviation values. For example, if we square the correlation coefficient for deviation values, Sxy/Nsx.sy, the numerator of r^2 is (Sxy)^2, whereas if we squared it in the form for raw values, S(x - M_x)(y - M_y)/Nsx.sy, the numerator in its simplest expression would be (Sxy)^2 - 2(Sxy)NM_x.M_y + N^2M^2_x.M^2_y. Further work involving r^2 in this form could get very messy, e.g. if several such items have to be multiplied together. If on the other hand we needed to calculate the actual value of a coefficient from empirical data, one of the raw-value formulae would be more convenient. But the need for this now seldom arises, as correlation and regression coefficients can be calculated from raw data even by modestly priced pocket calculators.
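
As a check that the raw-value and deviation-value formulae really are equivalent, here is a small illustrative sketch (arbitrary numbers again) which computes the correlation coefficient both ways on the same data.

x = [3.0, 7.0, 8.0, 12.0, 15.0]                  # raw values
y = [5.0, 9.0, 7.0, 14.0, 18.0]
N = len(x)
Sx, Sy = sum(x), sum(y)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sxx = sum(xi ** 2 for xi in x)
Syy = sum(yi ** 2 for yi in y)
# raw-value formula: (NSxy - SxSy)/#[NSx^2 - (Sx)^2]#[NSy^2 - (Sy)^2]
r_raw = (N * Sxy - Sx * Sy) / (((N * Sxx - Sx ** 2) ** 0.5) * ((N * Syy - Sy ** 2) ** 0.5))

M_x, M_y = Sx / N, Sy / N
dx = [xi - M_x for xi in x]                      # deviation values
dy = [yi - M_y for yi in y]
sx = (sum(d ** 2 for d in dx) / N) ** 0.5
sy = (sum(d ** 2 for d in dy) / N) ** 0.5
r_dev = sum(u * v for u, v in zip(dx, dy)) / (N * sx * sy)
assert abs(r_raw - r_dev) < 1e-9                 # the two formulae agree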

What is a 'best estimate'?

As noted above, the regression of x on y is usually described as producing the 'best' estimate of x given y. But in what sense is it the best? The texts usually just say that it is the estimate given by the method of least squares, that is, the estimate that minimises the sum of the squares of the differences between the estimated values and the observed values, or the 'errors of estimation'. The criterion of 'least squares' was taken over by Pearson and Yule from the theory of errors, as used in astronomy, geodesy, etc, for determining the best estimate for a physical value (e.g. the true position of a star) given a number of imperfect observations. But why is this the best estimate?

There are some good practical reasons for using the method of least squares. An estimate based on least squares is unbiased, in the sense that it has no systematic tendency to over- or under-estimate the true value, and it makes use of all available information (unlike, say, an estimate using median or modal values). It is also stable, in the sense that a small change in the observations does not produce a large change in the estimate. The method has various more technical advantages. But beyond this, many 19th century mathematicians regarded it as giving the most probable value of the unknown true quantity. This requires assumptions to be made about the distribution of prior probabilities. If it can be assumed that all possible values of the true quantity are equally probable a priori, then the method of least squares gives the most probable true value after taking account of the observations. [Note 1] But the assumption of equal prior probabilities would not nowadays be generally accepted as valid, in the absence of any empirical evidence about the distribution of probabilities. And even if it were, it is not clear that a method devised for the estimation of physical quantities, which have a single 'true' value, can legitimately be used to estimate a variable trait of a population.

However, the method seldom gives intuitively implausible results. One advantage is that if we want to make a single estimate for any set of numbers, the estimate given by the method of least squares is simply their mean. Suppose that the x's designate the raw values of a set of N numbers. If we take the mean value M_x as an 'estimate' of the x's, the sum of squares of the 'errors of estimation' is then
S(x - M_x)^2 = Sx^2 - 2Sx.M_x + NM^2_x = Sx^2 - NM^2_x.
For any other value of the estimate, greater or less than the mean by an amount d, it will be found after a bit of algebra that the sum of squares of the 'errors' equals Sx^2 - NM^2_x + Nd^2, which exceeds the earlier sum of squares by Nd^2. Since Nd^2 is necessarily positive for any non-zero value of d, the sum of squares of the errors is always greater than when the estimate is the mean. The mean is therefore the 'least squares' estimate. The main practical weakness of the least squares method is that it may give too much weight to 'outliers' - freak extreme values which add disproportionately to the sum of squares.
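
The algebra can also be confirmed numerically. In the sketch below (arbitrary numbers, purely illustrative) the sum of squared errors for an estimate displaced from the mean by an amount d always exceeds the sum for the mean itself by exactly Nd^2.

x = [4.0, 7.0, 9.0, 10.0, 15.0]                  # arbitrary numbers
N = len(x)
M_x = sum(x) / N

def sum_of_squared_errors(estimate):
    return sum((xi - estimate) ** 2 for xi in x)

for d in (-2.0, 0.5, 3.0):                       # a few arbitrary offsets from the mean
    extra = sum_of_squared_errors(M_x + d) - sum_of_squared_errors(M_x)
    assert abs(extra - N * d ** 2) < 1e-9        # the penalty is exactly N*d^2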

In the case of a regression equation, we do not want a single estimate for the whole set of observations, but the best estimate of the dependent variable for the corresponding value of the independent variable. If there are several values of the dependent variable corresponding to a narrow range of the independent variable, then we can plausibly regard the mean of those values as the best estimate associated with that particular range. The regression equation based on the Pearson formula does in fact give a close approximation to that mean value, provided samples are reasonably large, and the distribution of both variables is approximately normal.
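
The claim can be illustrated with a rough simulation (my own sketch, not part of the argument). It generates a large artificial sample with a built-in correlation of 0.6 between two roughly normal variables, and compares the average of the y-values whose x-value falls in a narrow band around 1 with the value the regression line gives for x = 1.

import random

random.seed(1)                                   # fixed seed so the sketch is repeatable
r_true = 0.6                                     # the correlation built into the artificial data
N = 100000
x, y = [], []
for _ in range(N):
    xi = random.gauss(0.0, 1.0)
    yi = r_true * xi + (1.0 - r_true ** 2) ** 0.5 * random.gauss(0.0, 1.0)
    x.append(xi)
    y.append(yi)

M_x, M_y = sum(x) / N, sum(y) / N
dx = [xi - M_x for xi in x]
dy = [yi - M_y for yi in y]
b_yx = sum(u * v for u, v in zip(dx, dy)) / sum(u ** 2 for u in dx)   # regression of y on x

band = [yi for xi, yi in zip(x, y) if 0.95 < xi < 1.05]   # y-values whose x lies near 1
print(sum(band) / len(band))                     # empirical mean of y in the band: about 0.6
print(M_y + b_yx * (1.0 - M_x))                  # regression estimate at x = 1: also about 0.6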

It might be supposed that the 'best' estimate would be the one that minimises the absolute size of the errors (disregarding sign), rather than the squares of the errors. However, minimising the absolute size of the errors is usually more difficult, and it can lead to intuitively odd results. [Note 2] 'Least squares' therefore prevails, despite the lack of any conclusive justification for the method. As Yule remarks, 'the student would do well to regard the method as recommended chiefly by its comparative simplicity and by the fact that it has stood the test of experience' (p.343).

Having determined the regression coefficients by the method of least squares, the 'goodness' of the resulting estimates may be assessed by calculating the differences between the estimated and observed values. The smaller the differences, the better the estimate. The correlation coefficient provides a means of quantifying the 'goodness' of the estimates. It can be shown that its value ranges between 1 and -1, and that the 'goodness' of the estimate is the same whichever variable we take as given.
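
As a numerical illustration of this symmetry, the sketch below uses the standard result that the mean squared error of the least-squares estimate is (1 - r^2) times the variance of the estimated variable; the proportion of variance left unexplained comes out the same in both directions. The numbers are arbitrary.

x_raw = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
y_raw = [1.0, 5.0, 4.0, 9.0, 8.0, 14.0]
N = len(x_raw)
M_x, M_y = sum(x_raw) / N, sum(y_raw) / N
x = [xi - M_x for xi in x_raw]                   # deviation values
y = [yi - M_y for yi in y_raw]
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Vx = sum(xi ** 2 for xi in x) / N
Vy = sum(yi ** 2 for yi in y) / N
r = Sxy / (N * (Vx * Vy) ** 0.5)
b_yx = Sxy / (N * Vx)                            # regression of y on x
b_xy = Sxy / (N * Vy)                            # regression of x on y
mse_y = sum((yi - b_yx * xi) ** 2 for xi, yi in zip(x, y)) / N   # errors estimating y from x
mse_x = sum((xi - b_xy * yi) ** 2 for xi, yi in zip(x, y)) / N   # errors estimating x from y
assert abs(mse_y / Vy - (1.0 - r ** 2)) < 1e-9
assert abs(mse_x / Vx - (1.0 - r ** 2)) < 1e-9   # the same proportion in both directions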

Correlation, regression and 'common elements'

For those who are still uneasy about the justification in terms of least squares, there is an alternative or supplementary interpretation of correlation and regression which is closer to Francis Galton's original conceptions and may be intuitively easier to grasp, though it does not seem to be popular with professional statisticians. If we imagine the corresponding items in the two correlated sets of variables to contain certain common elements, or to be influenced by certain common causes, then it is plausible that the degree of similarity between them should reflect the proportion of common elements or causes. Subject to certain assumptions, if the common elements account for a proportion A of all the elements of the x variable, and a proportion B of all the elements of the y variable, then the correlation between x and the common elements will be #A, the correlation between y and the common elements will be #B, and the correlation between x and y will be #(AB). If A = B, the correlation between x and y is therefore simply A. As a familiar example, full siblings on average have half their genes in common, so if their phenotypes are determined entirely by additive genes, the correlation between them is .5. Admittedly, a rigorous proof of the correlation formula based on 'common elements' requires various assumptions to be made about the size of the common elements, absence of other correlations, and so on, but purely for conceptual purposes - as a supplement to the 'least squares' approach - no great rigour is needed. The essential point is that if two variables are influenced by common causes or common elements, then we would expect their values to be related in much the same way as shown in the formal theory of correlation, without appealing to the method of least squares.
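
A rough simulation may make the idea more concrete. In the sketch below (my own, and only illustrative) the 'common elements' are represented by a shared component of variance rather than by literally counted elements, which is one simple way of realising the assumptions just mentioned: each variable is a mixture of the common component and an independent component of its own, with the proportions A and B chosen arbitrarily.

import random

random.seed(2)                                   # fixed seed so the sketch is repeatable

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    return sum(a * b for a, b in zip(du, dv)) / (
        (sum(a ** 2 for a in du) ** 0.5) * (sum(b ** 2 for b in dv) ** 0.5))

A, B = 0.5, 0.32                                 # proportions of 'common' influence (arbitrary)
N = 50000
common = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [A ** 0.5 * c + (1.0 - A) ** 0.5 * random.gauss(0.0, 1.0) for c in common]
y = [B ** 0.5 * c + (1.0 - B) ** 0.5 * random.gauss(0.0, 1.0) for c in common]

print(corr(x, common), A ** 0.5)                 # roughly equal
print(corr(y, common), B ** 0.5)                 # roughly equal
print(corr(x, y), (A * B) ** 0.5)                # roughly equal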

In the 'common elements' approach, correlation is taken as the fundamental relationship, and regression is derivative, rather than the other way round. If we know that there is a correlation r between two sets of variables, reflecting a certain proportion of common elements, then we will expect a given deviation in one variable to be matched by some deviation in the other variable, though it will be diluted by the 'non-common' elements on both sides. So we would expect to be able to 'estimate' the value of one variable from the value of the other, in the sense that a deviation in x will correspond, on average, to a certain proportionate deviation in the y variable. This proportionate deviation will not necessarily be the same as the correlation coefficient itself. The same causes may have a greater effect on one variable than on the other, and the two variables (e.g. temperature and rainfall) may not be measured in the same units, so there could also be an arbitrary factor of scaling to be reflected in the regression coefficients.

Galton's original conceptions are relevant here. Galton did not conceive of regression as a means of estimating the value of one variable from another. In the 1870s and 1880s he was trying to discover quantitative laws of biological inheritance. He noticed that offspring tended to resemble their parents, but not perfectly. If the parents were markedly different from the mean of the population, the offspring would tend to be intermediate between the parents and the population mean, or as Galton put it they would 'revert' or 'regress' towards the mean. By 1885 he had developed a procedure for measuring the regression of one variable on another, which he saw primarily as a measure of resemblance and difference based on the proportion of hereditary elements two individuals had in common. Since Galton was dealing with cases where the units of measurement and the variability of the two variables were the same, or could be easily adjusted on an ad hoc basis (e.g. by converting female heights to their male equivalents), he found his techniques sufficient for measuring resemblance.

In the late 1880s Galton wanted to measure the resemblance of different parts of the body, and encountered the problem that the size and variability of different parts (e.g. fingers and thighbones) may be very different. In this case the regression coefficients by themselves are not much use as a measure of resemblance. Galton hit on the solution of rescaling all the variables into units based on their own variability (in modern terms their standard deviation, though Galton used a slightly different measure). He then realised that he had discovered a principle, which he called correlation, of wider generality than his original concept of regression. Rather than seeing regression and correlation as sharply different concepts, he saw correlation as an extension and generalisation of regression as a measure of resemblance. It was Yule, around 1900, who pioneered the modern approach which distinguishes more sharply between the two concepts.

But in some ways Galton's approach may still be useful. This can be illustrated using Pearson's formulae. In most texts the close mathematical relationship between the Pearson coefficients of correlation and regression is mentioned but not really explained. The coefficient of regression of x on y is Sxy/Sy^2. But we can rescale the x and y measurements by dividing them all by their own standard deviations. Sxy/Sy^2 then becomes [S(x/sx)(y/sy)]/S[(y/sy)^2]. This can be re-arranged as [Sxy/(sx.sy)]/(Sy^2/Vy) = (Sxy/(sx.sy))/N = r_xy.
In exactly the same way the regression of y on x can be rescaled, and also comes out as equal to r_xy. Thus the correlation coefficient can be interpreted in Galton's sense as a standardised regression coefficient (or the regression coefficients can be regarded as the correlation coefficient modified by differences of scale, etc), and the close mathematical relationship between the coefficients is no longer so mysterious.
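
For anyone who wants to see the rescaling argument in numbers, here is a short illustrative sketch (arbitrary data): after dividing each set of deviation values by its own standard deviation, both regression coefficients come out equal to the correlation coefficient.

x_raw = [3.0, 6.0, 6.0, 9.0, 11.0]
y_raw = [40.0, 55.0, 52.0, 70.0, 78.0]
N = len(x_raw)
M_x, M_y = sum(x_raw) / N, sum(y_raw) / N
x = [xi - M_x for xi in x_raw]                   # deviation values
y = [yi - M_y for yi in y_raw]
sx = (sum(xi ** 2 for xi in x) / N) ** 0.5
sy = (sum(yi ** 2 for yi in y) / N) ** 0.5
r = sum(xi * yi for xi, yi in zip(x, y)) / (N * sx * sy)

xs = [xi / sx for xi in x]                       # rescaled ('standardised') deviation values
ys = [yi / sy for yi in y]
S_xs_ys = sum(u * v for u, v in zip(xs, ys))
b_xy_std = S_xs_ys / sum(v ** 2 for v in ys)     # regression of x on y after rescaling
b_yx_std = S_xs_ys / sum(u ** 2 for u in xs)     # regression of y on x after rescaling
assert abs(b_xy_std - r) < 1e-9
assert abs(b_yx_std - r) < 1e-9                  # both equal the correlation coefficient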

Note 1: Actually, this is not true unless certain further assumptions are made about the nature and distribution of errors, notably that there is no tendency for errors to go in the same direction. This is often false. The method of least squares does not eliminate the need to detect and remove any systematic sources of error.

Note 2: As a simple example, suppose we have three observations, A, B, and C, of a point in 2-dimensional Euclidean space. We wish to find the 'best' estimate for the true position of the point, using the method of least squares. We therefore want to find a point P such that the sum of the squares of the distances AP, BP, and CP is minimised. First, take a Cartesian coordinate system with axes x and y, and find the coordinates of the points A, B, and C. Let us call these coordinates x'A, y'A for point A; x'B, y'B for point B; and x'C, y'C for point C. We now want to find the coordinates x'P and y'P of the required point P. Thanks to Pythagoras' Theorem, the square of a distance between two points in 2-dimensional Euclidean space is equal to the sum of the squares of the distances between the coordinates of the points along the x and y axes. Therefore the squares of the distances AP, BP, and CP are:

AP^2 = (x'A - x'P)^2 + (y'A - y'P)^2

BP^2 = (x'B - x'P)^2 + (y'B - y'P)^2

CP^2 = (x'C - x'P)^2 + (y'C - y'P)^2

We need to find the values of x'P and y'P for which the total sum of these squares is minimised. Since the value chosen for x'P does not affect the value of the squares involving y'P (or vice versa), the total sum of squares will be minimised if we can minimise the sums of squares involving the x and y coordinates separately. This can be done simply by taking the mean value of x'A, x'B, and x'C as the value of x'P, and the mean value of y'A, y'B, and y'C for y'P. (The mean of a set of numbers is the value that minimises the sum of the squares of the 'errors', which in this case are the distances between the relevant coordinates.) The least squares estimate is therefore obtained by a simple process of averaging, and gives a point at the centre of gravity of the triangle, which seems intuitively satisfactory. The same method can easily be extended to find the best estimate for the position of a point from any number of observations in a Euclidean space with any number of dimensions. In contrast, to minimise the sum of the absolute values of the distances (in this case, to minimise AP + BP + CP), there is no simple general method. A variety of ad hoc solutions are needed, which may involve difficult geometrical problems and produce intuitively unsatisfactory results. For example, if we have a triangle with one angle greater than 120 degrees, then the vertex at that angle is itself the required point, regardless of the position of the other vertices ('Steiner's Problem').
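
A few lines of Python (purely illustrative, using an arbitrary triangle) confirm the first part of this: the centre of gravity minimises the sum of squared distances, so moving the candidate point away from it in any direction can only increase that sum.

points = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]    # arbitrary observations A, B and C

def sum_of_squared_distances(px, py):
    return sum((px - x) ** 2 + (py - y) ** 2 for x, y in points)

cx = sum(x for x, _ in points) / len(points)     # centre of gravity: the mean of the coordinates
cy = sum(y for _, y in points) / len(points)
at_centroid = sum_of_squared_distances(cx, cy)
for dx in (-0.5, 0.0, 0.5):                      # nudge the candidate point around
    for dy in (-0.5, 0.0, 0.5):
        assert sum_of_squared_distances(cx + dx, cy + dy) >= at_centroid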