Sunday, June 18, 2006

The French cross-fostering adoption study of IQ   posted by Darth Quixote @ 6/18/2006 10:38:00 PM

The now-classic adoption study by Capron and Duyme (1989, 1996) was once offered to me as evidence for a strong effect of socioeconomic status (SES) on observed IQ differences among racial groups.

My own view is that closer scrutiny of the study tends to undermine this conclusion. For this reason I thought it worthwhile to cover it here in some depth. In any event this study is among the most important to be conducted during the waning pre-modern era of human behavioral genetics. I'm sure that most readers are at least passingly familiar with the massive body of evidence demonstrating a polygenic basis for the inheritance of IQ (e.g., Bouchard et al., 1990). But geneticists of such stature as Richard Lewontin and Oscar Kempthorne have dismissed this evidence, claiming that genetic causation cannot be shown in the absence of the cross-fostering designs typical of genetic research on plants and animals. We are thus fortunate that the French cross-fostering study, which answers these criticisms, has made it into the literature. As we await the coming integration of molecular and quantitative genetics, readers interested in the inheritance of mental ability would do well in the meantime to keep these results in a convenient cranny of their heads.

A helpful feature of the Capron and Duyme study is that the raw data are available in Appendix A of their 1996 paper. Both papers, as well as an important follow-up by Jensen (1997), have been uploaded to GNXP Forum.



As an exercise I have done all the data analysis myself. There are some differences between my results and the published ones. Unless stated otherwise, the differences are trivially small and of no practical importance.

The 38 subjects were selected from eight different French adoption agencies. All were relinquished at birth by their biological parents and adopted within 13 months. The two factors used as selection criteria were SES of biological parents (B) and that of adoptive parents (A). The two levels chosen for each factor were separated as much as possible for optimal statistical power. Thus, only parents of the highest and lowest SES contributed to the variation among subjects, constituting a simple 2-by-2 design. Subjects were initially screened from all children born and relinquished to the agencies between 1 January 1970 and 31 December 1974. The search was then extended to 31 December 1980 to find more B+/A- subjects (high-SES biological parents and low-SES adoptive parents), as these are extraordinarily rare. In the final group, all cells except B+/A- contained 10 subjects; only 8 B+/A- subjects could be found.

The high-SES parents consisted of doctors, professors, students, and business executives. The low-SES parents consisted of farmers and unskilled workers. They averaged 16.1 and 6.6 years of education respectively. No subjects had any contact with their biological parents. Note that because of the less-than-unity slope of the regression of IQ on SES, the contrast in IQ between the sets of biological parents was probably not as wide as their contrast in SES.

At age 14 years, subjects were administered the French translation of the Wechsler Intelligence Scale for Children-Revised (WISC-R). The WISC-R consists of the following subtests:

  • Information. This is essentially a trivia test of general knowledge. The subject is asked questions such as, "What is the capital of Italy?"
  • Similarities. The subject is asked to state in what way two distinct concepts are alike. The more difficult items call for the articulation of fairly abstract commonalities: "What do poem and statue have in common?"
  • Arithmetic. The subject must solve arithmetic problems in his head.
  • Vocabulary. The subject must provide satisfactory definitions for a series of words.
  • Comprehension. The subject must provide explanations for various facts, concepts, and actions. "Why do we vote?" "Why should you give money to organized charities rather than to beggars in the street?" "What should you do if you smell smoke in a crowded theater?"
  • Picture Completion. The subject must point out missing features in line drawings. An example of an easy item is a picture of a musician playing a cello without any strings.
  • Picture Arrangement. Each item consists of a disarranged series of cartoon panels that must be ordered to form a coherent story.
  • Block Design. Each item consists of a 2-D picture of some structure constructed out of cubes with different-colored faces. The subject is given actual 3-D cubes with different-colored faces and asked to reproduce the structure in the picture.
  • Object Assembly. The subject must work through a series of jigsaw-like puzzles.
  • Coding. The subject is given a booklet with an arbitrary code linking certain letters to certain numbers at the top of the page. A series of letters is printed on the pages. The subject must write the number called for by the code beneath each letter, completing as many as possible within the time limit.

Each subject was paired with a classmate from school. Both were then administered the WISC-R at school by a clinician unaware of the study aims. The tests were scored by a psychologist who was similarly unaware. Here are the descriptive statistics (means and then SDs parenthetically) of Full Scale IQ (the scaled sum of subtest scores):

B+/A+: 119.5 (12.0)
B+/A-: 107.4 (11.5)
B-/A+: 103.7 (12.6)
B-/A-: 92.4 (15.2)
all B+: 114.1
all B-: 98.1
all A+: 111.6
all A-: 99.1

In other words, the main effect of B (genetic and prenatal factors) is 15.4 points, and that of A (environmental factors associated with SES) is 11.7. The interaction of B and A is utterly insignificant (t = 0.109, p = 0.91). As you can see, the B effect is larger than the A effect, but not significantly so (bootstrap p = 0.25).
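For readers who want to check the arithmetic, here is a minimal Python sketch that works only from the rounded cell means, SDs, and cell sizes listed above, not from the raw data in Appendix A. The 15.4 and 11.7 figures correspond to unweighted averages of the simple effects; the differences between the marginal means (16.0 and 12.5) are slightly larger because the B+/A- cell has only 8 subjects. The pooled-variance t for the interaction is an assumption about the test used, so it should land near the reported value without matching it exactly. The variable names are illustrative.

```python
from math import sqrt

# Cell summaries from the table above: (mean, SD, n) per B/A combination.
cells = {
    ("B+", "A+"): (119.5, 12.0, 10),
    ("B+", "A-"): (107.4, 11.5, 8),
    ("B-", "A+"): (103.7, 12.6, 10),
    ("B-", "A-"): (92.4, 15.2, 10),
}


def cell_mean(b, a):
    """Mean Full Scale IQ for one cell of the 2-by-2 design."""
    return cells[(b, a)][0]


# Unweighted main effects: average each simple effect over the other factor.
main_b = ((cell_mean("B+", "A+") - cell_mean("B-", "A+"))
          + (cell_mean("B+", "A-") - cell_mean("B-", "A-"))) / 2
main_a = ((cell_mean("B+", "A+") - cell_mean("B+", "A-"))
          + (cell_mean("B-", "A+") - cell_mean("B-", "A-"))) / 2

# Interaction contrast: does the A effect differ between B+ and B-?
interaction = ((cell_mean("B+", "A+") - cell_mean("B+", "A-"))
               - (cell_mean("B-", "A+") - cell_mean("B-", "A-")))

# Assumed test: pooled within-cell variance, contrast weights of +1/-1.
df_error = sum(n - 1 for _, _, n in cells.values())
pooled_var = sum((n - 1) * sd ** 2 for _, sd, n in cells.values()) / df_error
se_interaction = sqrt(pooled_var * sum(1 / n for _, _, n in cells.values()))
t_interaction = interaction / se_interaction

print(round(main_b, 1), round(main_a, 1), round(interaction, 1))
# prints: 15.4 11.7 0.8
print(round(t_interaction, 2), "on", df_error, "degrees of freedom")
```

The interaction contrast is less than one IQ point against a standard error of several points, which is why the test comes nowhere near significance.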

The A effect is still rather large. What accounts for the discrepancy between this result and American findings to the effect that shared rearing environment has little influence on IQ? One possibility is that estimates of environmental effects on IQ from American studies are biased downward because of range restriction in the quality of adoptive homes. Since adoption agencies try to find the best possible homes for their wards, it is possible that the American studies have inadequately sampled those environments found at the lowest level of SES that Capron and Duyme made an extraordinary effort to include in their study. Yet another possibility, argued by Judith Rich Harris in her excellent new book No Two Alike and her 10 Questions session with GNXP, is that a near-zero IQ correlation among adoptive relatives and a significant mean effect of SES on IQ each capture a subtly different phenomenon.

I myself favor an age effect. Recall that whereas IQ correlations for adoptive siblings hover around 0.20 when taken from children, they decline to near zero in mature adults. This is one observation supporting the generalization that the heritability of IQ increases with age. These French adoptees were tested at age fourteen. Perhaps that is not late enough for the effects of the unusually dramatic environmental intervention brought about in this study to have washed out. I suspect that in a follow-up study (say at age 25) we would have seen the cell means converge toward the means by SES of biological parents.

Some light is shed on the nature of the B and A effects by decomposing Full Scale IQ into subtest scores. Here are the effect sizes by subtest in scaled-score points (SD = 3), with the B effect before the slash and the A effect after it:

Information: 1.965/2.435
Similarities: 2.924*/1.577
Arithmetic: 2.659/1.141
Vocabulary: 3.300*/1.500
Comprehension: 2.477/1.324
Picture Completion: 0.877/0.924
Picture Arrangement: 1.618/0.782
Block Design: 3.112*/2.688
Object Assembly: 1.982/2.218
Coding: 0.877/1.924


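The B effect is significant at the 0.05 level for Information, Similarities, Arithmetic, Vocabulary, Comprehension, and Block Design; the A effect is significant at that level for Information, Block Design, Object Assembly, and Coding. Asterisks indicate significance at the 0.005 level. No subtest showed a significant interaction between the B and A effects.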

Those readers who are familiar with the Wechsler scales may notice an interesting trend in the effect sizes. Take Vocabulary and Block Design, the two subtests that are most sensitive to the B effect. These subtests are also the two that together correlate most highly with Full Scale IQ as a whole. This suggests that it is primarily g that is responsible for the heritability of IQ.

Recall that g is invoked to account for the fact that all mental ability tests are positively intercorrelated. In the common-factor model this is taken to mean that all mental ability tests measure, with varying degrees of sensitivity, an emergently unidimensional property of the brain that we call g. If it happens that a given non-psychometric variable (such as brain volume or heritability) has a privileged or causal relationship to g, then as a first approximation we might expect a positive association between the g loading of a test and the magnitude of its correlation with that variable. (Unfortunately, I cannot take readers beyond a first approximation of the answers to the questions that are now examined because the structural equation models commonly used nowadays for these purposes are over my head at the moment.)

What exactly is this B portion of the variance in IQ that parents transmit to their biological children intact through separation at birth and differences in SES? If it is g, then we would like to see a positive association between the g loading of a WISC subtest and its sensitivity to the B effect. The following are the g loadings of the WISC subtests, in this order: the 38 French adoptees, the 1,868 white members of the American standardization sample, and the 305 black members of that sample:

Information: 0.82/0.67/0.65
Similarities: 0.77/0.67/0.62
Arithmetic: 0.62/0.57/0.60
Vocabulary: 0.75/0.72/0.71
Comprehension: 0.85/0.60/0.61
Picture Completion: 0.49/0.51/0.47
Picture Arrangement: 0.62/0.49/0.49
Block Design: 0.70/0.65/0.61
Object Assembly: 0.60/0.50/0.53
Coding: 0.33/0.37/0.36

The subtests showing a significant B effect are Information, Similarities, Arithmetic, Vocabulary, Comprehension, and Block Design; those showing a significant A effect are Information, Block Design, Object Assembly, and Coding. The g loadings for the French sample are somewhat inflated because some awkward features of its correlation matrix forced me to use the first principal factor to represent g. The g loadings for the white and black American children, extracted using Schmid-Leiman hierarchical factor analysis, are taken from Jensen and Reynolds (1982). If you mentally shrink the French loadings, the data leave a rough impression of homogeneity across populations, especially given the smallness of the French sample. To put it more quantitatively, the Burt congruence coefficients among the three samples all exceed 0.99. Especially striking to my eye is the unmistakable congruence of g loadings between American whites and blacks. This means that the WISC behaves very similarly in both groups; for a given increment of g, an observed subtest score moves by the same amount regardless of whether the testee is white or black. Moreover, Dolan and Hamaker (2001) have found that in these data the intercepts of the regressions of observed subtest scores on g and the scatter around the regression lines can also be tenably modeled as invariant for whites and blacks. (Incidentally, measurement invariance in factor loading/slope, but not intercept and scatter, does seem to hold for age cohorts manifesting the Flynn Effect.) But this is something of a peripheral issue at this point. The takeaway lesson for now is that there is no reason to reject the hypothesis that the WISC g loadings derived from the large American samples also apply to the French adoptees.
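The congruence claim is easy to check from the rounded loadings in the table. The sketch below assumes the usual definition of the congruence coefficient, the sum of cross-products divided by the square root of the product of the sums of squares (an uncentered cosine). The sample labels and function name are illustrative, and loadings rounded to two decimals will not reproduce the original coefficients exactly.

```python
from itertools import combinations
from math import sqrt

# g loadings copied from the table above, in subtest order:
# Information, Similarities, Arithmetic, Vocabulary, Comprehension,
# Picture Completion, Picture Arrangement, Block Design,
# Object Assembly, Coding.
loadings = {
    "French adoptees": [0.82, 0.77, 0.62, 0.75, 0.85,
                        0.49, 0.62, 0.70, 0.60, 0.33],
    "US white": [0.67, 0.67, 0.57, 0.72, 0.60,
                 0.51, 0.49, 0.65, 0.50, 0.37],
    "US black": [0.65, 0.62, 0.60, 0.71, 0.61,
                 0.47, 0.49, 0.61, 0.53, 0.36],
}


def congruence(x, y):
    """Congruence coefficient between two vectors of factor loadings."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


coefficients = {
    (first, second): congruence(loadings[first], loadings[second])
    for first, second in combinations(loadings, 2)
}

for (first, second), value in coefficients.items():
    print(f"{first} vs {second}: {value:.3f}")
```

Because the coefficient is uncentered, it rewards loadings that are all positive and of similar overall size, so values this high are less surprising than a Pearson correlation of the same magnitude would be.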

The striking feature of the g loadings is that the six subtests showing a significant B effect (Information, Similarities, Arithmetic, Vocabulary, Comprehension, Block Design) all have larger g loadings than the other four subtests. This pattern holds no matter which sample's g loadings are used, apart from a tie at 0.62 between Arithmetic and Picture Arrangement in the French sample. The A effect does not show this consistency. The strongly g-loaded Information and Block Design show significant A effects, but so do the weakly g-loaded Object Assembly and Coding. Of the four subtests that show stronger sensitivity to the A than to the B effect regardless of statistical significance (Information, Picture Completion, Object Assembly, Coding), three (Picture Completion, Object Assembly, Coding) fall among the bottom four in g loadings in every sample.
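One rough way to put a number on this pattern is to correlate the column of effect sizes with the column of g loadings across the ten subtests, which is essentially what Jensen calls the method of correlated vectors. The Python sketch below does this with the figures from the two tables above. It is a first approximation only: with ten subtests the correlations are imprecise, and no correction is made for differences in subtest reliability. All names are illustrative.

```python
from math import sqrt

# Subtest order: Information, Similarities, Arithmetic, Vocabulary,
# Comprehension, Picture Completion, Picture Arrangement, Block Design,
# Object Assembly, Coding.
b_effect = [1.965, 2.924, 2.659, 3.300, 2.477,
            0.877, 1.618, 3.112, 1.982, 0.877]
a_effect = [2.435, 1.577, 1.141, 1.500, 1.324,
            0.924, 0.782, 2.688, 2.218, 1.924]
g_loadings = {
    "French adoptees": [0.82, 0.77, 0.62, 0.75, 0.85,
                        0.49, 0.62, 0.70, 0.60, 0.33],
    "US white": [0.67, 0.67, 0.57, 0.72, 0.60,
                 0.51, 0.49, 0.65, 0.50, 0.37],
    "US black": [0.65, 0.62, 0.60, 0.71, 0.61,
                 0.47, 0.49, 0.61, 0.53, 0.36],
}


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Correlate each effect vector with each sample's vector of g loadings.
r_b = {sample: pearson(b_effect, g) for sample, g in g_loadings.items()}
r_a = {sample: pearson(a_effect, g) for sample, g in g_loadings.items()}

for sample in g_loadings:
    print(f"{sample}: r(B, g) = {r_b[sample]:.2f}, "
          f"r(A, g) = {r_a[sample]:.2f}")
```

With the white American loadings, the B effect tracks g loading much more closely than the A effect does, in line with the pattern described above.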

In his follow-up to the Capron and Duyme study, Arthur Jensen makes use of the adoptees' g factor scores to argue that g carries the bulk of the heritability of IQ. A person's factor score is his standing on the latent trait represented by the factor. You can think of a person's observed score X on, say, Vocabulary as roughly

X = gG + vV + uU + e

where G, V, and U are the person's respective factor scores on g, verbal ability, and some trait that uniquely affects Vocabulary; g, v, and u are the respective loadings of Vocabulary on g, the verbal factor, and its uniqueness; and e is measurement error and random noise. Dasen Luo, a referee of Jensen's paper, estimated the adoptees' factor scores on g using Bartlett's method, which aims to minimize the contamination of the estimates by non-g sources of variance. (Dr. Luo is good at math, but some might say that he's sort of boring. Bruce Lahn would love him.) Here is what is found using these estimated scores:


The mean standardized difference in g factor scores for the postnatal environmental effect (i.e., contrasting A+ and A- ...) is 0.129, t = 0.41, p = 0.68 (2-tail). The mean standardized difference for the biological effect (i.e., contrasting B+ and B- ...) is 0.861, t = 3.08, p = 0.004 (2-tail). In other words, the effect on the level of g of an extreme difference in SES environmental background is small and, in this study, even nonsignificant. In contrast, the effect on g of a difference in the SES level of the adoptees' biological parents is relatively large and highly significant.


I have been unable to duplicate these results. My own calculations using Bartlett's method on the Schmid-Leiman hierarchical g loadings from the white American standardization sample yield a much larger B effect on g factor scores, exceeding one standard unit. I also find a significant A effect, of approximately 0.7. Although the A effect on fancy g factor scores is somewhat smaller than its effect on vanilla IQ, both g factor scores and IQ are broadly similar in their responses to B and A. Why the discrepancy between my calculations and Dr. Luo's? I hesitate to impute error to Dr. Luo, but I admit that my own results make more intuitive sense to me. Deducing scores on latent factors from observed scores on manifest variables is an inverse problem that, strictly speaking, cannot be solved. The various methods of computing factor scores should thus be regarded as heuristic devices, each with a unique balance of advantages and disadvantages. With respect to estimating an individual's level of g, no method produces results that vary all that much from simply taking the unweighted sum of subtest scores to produce plain old Full Scale IQ. This is because of a theorem, proved by the psychometrician S.S. Wilks, that the correlation between two linear combinations of the same variables, where the weights are all positive, tends toward one as the number of variables increases. I cannot think of any reason why this should not hold here to prevent any reasonable weighting of subtest scores from providing insights that Full Scale IQ does not.
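The Wilks point can be illustrated with a deliberately simplified sketch. It assumes a single-factor model in which the ten standardized subtests correlate only through g, using the white American loadings from the table above, so that the model-implied correlation between two subtests is the product of their loadings. This ignores the group factors in the Schmid-Leiman solution and is not a reconstruction of my calculation or Dr. Luo's. Under that assumption, Bartlett's weights for a single factor are proportional to each subtest's loading divided by its uniqueness, and the correlation between the Bartlett-weighted composite and the unit-weighted sum can be computed directly.

```python
from math import sqrt

# g loadings for the white American standardization sample (table above).
g_loadings = [0.67, 0.67, 0.57, 0.72, 0.60,
              0.51, 0.49, 0.65, 0.50, 0.37]

# Assumption: one common factor, so uniqueness = 1 - loading squared.
uniqueness = [1 - loading ** 2 for loading in g_loadings]


def model_corr(i, j):
    """Correlation between subtests i and j implied by the one-factor model."""
    return 1.0 if i == j else g_loadings[i] * g_loadings[j]


def composite_cov(w1, w2):
    """Covariance between two weighted sums of the standardized subtests."""
    n = len(w1)
    return sum(w1[i] * w2[j] * model_corr(i, j)
               for i in range(n) for j in range(n))


def composite_corr(w1, w2):
    """Correlation between two weighted sums of the standardized subtests."""
    return composite_cov(w1, w2) / sqrt(composite_cov(w1, w1)
                                        * composite_cov(w2, w2))


# Unit weights stand in for Full Scale IQ; Bartlett weights for a single
# factor are proportional to loading / uniqueness.
unit_weights = [1.0] * len(g_loadings)
bartlett_weights = [l / u for l, u in zip(g_loadings, uniqueness)]

r_composites = composite_corr(unit_weights, bartlett_weights)
print(f"r(unit-weighted sum, Bartlett composite) = {r_composites:.3f}")
```

Under these assumptions the two composites are nearly interchangeable, which is the sense in which no reasonable positive weighting should tell a very different story from Full Scale IQ.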

Readers with psychometric expertise (or who know matrix algebra and have handy a textbook on factor analysis) are invited to pore over the raw data and see if I have made some mistake.

To what degree do the profiles of genetic and environmental influence demonstrated by Capron and Duyme resemble the profile of black-white differences within the US? The Pearson product-moment correlation between sensitivity to the B effect and the magnitude of the black-white difference is 0.55 (p = 0.10); that between the A effect and the black-white difference is 0.10 (p = 0.78). Since these results are somewhat inconclusive, it is desirable to check their consistency with other findings. Luo, Petrill, and Thompson (1994) analyzed WISC data on 148 MZ and 135 same-sex DZ pairs recruited from public, private, and parochial schools in Cleveland and, using more or less the same logic that led Posthuma et al. (2002) to conclude that the correlation between brain volume and IQ is mediated entirely by genetic factors, partitioned the variance in scores among various genetic and environmental sources. Luo and his colleagues did not compare the profile of genetic and environmental influences to black-white differences, so I will do so here for the first time. The correlation between a subtest's heritability and the magnitude of its black-white difference is 0.57 (p = 0.08); that between its susceptibility to shared environment and the black-white difference is negative (r = -0.32, p = 0.34). It is possible in these data to distinguish between a genetic g and an environmental g; you can think of the former as the result of covariation among subtests arising from common genetic sources and the latter as similar covariation induced by features of the shared environment that broadly affect test performance. The correlation between a subtest's loading on genetic g and the magnitude of its black-white difference is 0.70 (p = 0.02); that between its loading on environmental g and the black-white difference is 0.04 (p = 0.91). So, in this respect at least, things look much the same when you go from France to Cleveland.

So, my concluding impressions:

  • The results of the French cross-fostering adoption study strongly support the consensus that heredity accounts for a substantial portion of the variance in mental test scores.
  • It is certainly possible that substantial environmental effects can be found where the range of SES is unrestricted. It is unclear, however, that the effects demonstrated here in particular are stable over time. Longitudinal follow-up would have been informative.
  • It may be that the genetic effect on mental test scores is concentrated in the g factor. This is important for many reasons, including the general agreement that g accounts for the bulk of the predictive validity of IQ with respect to practical, real-life criteria. More rigorous approaches to this question than my own may be possible.
  • These results, while by no means deciding the matter, also leave entirely intact the reasonableness of the hypothesis that genetic differences account for as much of the black-white IQ gap as do environmental factors.