Tuesday, February 24, 2009

Male superiority at chess and science cannot be explained by statistical sampling arguments   posted by agnostic @ 2/24/2009 05:37:00 PM

A new paper by Bilalic et al. (2009) (read the PDF here) tries to account for male superiority in chess with a statistical sampling argument: men make up a much larger fraction of chess players, and the n highest extreme values -- say, the top-ranked 100 players -- are expected to be greater in a large sample than in a small one. In fact, this explanation is only a rephrasing of the question of why men are so much more likely to dedicate themselves to chess in the first place.

Moreover, data from other domains -- ones where men and women are equally represented in the sample, or where it's women who are overrepresented -- do not support the hypothesis: men continue to dominate, even when vastly underrepresented, in domains that rely on skills at which males excel compared to females. I show this with the example of fashion designers, where males are hardly present in the sample overall yet thrive at the elite level.

First, the authors review the data showing that male chess players really are better than female ones (p. 2):

For example: not a single woman has been world champion; only 1 per cent of Grandmasters, the best players in the world, are female; and there is only one woman among the best 100 players in the world.

The authors then estimate the male superiority at rank n, from 1 to 100, using the entire sample's mean and s.d., and the fraction of the sample that is male and female. Here is how the real data compare to this expectation (p.2):

Averaged over the 100 top players, the expected male superiority is 341 Elo points and the real one is 353 points. Therefore 96 per cent of the observed difference between male and female players can be attributed to a simple statistical fact -- the extreme values from a large sample are likely to be bigger than those from a small one.

Therefore (p. 3):

Once participation rates of men and women are controlled for, there is little left for biological, environmental, cultural or other factors to explain. This simple statistical fact is often overlooked by both laypeople and experts.

Of course, this sampling argument doesn't explain anything -- it merely pushes the question back a level. Why are men 16 times more likely than women to compete in chess leagues? We are back to square one: maybe men are better at whatever skills chess tests, maybe men are more ambitious and competitive even when they're equally skilled as women, maybe men are pressured by society to go into chess and women away from it. Thus, the question staring us in the face has not been resolved at all, but merely written in a different color ink.
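To be fair, the sampling effect itself is real. A quick Monte Carlo sketch -- with made-up pool sizes in roughly the 16 : 1 ratio above, and identical ability distributions for both groups -- reproduces the pattern the authors rely on:

```python
import random
import statistics

random.seed(0)

def top_mean(n_players, n_top=100, trials=100):
    """Average, over trials, of the mean of the top n_top draws
    from n_players standard-normal draws."""
    results = []
    for _ in range(trials):
        sample = sorted(random.gauss(0, 1) for _ in range(n_players))
        results.append(statistics.mean(sample[-n_top:]))
    return statistics.mean(results)

# Identical ability distributions; only the pool sizes differ (16:1).
large_pool = top_mean(16000)  # stand-in for the larger (male) pool
small_pool = top_mean(1000)   # stand-in for the smaller (female) pool
print(f"top-100 mean, large pool: {large_pool:.2f} SD")
print(f"top-100 mean, small pool: {small_pool:.2f} SD")
# The large pool's top 100 sit about 1 SD higher, with no underlying
# group difference at all -- that is the entirety of the paper's argument.
```

But, as argued above, this only restates the observation: it takes the 16 : 1 participation ratio as given, which is precisely the thing needing explanation.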

The authors are no fools and go on to mention what I just said. They then review some of the arguments for and against the various explanations. But this means that their study does not test any of the hypotheses at all -- aside from rephrasing the problem, the only portion of their article that speaks to which answer may be correct is a two-paragraph literature review. For example, maybe females on average perform more poorly on chess-related skills, and so weed themselves out earlier on, in the same way that males under 6'3 would be more likely than males above 6'3 to move on and find hobbies more suitable than basketball. Here is the authors' response to this hypothesis (p. 3, my emphasis):

Whatever the final resolution of these debates [on "gender differences in cognitive abilities"], there is little empirical evidence to support the hypothesis of differential drop-out rates between male and females. A recent study of 647 young chess players, matched for initial skill, age and initial activity found that drop-out rates for boys and girls were similar (Chabris & Glickman 2006).

Well no shit -- they removed the effect of initial skill, and thus how well suited you are to the hobby with no preparation, and so presumably due to genetic or other biological factors. And they also removed the effect of initial activity, and thus how enthusiastic you are about the hobby. And when you control for initial height, muscle mass, and desire to compete, men under 6'3 are no more or less likely to drop out of basketball hobbies than men over 6'3. How stupid do these researchers think we are?

So, this article really has little to say about the question of why men excel in chess or science, and it's baffling that it got published in the Proceedings of the Royal Society. The natural inference is that it was not chosen based on how well it could test various hypotheses -- whether pro or contra the Larry Summers ideas -- but in the hope that it would convince academics that there is really nothing to see here, so just move along and get home because your parents are probably worried sick about you.

Now, let's pretend to do some real science here. The authors' hypothesis is that the pattern in chess or science can be accounted for by their statistical sampling argument -- but of course, men dominate all sorts of fields, including fields where they're roughly equally represented in the pool of competitors, and even fields where they're outnumbered in that pool. Occam's Razor requires us to find a simple account of all these patterns, rather than postulating a separate one for each case. The simple explanation is that men excel in these fields due to underlying differences in genes, hormones, social pressures, or whatever.

The statistical sampling argument can only capture one piece of the pattern -- male superiority where males make up more of the sample. Any of the non-sampling hypotheses, including the silly socio-cultural ones, is at least in the running for accounting for the big picture of male dominance regardless of males' fraction of the sample.

To provide some data, I direct you to an analysis I did three years ago of male vs. female fashion designers. Here, I'll consider "the sample of fashion designers" to be students at fashion schools, since that's what the data were. Fashion students are the ones who will make up the pool of fashion designers upon graduating. I included four measures of eminence: 1) being chosen to enter the Council of Fashion Designers of America, 2) having an entry in two major fashion encyclopedias, both edited by women (Who's Who in Fashion, and The Encyclopedia of Clothing and Fashion), 3) having their collections listed on Vogue's website, and 4) winning the highest award of the CFDA, the Perry Ellis Award for emerging talent.

The male : female ratio in the pool of fashion students is 1 : 13 at Parsons and 1 : 5.7 at FIT. So, the female majority in the sample of fashion designers is not quite as extreme as that of males in chess leagues, but pretty close. The statistical sampling argument predicts that females should outnumber males at the top. But they don't -- the M : F ratios for the four measures above are, respectively, 1.29 : 1; 1.5 : 1 and 1.9 : 1 (for the two encyclopedias); 1.8 : 1; and 3.6 : 1. Again, this isn't as extreme as male superiority in chess, but recall how underrepresented males are in the sample to begin with!
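To see just how badly the sampling argument fails here, we can ask what it would predict. A small simulation -- assuming, hypothetically, a pool in the 1 : 13 Parsons ratio and identical ability distributions -- shows the top 100 should be overwhelmingly female:

```python
import random

random.seed(1)

# Hypothetical pool mirroring the Parsons ratio: 1 male per 13 females,
# with identical ability distributions for both sexes.
N_MALE, N_FEMALE, TOP, TRIALS = 1000, 13000, 100, 100

males_at_top = 0
for _ in range(TRIALS):
    pool = [("M", random.gauss(0, 1)) for _ in range(N_MALE)] + \
           [("F", random.gauss(0, 1)) for _ in range(N_FEMALE)]
    pool.sort(key=lambda pair: pair[1], reverse=True)
    males_at_top += sum(1 for sex, _ in pool[:TOP] if sex == "M")

expected_males = males_at_top / TRIALS
print(f"males per top {TOP} under pure sampling: {expected_males:.1f}")
# Pure sampling predicts roughly 7 males per 100 top designers (about 1:13),
# i.e., female dominance; the observed eminence ratios run the other way,
# from 1.29:1 up to 3.6:1 in favor of males.
```

Under pure sampling, each top slot is equally likely to go to any individual in the pool, so the expected fraction male at the top simply equals the pool fraction -- which is nothing like what we observe.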

(For other design fields that males tend to have greater interest in, such as architecture, the M : F ratios among the winners of the Pritzker Prize and the AIA Gold Medal are, respectively, 27 : 1 and 61 : 0).

The authors' statistical sampling argument is not a null hypothesis that we reject or fail to reject in any particular case -- rejecting it in fashion design, and failing to reject it in chess. It is not a hypothesis at all, but simply a rephrasing of the observation that men dominate certain fields, only measuring this by their greater participation rates. Again, it does not address why males are so much more likely to participate in chess leagues to begin with, which could be due to any of the existing hypotheses about male superiority. The point is that this is a widespread phenomenon that requires a single explanation applying across domains.

I find the genetic and hormonal influences on the mean and variance of cognitive ability and personality traits to be the most promising (just search our archives for relevant keywords to find the discussions). But this study of chess players offers nothing new to the debate, and could not do so even in principle, as it doesn't propose a novel hypothesis, apply a novel test to existing data, or apply existing tests to novel data. You can reformulate the observation or problem however you please, but that doesn't make the testing of hypotheses go away.


Bilalic, Smallbone, McLeod, and Gobet (2009). Why are (the best) Women so Good at Chess? Participation Rates and Gender Differences in Intellectual Domains. Proc. R. Soc. B, 276, 1161–1165.


Wednesday, April 04, 2007

New GRE cancelled - the cost of attempted gap-reduction?   posted by agnostic @ 4/04/2007 11:38:00 PM

The NYT reports that a completely revised GRE has been deep-sixed, not merely delayed (read the ETS press release here). The official story is that there is some insurmountable problem with providing access to all test-takers, an issue apparently too complicated for ETS to bother explaining to us. You'd figure that, since such a huge project was suddenly halted, they'd want to spell out clearly why they dumped it -- unless that's the point. Although I'm no mind-reader, the true reason is pretty obvious: the made-over test was designed to narrow the male-female gap at the elite score level, but this diluted its g-loadedness such that it couldn't reliably distinguish between someone with, say, a 125 IQ and a 145+ IQ -- which is what graduate departments that rely on super-smart students worry about. Rather than admit that this psychometric magic trick went awry and lopped off a few limbs of g-loadedness, they spun a yarn about access to the test. [1]

To put this in perspective, for those who took the SAT before spring 2005 -- which is everyone here, I assume -- the New SAT now includes a Writing test with both multiple choice grammar questions and a 25-minute persuasive essay. No admissions committee is paying serious attention to this silly addition, although high schoolers obsess over it. The real changes are that the Math test no longer includes the "quantitative comparison" questions (column A, B, equal, can't tell?), and the flavor of the questions is a bit more "book smarts"-based than before. Also, the Verbal test (now called Critical Reading) has zero analogies, fewer sentence completions, and much more passage-based reading. The gutted portions are those that are more highly g-loaded, for sure in the Verbal test, and most likely in the Math test as well. [2]

We now ask why ETS intentionally stripped the SAT of some of its g-loadedness. Certainly not because they discovered that IQ had little value in predicting academic performance, or that some items tap g more directly than others -- so why re-invent the wheel? Since scores on various verbal tasks are highly correlated, this change cannot have much affected the mean of any group of test-takers. But if getting a perfect score once required answering correctly, say, 10 easy questions, 5 medium, and 5 difficult (across 3 sections), a greater number of above-average students could come within striking distance of a perfect score if the new requirement were 10 easy, 9 medium, and 1 hard. I don't know exactly how they screwed around with the numbers, but that's what they pay their psychometricians big bucks to do. Now, reducing the difficulty of attaining elite scores, without also raising mean scores (as with the 1994 recentering), can only have had the goal of reducing a gap that exists at the level of variance, not a gap between means. This, then, cannot be a racial gap but must be the male-female gap, since here the difference in means is probably 0-2 IQ points, although male variance is consistently greater.
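The difficulty-mix point is easy to make concrete. Under a standard logistic item model (the difficulties and the 10/5/5 vs. 10/9/1 mixes here are illustrative assumptions, not ETS's actual numbers), swapping hard items for medium ones collapses the gap in perfect-score chances between a bright student and a very brainy one:

```python
import math

def p_correct(ability, difficulty):
    """Hypothetical logistic item model: P(correct) given ability
    and item difficulty, both measured in SD units."""
    return 1 / (1 + math.exp(-1.7 * (ability - difficulty)))

def p_perfect(ability, mix):
    """Probability of answering every item correctly.
    mix is a list of (difficulty, item_count) pairs."""
    p = 1.0
    for difficulty, count in mix:
        p *= p_correct(ability, difficulty) ** count
    return p

# Illustrative difficulties: easy = -1 SD, medium = +1 SD, hard = +2.5 SD.
old_mix = [(-1, 10), (1, 5), (2.5, 5)]   # 10 easy, 5 medium, 5 hard
new_mix = [(-1, 10), (1, 9), (2.5, 1)]   # 10 easy, 9 medium, 1 hard

for z in (1.5, 3.0):   # a bright student vs. a very brainy one
    print(f"ability {z} SD: P(perfect) = {p_perfect(z, old_mix):.4f} (old mix), "
          f"{p_perfect(z, new_mix):.4f} (new mix)")
```

With the old mix, the very brainy student's odds of a perfect score dwarf the bright student's by orders of magnitude; with the new mix, the bright student is suddenly within striking distance -- exactly the compression at the top described above.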

Certainly this reduces the power of the SAT to detect very brainy people -- those with an IQ of 145 or 160 or whatever big number you want -- but I can easily imagine that both ETS and elite universities such as Harvard were willing to trade off a bit of g-loadedness in order to close the male-female gap at the elite level. Harvard students wouldn't look stupider, of course: their prestige is based on their mean SAT score compared to those of others. And they probably have other ways of figuring out who is very brainy vs. fairly smart. (As an aside, this also explains why lots more high-scoring applicants will be rejected by top schools, another paradox that is easily, even if only partially, resolved by clear thinking.) Moreover, attending Harvard isn't all about having a 145 IQ -- a non-trivial number of their graduates will join professions that don't require eighth-grade algebra or sophisticated analysis (say, political office). So that, too, may lessen their concern over the SAT becoming somewhat less g-loaded.

Not so with the GRE -- those who score at the elite level here are hardcore nerds who are planning to do serious intellectual work, and elite graduate departments pay attention mostly to the applicant's intellectual promise. MIT's math department probably doesn't care that an applicant scored 650 on the Math portion but showed singular potential for leadership roles. So, I imagine something similar to the SAT make-over happened, only this time the professors and/or ETS' psychometricians discovered that it would make a joke of a test used to detect the very brainy in search of elite graduate work.

To make this concrete, let's assume that, among applicants to graduate school in the arts and sciences (i.e., future scholars, not professionals), males enjoy only a 0.1 SD advantage in mean IQ (or 1.5 IQ points), as well as a 0.05 SD advantage in their standard deviation. Then a test that is reliable up to 3 SD above the female mean will have 30% of those above this threshold being female. (For comparison with the real world, grad students at CalTech are 30% female.) Almost 10 percentage points can be gained by dumbing the test down so that it's only reliable up to about 2 SD, in which case 39% at the top will be female. Dumbing it down further so that it can only detect those 1 SD above the female mean just adds about 5 further percentage points; females will make up 45%. My guess is that they weren't foolish enough to toy around with a GRE that only tested up to an IQ of 115, but that they took a risk on some version that tested up to about an IQ of 130. Though that's just about enough to get you into MENSA, the real hullabaloo over sex disparities has raged within the halls of the uber-elite: Harvard (Larry Summers), MIT (Nancy Hopkins), Stanford (Ben Barres), and so on. At such an elite level, an applicant with an IQ of 130 would be like a 6'3 guy trying out for the NBA (whose mean height is 6'7). Although the NBA doesn't automatically weed out those 6'3 and under, surely the recruiters would protest to the manufacturers if their new-fangled measuring sticks only measured up to 6'3!
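These figures can be checked back-of-the-envelope with normal tail probabilities. The pool composition is my assumption (equal numbers of male and female applicants); the mean and SD offsets are the ones stated above. The computed fractions come out at roughly 32%, 39%, and 45%, close to the round numbers in the text:

```python
import math

def tail(z):
    """P(X > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Assumed pool: equal numbers of male and female applicants.
# Female ability ~ N(0, 1); male ~ N(0.1, 1.05), i.e. +0.1 SD mean
# advantage and +0.05 SD greater spread, as stated above.
M_MEAN, M_SD = 0.1, 1.05

for ceiling in (3, 2, 1):   # the test's reliable range, in SD above the female mean
    f = tail(ceiling)                         # fraction of women above the ceiling
    m = tail((ceiling - M_MEAN) / M_SD)       # fraction of men above the ceiling
    print(f"reliable up to {ceiling} SD: "
          f"{f / (f + m):.0%} of those above the ceiling are female")
```

The qualitative point survives the exact parameter choices: the lower the ceiling at which the test stops discriminating, the more the tail disparity driven by greater male variance is hidden.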

Pursuing this hunch, I picked up my Kaplan GRE self-study book and found out that they knew at least roughly what the new GRE was going to look like. Here were the proposed new question types for Verbal and Math:

Verbal:
Reading Comprehension (4 types)
Sentence Completion (2 types)

Math:
Word Problems (4 types)
Data Interpretation (2 types)
Quantitative Comparison (1 type, as before)

Notice the huge change in the Verbal test, which parallels the change in the SAT Verbal test: analogies are gone, and most of the test is reading comprehension. As for Math, they did keep the Quant Comps, but most of the new question types sound too touchy-feely to be of much use: Word Problems include old-fashioned ones, plus "Free Response," "All That Apply," and "Conditional Table" (Kaplan admits they didn't know the exact names -- maybe the last was a contingency-table type?). "Free Response" sounds like it would be more g-loaded, since you can't rely on answer choices, but it definitely isn't -- at least not if this type were to resemble its counterpart on the SAT. There, you grid in your own answer, but only non-negative rational numbers can be gridded. That precludes any question whose answer contains a root, exponent, or absolute value; whose trick hinges on the properties of positives vs. negatives vs. 0; whose answer is an equation or inequality; and, most importantly, whose point is abstract symbol manipulation (such as "solve for V in terms of p, q, and r"). Since females are better than males at calculation, and worse than males on more abstract math problems, "Free Response" is an easy way to obscure the male advantage at "thinking" math.

Not knowing much about what the other two new types of Word Problems are, I think it's still safe to say they were just as vacuous. In fact, the Data Interpretation problems were to come in 2 types: the old-fashioned one, and a new one called -- don't laugh -- "Sentence Completion"! For Christ's sake, why not just turn some of the harder ones into Writing problems in disguise, where the test-taker corrects the grammar of a word problem rather than actually solving it? This psychometric flimflam is ultimately what all would-be gap-reducers must reduce themselves to, at least when the concern is the sex gap at uber-elite levels, where those who matter will brook no nonsense over the basic tests being dumbed down.

[1] Since I'm pretty tired by now of writing about the "women in science" topic, for background info I'll just link to a very lengthy post of mine on point, plus Steven Pinker's debate with Elizabeth Spelke.

[2] See p.2 of the full PDF linked to in this post from the GNXP archives. It contains a graphic showing the g-loadedness of various cognitive tasks. Analogies are the most highly g-loaded verbal tasks, reading comprehension one of the least so (though still enough to validate its use on tests).
