Sunday, November 25, 2007
Linguist: I can use R, you can't. Thus, your motives are questionable. QED.
posted by p-ter @ 11/25/2007 09:18:00 AM
Mark Liberman at Language Log (a blog which I very much enjoy, I should point out) approvingly links to Cosma Shalizi's rant against Slate for publishing a series of articles on race and IQ. His conclusion:
So to start with, you should ask yourself whether you can define and calculate the variance of a set of numbers, or the correlation between two sequences of numbers. If not, then read the (linked) Wikipedia articles -- and spend a little time playing with the concepts in the context of an interactive program like R. Once you've paid that entry fee, read Cosma's posts. (It's more fun than you might think -- I especially recommend the discussion of the heritability of zip codes, and you could go back and read the prequel about the heritability of accent.) And then go through William Saletan's articles, and decide for yourself what they mean about the abilities and motivations of the writer and his editors.

It's amazing how quickly people go from simple disagreement to armchair psychologist mode; a little perspective is in order here. Dr. Liberman assumes that Cosma concludes that heritability estimates are worthless. This is not the case. Cosma points out that estimating heritability involves making assumptions that are often incorrect; still (I feel like I've said this many times before), all models are wrong, but some are useful. And buried in his prose (which contains many important, ill-understood points about the estimation of heritability), he cites a nice paper on the heritability of IQ, which arrives at a narrow-sense heritability of ~0.34 (that is, additive genetic factors account for ~34% of the variance in IQ, see the linked post). Cosma wants to add additional parameters to this model before he makes any definitive statements, but he can't bring himself to treat IQ differently than other traits:

If you put a gun to my head and asked me to guess [whether there are genetic variants that contribute to IQ], and I couldn't tell what answer you wanted to hear, I'd say that my suspicion is that there are, mostly on the strength of analogy to other areas of biology where we know much more.
I would then - cautiously, because you have a gun to my head - suggest that you read, say, Dobzhansky on the distinction between "human equality" and "genetic identity", and ask why it is so important to you that IQ be heritable and unchangeable.

So if he had to guess, there is probably a genetic component to IQ, environment also plays a role, and human equality is not dependent on genetic identity. Seriously, read Saletan's column--these are exactly his points!

Referring back to my point about the utility of incorrect models, it's worth noting that, if you don't accept any of the heritability estimates proposed in humans, you're denying that any trait could have been shown to have a genetic component before, oh, 2001. I don't think that's a good idea, and here's why: the heritability of type II diabetes was estimated at a "mere" 0.25 (using all those horribly flawed methods, and including, since it is a dichotomous trait, even more assumptions); now molecular studies have identified at least 9 loci involved in the disease. The heritability of type I diabetes was estimated at about 0.88; now, there are 10 loci undoubtedly associated with the disease. There are other examples, and more sure to come, but suffice it to say that heritability studies, with all their seemingly ridiculous assumptions, are not worthless.

Now look to Cosma's post on g. Again, this time in the footnotes, we see something in line with Saletan's article. Referring to the observation by economist Tyler Cowen that some people he knew in a village in Mexico were smart in ways not measurable by IQ tests, he writes:

Cowen points out behaviors which call for intelligence, in the ordinary meaning of the word, and that these intelligent people would score badly on IQ tests. A reasonable counter-argument would be something like: "It's true that 'intelligence', in the ordinary sense, is a very broad and imprecise concept, and it's not surprising the tests don't capture it perfectly.
But the aspects of 'intelligence' they do capture are ones which are vastly more important for economic development than the ones displayed by Cowen's friends in San Agustin Oapan, however amiable or even admirable those traits might be in their own right." This would be a position about which one could have a rational argument. (Indeed, I might even agree with that statement, as far as it goes, as might A. R. Luria.)

So Cosma "might" agree that intelligence, as operationally defined by psychologists, is important for economic development and differs in distribution between groups. Interesting.

Cosma's posts seem to follow any discussion of IQ around in the "blogosphere". They're well-written, include legitimate discussion of many important issues in quantitative genetics and IQ testing (ok, I don't know much about IQ testing, but I'm assured this is the case by people who do), and come from an authority. But for whatever reason (I'm tempted to think that people don't actually read what he writes. I mean, it has, like, math and stuff), he's interpreted as saying that intelligence tests and the concept of heritability are entirely meaningless. That is not the case.

Labels: Genetics, IQ, Statistics
Monday, November 12, 2007
In an interesting story on the relationship between teen delinquency and sex (long story short: people who concluded early sex caused delinquency unsurprisingly failed to control for genetics and were led astray) I saw this little bit:
A recent study by Scottish researchers asked whether the higher IQs seen in breast-fed children are the result of the breast milk they got or some other factor. By comparing the IQs of sibling pairs in which one was breast-fed and the other not, it found that breast milk is irrelevant to IQ and that the mother's IQ explains both the decision to breast-feed and her children's IQ.

Now, this is interesting in light of the recent study claiming to find a gene-environment interaction between breast-feeding and a particular gene. The source for the claim that breast-feeding has no effect on IQ is here. I went back and looked at the recent paper's attempts at controlling for maternal IQ. Statistically, this is not a difficult thing to do-- a linear regression of child IQ on maternal IQ, breast-feeding status and genotype can easily be compared with a model that also includes a breast-feeding status X genotype interaction. The authors don't do this standard analysis, however--they only include a cryptic note explaining that there is no significant "interaction" between the SNP in question and maternal IQ. It's not that interaction term that's interesting, of course; it's whether the marginal effect of maternal IQ removes their already tenuous claims of an interaction between breast-feeding and genotype. One gets the distinct feeling that some unfavorable results are being swept under the rug. Combine this with the study above, then add your prior probability that by genotyping two (2!) SNPs in the entire genome you'll find a real gene-environment interaction, and, well, it's not a stretch to say the authors haven't quite demonstrated what they think they have.

Labels: Genetics, IQ, Statistics
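The model comparison described in the breast-feeding post above can be sketched in a few lines. This is only an illustration under stated assumptions, not the analysis from either paper: the data are simulated, the effect sizes are invented, genotype is coded as a hypothetical 0/1 carrier indicator, and the helper names (`solve`, `ols_rss`, `interaction_f_test`) are mine. It fits the model without the interaction term and the model with it, then computes the usual F statistic for the one extra parameter.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols_rss(rows, y):
    """Least squares via the normal equations; returns (coefficients, RSS)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, rss

def interaction_f_test(maternal_iq, breastfed, genotype, child_iq):
    """F statistic (1 numerator df) for adding a breastfed x genotype term."""
    n = len(child_iq)
    mean_m = sum(maternal_iq) / n
    # Reduced model: intercept, centered maternal IQ, breast-feeding, genotype.
    reduced = [[1.0, m - mean_m, b, g]
               for m, b, g in zip(maternal_iq, breastfed, genotype)]
    # Full model: the same columns plus the interaction column.
    full = [row + [row[2] * row[3]] for row in reduced]
    _, rss_reduced = ols_rss(reduced, child_iq)
    _, rss_full = ols_rss(full, child_iq)
    df_full = n - len(full[0])
    f_stat = (rss_reduced - rss_full) / (rss_full / df_full)
    return f_stat, rss_reduced, rss_full

# Simulated data (all numbers are assumptions): child IQ depends only on
# maternal IQ, and mothers with higher IQ are more likely to breast-feed.
rng = random.Random(0)
n = 500
maternal_iq = [rng.gauss(100, 15) for _ in range(n)]
breastfed = [1.0 if rng.random() < (0.6 if m > 100 else 0.3) else 0.0
             for m in maternal_iq]
genotype = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
child_iq = [100 + 0.5 * (m - 100) + rng.gauss(0, 10) for m in maternal_iq]

f_stat, rss_reduced, rss_full = interaction_f_test(
    maternal_iq, breastfed, genotype, child_iq)
print(f"F = {f_stat:.3f} on 1 and {n - 5} degrees of freedom")
```

The F statistic would then be compared against the F distribution with 1 and n - 5 degrees of freedom; in practice a statistics package (R's `lm` and `anova`, for instance) does all of this in two lines.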
Tuesday, March 06, 2007
A new paper in PNAS (open access) uses DNA isolated from seized ivory to investigate where elephant poaching is occurring. It's an interesting idea, but for me the idea itself takes a back seat to the clever statistical framework in which it's implemented. The analysis of DNA data is getting more and more sophisticated; this is an excellent example of that phenomenon.

The paper starts with very little data-- 37 tusks from this ivory seizure along with a database of DNA samples of elephants from all over Africa. Traditional methods would treat each of the 37 tusks independently, but the authors want to consider the possibility that all came from a single area as well as the possibility that all are from disparate parts of the continent (the two possibilities have different implications for law enforcement). So they create a grid on all of Africa (actually, just the subset containing the elephant range) and randomly split the grid into polygons, considering each polygon as equally likely to be part of the area where elephants were poached. This gives them a prior distribution for the origin of the poached elephants. They then use their data and the existing database to estimate the posterior distribution for that origin using Markov chain Monte Carlo. They provide evidence that this method works very well at distinguishing the two possibilities (a single origin of the elephants or a disparate collection), an advertisement for Bayesian methods and their ability to get as much information as possible from limited data.

As can be seen in the figure, all the elephants seem to come from an area centered around Zambia. This has had some consequences:

The seizure immediately followed Zambia's application to CITES for a one-off sale of their ivory stockpiles at COP12 (Conference of the Parties).
That application maintained that only 135 elephants were known to have been illegally killed in Zambia during the previous 10 years, woefully shy of the 3,000-6,500 elephants we estimate to have been killed in Zambia surrounding the seizure, let alone during that entire 10-year period. Subsequent to being informed of our findings, the Zambian government replaced its director of wildlife and began imposing significantly harsher sentences for convicted ivory traffickers in its courts.

Labels: Genetics, Statistics
Tuesday, February 27, 2007
One of the more heated debates in human medical genetics in the last decade or so has been centered around the Common Disease-Common Variant (CDCV) hypothesis. As the name implies, the hypothesis posits that genetic susceptibility to common diseases like hypertension and diabetes is largely due to alleles which have moderate frequency in the population. The competing hypothesis, also cleverly named, is the Common Disease-Rare Variant (CDRV) hypothesis, which suggests that multiple rare variants underlie susceptibility to such diseases. As different techniques must be used to find common versus rare alleles, this debate would seem to have major implications for the field. Indeed, the major proponents of the CDCV hypothesis were the movers and shakers behind the HapMap, a resource for the design of large-scale association studies (which are effective at finding common variants, much less so for rare variants).
However, CDCV versus CDRV is an utterly false dichotomy, as I'll explain below. This point has slipped past many of the human geneticists who actually do the work of mapping disease genes, and I feel the problem is this: essentially, geneticists are looking for a gene or the gene, so they naturally want to know whether to take an approach that will be the best for finding common variants or one for finding rare variants. However, common diseases do not follow simple Mendelian patterns-- there are multiple genes that influence these traits, and the frequencies of these alleles have a distribution. A decent null hypothesis, then, is to assume that the distribution of frequencies of alleles underlying a complex phenotype is essentially the same as the overall distribution of allele frequencies in the population-- that is, many rare variants and some common variants.

This argument would seem to favor the CDRV hypothesis. Not so. The key concept for explaining why is one borrowed from epidemiology called the population attributable risk--essentially, the number of cases in a population that can be attributed to a given risk factor. An example: imagine smoking cigarettes gives you a 5% chance of developing lung cancer, while working in an asbestos factory gives you a 70% chance. You might argue that working in an asbestos factory is a more important risk factor than cigarette smoking, and you would be correct--on an individual level. On a population level, though, you have to take into account the fact that millions more people smoke than work in asbestos factories. If everyone stopped smoking tomorrow, the number of lung cancer cases would drop precipitously. But if all asbestos factory workers quit tomorrow, the effect on the population level of lung cancer would be minimal. So you can see where I'm going with this: common susceptibility alleles contribute disproportionately to the population attributable risk for a disease.
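To put rough numbers on the smoking and asbestos example, here is a minimal sketch. The 5% and 70% risks come from the example above; the population size, the baseline risk for unexposed people, and the prevalence of each exposure are assumptions picked purely for illustration, and each factor is treated on its own, ignoring overlap between smokers and factory workers. The attributable fraction uses Levin's standard formula, p(RR - 1) / (1 + p(RR - 1)), where p is the prevalence of the exposure and RR is the relative risk.

```python
def attributable_cases(population, prevalence, risk_exposed, risk_unexposed):
    """Cases that would not occur if the exposed had the unexposed risk."""
    exposed = population * prevalence
    return exposed * (risk_exposed - risk_unexposed)

def attributable_fraction(prevalence, relative_risk):
    """Levin's formula: share of all cases attributable to the exposure."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

POPULATION = 1_000_000   # assumed population size
BASELINE_RISK = 0.005    # assumed lung cancer risk with neither exposure

# name -> (assumed prevalence of exposure, risk if exposed, from the example)
factors = {
    "smoking": (0.20, 0.05),
    "asbestos factory work": (0.0001, 0.70),
}

for name, (prevalence, risk) in factors.items():
    cases = attributable_cases(POPULATION, prevalence, risk, BASELINE_RISK)
    fraction = attributable_fraction(prevalence, risk / BASELINE_RISK)
    print(f"{name}: about {cases:.0f} attributable cases, "
          f"{fraction:.1%} of all cases")
```

Under these made-up inputs the individually weaker but far more common risk factor accounts for the large majority of attributable cases, which is the same logic that applies to common alleles of small effect.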
In type II diabetes, for example, a single variant with a rather small effect but a moderate frequency accounts for 21% of all cases[cite].

So am I then arguing in favor of the CDCV hypothesis? Of course not-- rare variants, aside from being predictive for disease in some individuals, also give important insight into the biology of the disease. But it is possible right now, using genome-wide SNP arrays and databases like the HapMap, to search the entire genome for common variants that contribute to disease. This is an essential step--finding the alleles that contribute disproportionately to the population-level risk for a disease. Eventually, the cost of sequencing will drop to a point where rare variants can also be assayed on a genome-wide, high-throughput scale, but that's not the case yet. Once it is, expect the CDRV hypothesis to be trumpeted as right all along.

Labels: Association, Genetics, Statistics