Better comprehension through visualization

Zack has started to improve on static R plots with Google-powered charts. Check it out. Alas, I can’t inject script tags into the body of my posts, so that’s not feasible for me. Notice on Zack’s plot that I’m more East Asian than either of my parents. The tendency first cropped up with 23andMe’s ancestry painting, and I have seen it in my own ADMIXTURE runs, so I no longer dismiss it as a V2 vs. V3 chip artifact. Though I’ve ordered an upgrade myself, so we’ll see for sure. Also, though both my parents have about the same East Asian fraction, they exhibit a different balance of East Asian subcomponents. I’ve seen this in my own ADMIXTURE runs, and I’m going to check for more fine-grained matches with the HGDP East Asian populations soon to ascertain whether their eastern ancestral mix is different. Good times.

Sweeping through a fly's genome


Credit: Karl Magnacca

The Pith: In this post I review some findings on patterns of natural selection within the Drosophila fruit fly genome. I relate them to very similar findings, though in the opposite direction, in human genomics. Different forms of natural selection and their impact on the structure of the genome are also spotlighted over the course of the review. In particular, I explore how specific methods to detect adaptation on the genomic level may be biased by the assumptions of classical evolutionary genetic models. Finally, I try to place these details in the broader framework of how best to understand evolutionary process in the “big picture.”

A few days ago I titled a post “The evolution of man is no cartoon”. The reason I titled it such is that as the methods become more refined and our data sets more robust it seems that previously held models of how humans evolved, and evolution’s impact on our genomes, are being refined. Evolutionary genetics at its most elegantly spare can be reduced down to several general parameters: drift, selection, migration, etc. Exogenous phenomena such as the flux in census size, or environmental variation, have a straightforward relationship to these parameters. But to some extent the broadest truths are nearly trivial. Getting down to brass tacks, what are these general assertions telling us? We don’t know yet. We’re in a time of transitions, though not troubles.

Going back to cartoons, starting around 1970 there was a series of debates which hinged on the role of deterministic adaptive forces and random neutral ones in the domain of evolutionary process. You have probably heard terms like “adaptationist,” “ultra-Darwinian,” and “evolution by jerks” thrown around. All great fun, and certainly ripe “hooks” to draw the public in, but ultimately that phase in the scientific discourse seems to have been beside the point. A transient between the age of Theory, when there was too little of the empirics, and now the age of Data, when there is too little theory. Biology is a very contingent discipline, and it may be that questions of the power of selection or the relevance of neutral forces will loom large or small dependent upon the particular tip of the tree of life to which the question is being addressed. Evolution may not be a unitary oracle, but rather a cacophony from which we have to construct a harmonious symphony for our own mental sanity. Nature is one, and the joints which we carve out of nature’s wholeness are for our own benefit.

The age of molecular evolution, ushered in by the work on allozymes in the 1960s, was just a preface to the age of genomics. If Stephen Jay Gould and Richard Dawkins were in their prime today I wonder if the complexities of the issues at hand would be too much even for their verbal fluency in terms of formulating a concise quip with which to skewer one’s intellectual antagonists. Complexity does not make fodder for honest quips and barbs. You’re just as liable to inflict a wound upon your own side through clumsiness of rhetoric in the thicket of the data, which fires in all directions.

In any case, on this weblog I may focus on human genomics, but obviously there are other organisms in the cosmos. Because of the nature of scientific funding, and for reasons of biomedical application, humans have now come to the fore, but there is still utility in surveying the full taxonomic landscape. As it happens a paper in PLoS Genetics, which I noticed last week, is a perfect complement to the recent work on human selective sweeps. Pervasive Adaptive Protein Evolution Apparent in Diversity Patterns around Amino Acid Substitutions in Drosophila simulans:


Should you go to an Ivy League School, Part II

One of the topics I’ve covered here is the all-important issue of whether your choice in College matters in terms of your future earnings. To recap: the best research in the field until a few days ago suggested that the returns to going to a more selective College were quite large; a result which was somehow interpreted by many to suggest the exact opposite claim.

The original result that kicked this off was a working paper by Dale and Krueger. They realized that simply comparing students who went to top schools with students who didn’t generates an obvious source of bias: students who go to highly ranked schools tend to earn more than others, but this may be due either to the impact of the school or to the personal characteristics that got them in to begin with. To correct for this, the authors compared students who got into top schools and chose to go with students who got into those same schools but decided to matriculate elsewhere. This is also not a perfect comparison, but it manages to correct tremendously for this form of bias.
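
To see why the matched comparison helps, here is a toy simulation. It is only a sketch with made-up parameters, not Dale and Krueger’s data or specification: I assume a single hypothetical `ability` trait drives both admission and earnings, that the school itself adds nothing to earnings, and that admitted students enroll by coin flip.

```python
import random
from statistics import mean

def simulate(n=50_000, seed=0):
    """Toy model: ability drives admission and earnings; school effect is zero."""
    rng = random.Random(seed)
    attended, admitted_elsewhere, everyone_else = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        earnings = ability + rng.gauss(0, 1)   # note: no school term at all
        admitted = ability > 1.0               # hypothetical admissions cutoff
        if admitted and rng.random() < 0.5:    # coin flip: enroll at the top school
            attended.append(earnings)
        elif admitted:                         # admitted, but went elsewhere
            admitted_elsewhere.append(earnings)
        else:
            everyone_else.append(earnings)
    # Naive comparison: attendees vs. everybody who did not attend.
    naive_gap = mean(attended) - mean(admitted_elsewhere + everyone_else)
    # Matched comparison: attendees vs. students admitted to the same school.
    matched_gap = mean(attended) - mean(admitted_elsewhere)
    return naive_gap, matched_gap

naive_gap, matched_gap = simulate()
print(f"naive gap:   {naive_gap:.2f}")
print(f"matched gap: {matched_gap:.2f}")
```

In this toy world the naive gap is large even though the school does nothing, while the matched gap hovers near zero. The real design is less clean, since the decision to enroll is not a coin flip, which is why the comparison is not perfect.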

Their results suggested that something about the school was important. In the jargon, they ran a regression of future self-reported income against the identity of all 30 schools in their sample, and found that going to one school instead of another impacted your future income. Then they looked at the particular factors which might explain that, and found that what mattered was the tuition the school charged as well as its level of admissions selectivity as reported by Barron’s; but not the average SAT of the school.

Their published paper performed virtually the same calculations. They found that the choice of school mattered; the tuition charged mattered; and that the average SAT of the school did not. Bizarrely, they claim that the result of the test for Barron’s selectivity was now no longer important, but they did not report any estimates from that specification (I’m not quite sure how that result could have changed, since the authors did not make any sample or specification changes between the two papers). In any case, even if the Barron’s selectivity measure doesn’t matter, it was clear that something about the choice of school matters, and that tuition charged is a good proxy for figuring out what that something is. In fact, their results suggest that every extra dollar of tuition provides something like a 13-15% internal real rate of return (down from a nominal 20-30% in the working paper). As is covered elsewhere, the results for SAT were highlighted, while the results for tuition were less discussed, even by the authors of the original paper.

Dale and Krueger are back with a new paper, which looks at another age group and also uses government income data rather than self-reported income. Given that the correlation between self-reported income and actual income is .90, you might expect the results to be quite similar. Certainly, this is what David Leonhardt suggests in his writeup. In fact, the results are rather different. The authors now claim that neither the average SAT of the school, the tuition it charges, nor its selectivity influences future income. I have a few quibbles with this paper:

1) Unlike prior versions of their study, in this paper Krueger and Dale don’t run a specification testing whether Colleges matter at all, as opposed to the particular variables of SAT, selectivity, or tuition. So even if the authors are correct in suggesting that, with the availability of new data and different age groups, none of their chief College selectivity variables predict future income, we don’t know whether some other aspect of College does. It’s possible that your choice of College matters even more than before, but in a different manner; i.e., tuition paid could be a worse measure today given widespread tuition inflation, U.S. News & World Report could have changed College rankings, and so forth.

2) In looking at why their results changed for this paper, Krueger and Dale find that their effects already diminish when using the sample of Colleges used for this paper as opposed to the sample from the old paper, and diminish even more when using government income rather than self-reported income. This tells us two things. First, the schools dropped for this paper (Denison, Hamilton, Kenyon, Rice, UNC) may matter a lot for future income, or else the inclusion of two historically black Colleges might affect the results. Second, it’s puzzling that the results would change dramatically depending on the source of income. We know from other studies that individuals systematically under-report income both to surveys and in official government data. It’s not clear that the government data is “better” in the sense of giving a more accurate picture. The authors also exclude income received from capital gains, which doesn’t strike me as a good exclusion. Either students who went to elite schools lie more about their income, or are better at hiding it from the government (or else receive more of it in the form of capital gains). All that we can seriously say is that the conclusion you draw depends enormously on the data source you use for income and on the set of Colleges.

3) The results for both tuition and selectivity still show sizable effects for the 1976 cohort. Their Table 5 breaks out the effects of tuition and College selectivity by years. While none of these regressions are statistically significant on their own, the net effect is quite large. I applied the estimates on wages to the actual median wages in each time period (interpolating when the authors did not provide actual wage statistics). I estimate that a one percent increase in 1976 tuition (perhaps $100 total over four years) results in roughly a two percent increase in overall compensation through 2007 (assuming that you work for all 24 years), or $43k in non-inflation adjusted dollars. Alternately, a category shift in the Barron’s selectivity criteria (i.e., from Highly Competitive to Most Competitive) is associated with $45k more in lifetime income. The effects of both selectivity and tuition grow over time, and are at their highest for wages observed in 2003-2007 (at this point, a one percent increase in tuition paid in 1976 gets you roughly $4k more per year). Presumably, this will rise even more by the time this cohort retires.

While the results from any one regression may not be statistically significant, that may simply be due to their sample size. The cumulative effect appears rather large in magnitude for both of the measures that were quite important in earlier drafts of the paper. This does change substantially when looking at the 1989 cohort, and it is very possible that College selectivity is less important today (or else that group has not been in the workforce long enough to measure an effect).

4) Robin Hanson has some good commentary as well, focusing on the fact that the estimate for the effect of average school SAT on female earnings is negative and statistically significant. He suggests women going to more prestigious schools marry high earners, and so feel less need to make money themselves. I’ll only note that the results are restricted to full-time earners, so it’s unlikely that this result is being generated by women withdrawing from the workplace altogether. The tuition/selectivity results above apply to a pooled sample of men and women, and so may result in even higher estimates for male workers.
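
The back-of-envelope method in point 3 (apply each period’s estimate to that period’s median wage, then add up the years) can be sketched roughly as follows. The function name `cumulative_gain` and every number in `example_periods` are hypothetical placeholders for illustration; they are not the Table 5 coefficients or the actual median wages, so the printed total is not my $43k figure.

```python
def cumulative_gain(periods, pct_change):
    """Sum the extra earnings implied by a percentage change in tuition.

    periods: list of (elasticity, median_annual_wage, years) tuples, where
             elasticity is the % change in wages per 1% change in tuition.
    pct_change: the tuition change being considered, in percent.
    """
    total = 0.0
    for elasticity, median_wage, years in periods:
        # Extra wages per year in this period, then scaled by years worked.
        wage_bump = elasticity * (pct_change / 100.0) * median_wage
        total += wage_bump * years
    return total

# Hypothetical inputs for illustration only; NOT the values from the paper.
example_periods = [
    (1.0, 30_000, 10),
    (2.0, 60_000, 10),
    (3.0, 90_000, 5),
]
gain = cumulative_gain(example_periods, pct_change=1.0)
print(f"cumulative gain: ${gain:,.0f}")
```

The point of laying it out this way is that small, individually insignificant per-period estimates can still add up to a large cumulative number, especially when the later, higher-wage periods carry the larger estimates.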

Anyway, go check out all the papers referenced here. My prior belief on this, created by the first two Krueger and Dale papers, is that the College you go to affects your earnings. This new paper shakes this belief somewhat, and I am now not sure either way. Unfortunately, this data isn’t released publicly, so I can’t check to see if the authors’ calculations hold up depending on how you cut the data. In any case, you probably shouldn’t be basing your choice of College on any study, and certainly not on this blog.

Visualizing "typical" Eurasians

A few weeks ago I started looking at the 23andMe raw files of some of my friends and integrating them into HGDP and HapMap population data sets. One of the first things I did was remove the African populations from my total data. The reason is that, as you can see to the left, Africans occupy the largest principal component of variation, which sets them apart from Eurasians. Without this dimension of variation the non-Africans are squeezed into one dimension, and groups like Oceanians and Amerindians show up in the strangest places. But that’s because these groups are non-African, and do not differ as much along the primary west-east axis of genetic variance which shakes out of any such analysis. Africans aren’t the only issue though. As I’ve noted before I’ve been running ADMIXTURE, and isolated groups such as the Kalash can “monopolize” one particular color. This may be due to the Kalash being some distilled essence of an ancestral population, but I suspect that it’s more genetic drift due to isolation which has made these sorts of groups distinctive. So I removed these outliers…though do note that other “outliers” quite often pop out of the data to take their place.
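
Here is a rough sketch of the outlier effect described above, using simulated genotypes rather than the HGDP or HapMap data. Everything in it is an assumption made for illustration: three made-up populations “A”, “B”, and “C”, invented allele frequencies, and a bare-bones PCA via NumPy’s SVD rather than whatever software you would actually use. Population A is very different from the other two, while B and C differ only modestly from each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_outlier_snps, n_axis_snps = 40, 200, 100

# Hypothetical allele frequencies: pop A is an outlier on the first block of
# SNPs; pops B and C differ from each other only on the second block.
freqs = {
    "A": np.r_[np.full(n_outlier_snps, 0.9), np.full(n_axis_snps, 0.5)],
    "B": np.r_[np.full(n_outlier_snps, 0.1), np.full(n_axis_snps, 0.7)],
    "C": np.r_[np.full(n_outlier_snps, 0.1), np.full(n_axis_snps, 0.3)],
}
labels = np.repeat(list(freqs), n_per_pop)
# Genotypes coded as 0/1/2 copies of an allele, one row per individual.
genotypes = np.vstack([
    rng.binomial(2, p, size=(n_per_pop, p.size)) for p in freqs.values()
])

def pca_scores(matrix):
    """Return PC scores and the fraction of variance each PC explains."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u * s, s**2 / np.sum(s**2)

def gap_on_pc1(scores, pop_labels, pop1, pop2):
    """Absolute distance between two population means along PC1."""
    return abs(scores[pop_labels == pop1, 0].mean()
               - scores[pop_labels == pop2, 0].mean())

all_scores, all_var = pca_scores(genotypes)
keep = labels != "A"                      # drop the outlier population
sub_scores, sub_var = pca_scores(genotypes[keep])

print("with A:    B-C gap on PC1 =", round(gap_on_pc1(all_scores, labels, "B", "C"), 2))
print("without A: B-C gap on PC1 =", round(gap_on_pc1(sub_scores, labels[keep], "B", "C"), 2))
```

With the outlier included, the first component is spent separating A from everyone else and B and C sit almost on top of each other along it. Drop A and the first component is freed up to separate B from C.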

Below is a slide show with the PCAs of the 1st component of variance plotted with the 2nd, 3rd, and 4th components. At the 5th and beyond it seems that the lower eigenvectors achieve a level of stability in magnitude. Remember that the plots are not scaled. The 1st PC is about an order of magnitude bigger than the 2nd. I’ve also attached an ADMIXTURE plot with K = 12, both for populations, and for the individuals who have given me their 23andMe files. I’ve placed them upon the PCA. And yes, ID001 and ID002 are my parents.


The evolution of man is no cartoon

I was semi-offline for much of last week, so I only randomly heard from someone about the “Science paper” on which Molly Przeworski is an author. Finally having a chance to read it front to back, it seems rather a complement to other papers, addressed to both man and beast. The major “value add” seems to be the extra juice they squeezed out of the data because they looked at the full genomes, instead of just genotypes. As I occasionally note, the chips are marvels of technology, but the markers which they are geared to detect are tuned to the polymorphisms of Europeans.

Classic Selective Sweeps Were Rare in Recent Human Evolution:

Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of “classic selective sweeps” (in which a beneficial mutation arises and rapidly fixes in the population). Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent-sweep model, we found that diversity levels decrease near exons and conserved noncoding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched in alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of human adaptation over the past ~250,000 years.


Tea leaves and population substructure


Image credit: Wikimol

Over the past few months I’ve been encouraging people to pull down ADMIXTURE, and push the public data sets through it. Additionally, you can convert your 23andMe raw file into pedigree format pretty easily and integrate it into the public data sets with PLINK. I’ve been following Zack’s Harappa Ancestry Project pretty closely, but I’ve also been running the software myself, manipulating its parameters and seeing how things shake out. But the more I do it, the more I wonder if it isn’t like regression analysis, a technique which is just waiting to be leveraged by human biases. I began thinking of this more deeply after a conversation with a computational biologist who outlined the structural problems with how ad hoc the utilization of statistics is in the life sciences.
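
For the curious, the conversion to pedigree format amounts to something like the sketch below. It assumes the usual tab-separated raw-file layout (rsid, chromosome, position, genotype, with `#` comment lines) and writes the text PED/MAP layout that PLINK reads. The function names, the family and individual IDs, and the three sample markers are all made up for illustration, and the sketch does nothing about strand, genome build, or insertion/deletion calls.

```python
import io

def parse_23andme(handle):
    """Yield (rsid, chrom, pos, genotype) from a 23andMe-style raw file."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and header comments
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        yield rsid, chrom, int(pos), genotype

def to_ped_map(records, family_id="FAM1", individual_id="ID001"):
    """Build one PED line and the matching MAP lines for a single person."""
    map_lines, alleles = [], []
    for rsid, chrom, pos, genotype in records:
        # MAP columns: chromosome, marker ID, genetic distance (0 = unknown), bp
        map_lines.append(f"{chrom}\t{rsid}\t0\t{pos}")
        if len(genotype) == 2 and "-" not in genotype:
            a1, a2 = genotype                  # ordinary diploid call
        elif len(genotype) == 1 and genotype != "-":
            a1 = a2 = genotype                 # haploid call, e.g. male X
        else:
            a1 = a2 = "0"                      # no-call becomes missing
        alleles.extend([a1, a2])
    # PED columns: family, individual, father, mother, sex, phenotype, alleles
    ped_line = " ".join([family_id, individual_id, "0", "0", "0", "-9"] + alleles)
    return ped_line, map_lines

# Made-up example rows, not real markers.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\t1\t3000\t--\n"
)
ped_line, map_lines = to_ped_map(parse_23andme(io.StringIO(SAMPLE)))
print(ped_line)
print("\n".join(map_lines))
```

Once you have the two files for yourself, merging with the public data sets is a matter of keeping only the markers the data sets share, which is the part where PLINK earns its keep.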

These sorts of qualms are probably why I’m posting my results more on Facebook and passing them around friends, rather than putting them out there in the public domain. It isn’t that I think the results are going to be abused. I just don’t know what they mean a lot of the time. Or, perhaps more honestly I am suspicious of my own propensity to see what I suspect. A case of my priors strongly shaping the inferences which I might generate.

So I decided to do an experiment. Below are 8 runs, displayed as bar plots. Each thin sliver represents an individual. The colors again represent putative ancestral populations of which the modern populations are combinations, with their number set by the parameter K (so K = 2 means two ancestral populations, each corresponding to a different color). There are two data sets which I analyzed, group A and group B. I’ve also noted the K’s for each plot. But aside from that, I’ll leave you ignorant of what these populations are or how many there are. Jot down some ideas as to what you can see. How many populations? How do they relate to each other? Can you perceive any real information in the higher K’s? I’ll put the “answers” below the fold. There’s no point in me saying what I think; I already know which populations these are, so I’m tainted.

[Image album: 8 ADMIXTURE bar plots]


The 2007 crash in genome sequencing costs

Dr. Daniel MacArthur suggests:

Now, we don’t want everyone working in genomics to start using the same blue-on-grey slide to illustrate the impending datapocalypse; so I’d encourage people to download the raw data (warning: Excel file) and make their own pretty pictures.

My straightforward attempts are below. Get the raw data and try your own. I assume that R and gnuplot could produce something prettier….


Personal genome in the public domain


Image credit: Vikrum Lexicon

Manu Sporny reflects on one week of being in the public domain in terms of personal genomics. I already pulled down his data, as has Zack. The whole post is fascinating, but this is really interesting: “I found out that it’s illegal to send any of your genetic material outside of Russia to have it analyzed.” In a related vein, see Dr. Daniel MacArthur’s When “Cautious” Means “Useless.” I know that 23andMe is a for-profit business in it to make money for its backers, but there are certainly huge social spillover effects among my set in its bringing 500,000 to 1 million markers to the masses. It’s a clear concrete case of how innovation can result in positive gains across society. I am not a knee-jerk libertarian, but your genetic data is your genetic data. Own it, analyze it, and claim it!

Media me

I’ve been rather busy this week, so few posts. But, I did a Bloggingheads.tv with Milford Wolpoff. We talk Out of Africa, Multiregionalism, and such. Second, The New York Times profiled Secular Right, where I’m a contributor. The quotes were accurate, though I do find it amusing that the reporter refers to me as an apostate, but not John Derbyshire (who until ~5 years ago was a confessing Christian). I suspect that in this day and age the term “apostate” only has strong valence in relation to Islam. For the record, several ex-Muslims have disputed my apostasy, since I barely ever believed in the Islamic religion.