Population structure, concrete and ineffable

Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data.” Genetics 155.2 (2000): 945-959.

Before there was Structure there was just structure. By this, I mean that population substructure has always been. The question is how we as humans shall characterize and visualize it in a manner which imparts some measure of wisdom and enlightenment. A simple way to assess population substructure is to visualize the genetic distances across individuals or populations on a two-dimensional plot. Another quite popular way is to represent the distances on a neighbor-joining tree, as on the left. As you can see, this is not always satisfying: dense trees with too many tips are often almost impossible to interpret beyond the most trivial inferences (though there is an aesthetic beauty in their feathery topology!). And where graphical representations such as neighbor-joining trees and MDS plots remove too much relevant information, cluttered FST matrices have the opposite problem. All the distance data is there in its glorious specific detail, but there’s very little Gestalt comprehension.


What the Harappa Ancestry Project has resolved

My friend Zack Ajmal has been running the Harappa Ancestry Project for several years now. This is a non-institutional complement to the genomic research which occurs in the academy. His motivation was in large part to fill in the gaps of population coverage within South Asia which one sees in the academic literature. Much of this is due to politics, as the government of India has traditionally been reluctant to allow sample collection (ergo, the HGDP data uses Pakistanis as its South Asian reference, while the HapMap collected DNA from Indian Americans in Houston). Of course this sort of project is not without its own blind spots. Zack must rely on public data sets to get a better picture of groups like tribal populations and Dalits, because they are so underrepresented in the Diaspora from which he draws many of the project participants.

Once Zack has the genotype, one of the primary things he does is add it to his broader data set (which includes many public samples) and analyze it with the Admixture model-based clustering package. What Admixture does is take a specified number of populations (e.g. K = 12) and assign each individual a proportion of ancestry from each of them. So, for example, individual A might be assigned 40% population 1 and 60% population 2 for K = 2. Individual B might be 45% population 1 and 55% population 2. These are not necessarily ‘real’ populations. Rather, the populations and their proportions are there to allow you to discern patterns of relationships across individuals.
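To make the K = 2 example concrete, here is a minimal sketch in Python of what such assignments look like as data. It is not Admixture's own code or file format; the dictionary layout and the helper names `is_valid_row` and `profile_distance` are assumptions for illustration only. The fractions are the ones from the example above.

```python
# Ancestry fractions for K = 2, copied from the example in the text.
# The dictionary layout is an assumption, not Admixture's output format.
q = {
    "A": [0.40, 0.60],
    "B": [0.45, 0.55],
}

def is_valid_row(fractions, tol=1e-9):
    """An individual's fractions must be non-negative and sum to 1."""
    return all(f >= 0 for f in fractions) and abs(sum(fractions) - 1.0) <= tol

def profile_distance(x, y):
    """Half the sum of absolute differences between two profiles:
    0 for identical profiles, 1 for profiles that share no component."""
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))

if __name__ == "__main__":
    for name, row in q.items():
        print(name, row, "valid" if is_valid_row(row) else "invalid")
    print("distance A-B:", profile_distance(q["A"], q["B"]))
```

The point of the distance helper is the same as the point of the proportions themselves: the numbers matter less as statements about ‘real’ populations than as a way to see which individuals resemble each other.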

Since Zack has put his results online, I thought it would be useful to review what patterns have emerged over the past two years, as his sample sizes for some regions are now moderately significant. Though he has K=16 populations, not all of them will concern us, because South Asians do not tend to exhibit many of the components. I will focus on seven: S Indian, Baloch, Caucasian, NE Euro, SE Asian, Siberian and NE Asian. These are not real populations, but the labels tell you the region in which each component is modal. So, for example, the “S Indian” component peaks in southern India. The “Baloch” peaks among the Baloch people of southeastern Iran and southwest Pakistan. The “NE Euro” among the eastern Baltic peoples. The last three are Asian components, running the latitude from south to north to center. They only concern the first population of interest, Bengalis. I will combine these last three together as “Asian.”
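Collapsing the three Asian components into one is just a sum over columns. A rough sketch in Python, assuming each individual's result is held as a mapping from component label to fraction (that layout and the function name `combine_asian` are my assumptions, and the numbers in `example` are made up for illustration, not Harappa results):

```python
# The three component labels that are pooled into a single "Asian" value.
ASIAN_PARTS = ("SE Asian", "Siberian", "NE Asian")

def combine_asian(profile):
    """Return a copy of the profile with the three Asian components
    replaced by one summed "Asian" entry."""
    combined = {k: v for k, v in profile.items() if k not in ASIAN_PARTS}
    combined["Asian"] = sum(profile.get(k, 0.0) for k in ASIAN_PARTS)
    return combined

# Hypothetical fractions, NOT real project data.
example = {
    "S Indian": 0.5,
    "Baloch": 0.25,
    "SE Asian": 0.125,
    "Siberian": 0.0625,
    "NE Asian": 0.0625,
}

if __name__ == "__main__":
    print(combine_asian(example))
```

Summing preserves the total, so the combined profile still adds up to the same amount as the original.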

Below is a table, mostly individuals from Zack’s results (though there are some aggregate results from public data sets). Comments below.


Open Thread, 8/4/2013

I thought it might be useful for new readers to understand a bit about my comments policy and how I’ve come to this stance. Let me give you an example of one individual who occasionally left comments on my blog, often combative, though just on the legitimate side of the trolling boundary. One of this individual's major tactics of argument was to impute to me particular life experiences which he thought I must have had, and which had so shaped my opinions. Though I do not share much about my personal life online, I do go by my “real name,” and over 11+ years of writing on the internet one can construct a rough narrative from stray anecdotes. The key, though, is that this picture is rough. After one exchange where my interlocutor made an inference based on his own perception of various likelihoods about me, I tired of the one-sided game (he was anonymous), and looked him up on Facebook. I left a quick comment to that effect, and asserted that now the scales were somewhat balanced. He never left a comment after that incident.


The shell game of Berkeley's holistic admissions

The title refers to the basic thrust of a piece in The New York Times, Confessions of an Application Reader. The piece ends with a paragraph like so:

Underrepresented minorities still lag behind: about 92 percent of whites and Asians at Berkeley graduate within six years, compared with 81 percent of Hispanics and 71 percent of blacks. A study of the University of California system shows that 17 percent of underrepresented minority students who express interest in the sciences graduate with a science degree within five years, compared with 31 percent of white students.

You may or may not agree with this particular type of admissions policy (I do not, because I do not care if minorities are underrepresented at universities if that underrepresentation is due to transparent academic deficiencies, which I believe to be the case). Rather, I want to focus on the term ‘underrepresented minorities’ and ascertain how underrepresented minorities truly are at Berkeley. That’s easy enough to do. About 80% of UC Berkeley undergraduates are California residents. The Census allows us to query the racial makeup of a range of age brackets for various localities. What I did was look for the percentage of individuals between the ages of 15 and 19 in the 2010 Census for California, approximately the source population of students who are freshmen in the 2012 class.
