
One of the things I (and probably almost anyone) do when reading a paper on population genetics which disaggregates the sample set into discrete elements is look at the number of individuals within each group. In a genetic variation sense there need not be any deep technicalities about power analysis here (though those surely are there). If you have a sample size of ~30 Han Chinese, I know enough about the variation present in Han Chinese to be less worried about this N than about a sample size of ~30 Xhosa, or, to make it even more explicit, ~30 Brazilians. Not only is sample size important, but so is provenance. Brazilians sampled from Rio Grande do Sul are going to be different from those sampled from Bahia. The same worry applies to Han Chinese (e.g., Guangdong vs. Hunan), but to a far smaller degree.
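To make the sample-size and provenance point concrete, here is a minimal sketch in Python. It assumes simple binomial sampling of alleles from diploid individuals, and every frequency in it is hypothetical, picked only for illustration rather than taken from any real Han, Xhosa, or Brazilian data set. The helper names `allele_freq_se` and `pooled_freq` are likewise invented for this example.

```python
import math


def allele_freq_se(p, n_individuals):
    """Binomial standard error of an allele-frequency estimate.

    Each diploid individual contributes two chromosomes, so the
    number of draws is 2 * n_individuals.
    """
    n_chromosomes = 2 * n_individuals
    return math.sqrt(p * (1.0 - p) / n_chromosomes)


def pooled_freq(freqs, proportions):
    """Population-wide frequency as a weighted mean over subgroups."""
    total = sum(proportions)
    return sum(f * w for f, w in zip(freqs, proportions)) / total


# Hypothetical numbers, chosen only for illustration.
N = 30
weakly_structured = [0.28, 0.32]    # subgroup frequencies, equal shares
strongly_structured = [0.10, 0.50]  # same pooled frequency, larger spread

for label, freqs in [("weak", weakly_structured), ("strong", strongly_structured)]:
    p_pop = pooled_freq(freqs, [0.5, 0.5])
    se = allele_freq_se(p_pop, N)
    # Bias if all N individuals come from the first subgroup only.
    bias = abs(freqs[0] - p_pop)
    print(f"{label}: pooled p={p_pop:.2f}, SE(N={N})={se:.3f}, "
          f"provenance bias={bias:.2f} ({bias / se:.1f} SE)")
```

With these made-up numbers the sampling error at N = 30 is the same in both cases, but drawing everyone from a single subgroup shifts the estimate by well under one standard error in the weakly structured population and by more than three in the strongly structured one. That is the sense in which the same N can be fine for one group and worrying for another.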
This came to mind when reading A Genetic Atlas of Human Admixture History, a paper by Hellenthal et al. which showcases the power of modern statistical genetic inference in outlining the dynamics of historical demography. It’s a masterful work, and I’ll try to grapple with the results in a later post, time permitting. But poring over the real paper, the supplements, I came upon this table:
What I want to emphasize here are the rather small sample sizes for the English and Germans. Since genetic distance in Northwest Europe is low, the small N may not be a big deal (i.e., you can swap in French or Norwegian). And the reality is that of course there’s plenty of genotypic data on English and German individuals. But much of this is locked up in biomedical studies where the data can’t be released for more widespread usage (I assume that the PopRes data set had insufficient overlap of marker sets?). In contrast, you have decent sample sizes for obscure Pakistani groups like the Kalash and Burusho. Why? Because of the Human Genome Diversity Project, spearheaded by L. L. Cavalli-Sforza, author of The History and Geography of Human Genes. The HGDP data set is an awesome resource, and because of its anthropological focus it preserves the genetic variation of specific isolated groups. The Kalash of Pakistan, for example, look like they’re going to be forcibly converted to Islam and genetically assimilated within the generation. Late in the last decade the HGDP was even released to the public, so “citizen scientists” can perform their own analyses. Until recently I’d have said The History and Geography of Human Genes was the greatest of L. L. Cavalli-Sforza’s (many!) achievements. But now I’m starting to think that the HGDP may be greater; its easy availability is so taken for granted that we don’t even think about it.

As someone who says what they think and has a bias toward speaking plainly, I am aware that I am open to broadsides from the armies of obscurantism (OK, frankly, I’ve been subject to many attacks over the years despite my relative obscurity; the armies of the darkness see all deviationists). They lack shame and restraint, as is the norm among true believers. They would burn books if they could, that I know. So why do I persist? Because over the long run the arc of history runs toward truth, and if not now, then in the future. Reality is, and I want to see it, understand it, grasp it with my hands, and comprehend it in my bones. One day in the future I’ll be proud to tell my daughter and soon-to-be-born son that I wasn’t craven. Today, in the present, I open my mind and take in the data and results in genetics, more of which have been published in the past 10 years than over the previous 100! If such behavior closes off some avenues of career advancement, so be it. My eyes are open. Man does not live by sinecure alone.

Addendum: If you are interested, The Human Genome Diversity Project: An Ethnography of Scientific Practice was a fair-minded treatment, from what I recall, but I read the book nearly 10 years ago…

