Eurasia + Mozabites + Papuans

I’m in a hurry right now, and won’t be posting much this week. But, I thought I’d dump some of the ADMIXTURE runs I have. This is one with 80,000 markers, and Eurasian populations, Papuans and Mozabites. I removed the New World and Africa to constrain the variance space. This time I’ve labelled the ancestral components, but do not take them totally literally. I think in the future I might just remove the Kalash to see what happens. This is K = 7. Not too busy, but I think enough K’s to separate out the various West Eurasian groups. Additionally I’ve put the genetic distances, Fst, below, and, visualized them on an MDS. Nothing too surprising.

Fst

                   Northeast Asian   South Asian   European   West Asian   Kalash   Southeast Asian   Papuan
Northeast Asian    0                 0.1           0.13       0.142        0.137    0.052             0.225
South Asian        0.1               0             0.058      0.07         0.066    0.097             0.201
European           0.13              0.058         0          0.056        0.075    0.132             0.236
West Asian         0.142             0.07          0.056      0            0.098    0.139             0.238
Kalash             0.137             0.066         0.075      0.098        0        0.136             0.243
Southeast Asian    0.052             0.097         0.132      0.139        0.136    0                 0.214
Papuan             0.225             0.201         0.236      0.238        0.243    0.214             0
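
For anyone who wants to reproduce the MDS step from the Fst table above, here is a minimal sketch in Python. It assumes classical (Torgerson) MDS with the Fst values treated directly as distances; the post does not say which MDS variant or software produced the original plot, and the helper names (`top_eigenpairs`, `classical_mds`) are made up for illustration.

```python
# Classical MDS on the pairwise Fst matrix, in pure Python.
# Assumption: Fst is used directly as a distance; other choices are possible.

LABELS = ["Northeast Asian", "South Asian", "European", "West Asian",
          "Kalash", "Southeast Asian", "Papuan"]

FST = [
    [0.000, 0.100, 0.130, 0.142, 0.137, 0.052, 0.225],
    [0.100, 0.000, 0.058, 0.070, 0.066, 0.097, 0.201],
    [0.130, 0.058, 0.000, 0.056, 0.075, 0.132, 0.236],
    [0.142, 0.070, 0.056, 0.000, 0.098, 0.139, 0.238],
    [0.137, 0.066, 0.075, 0.098, 0.000, 0.136, 0.243],
    [0.052, 0.097, 0.132, 0.139, 0.136, 0.000, 0.214],
    [0.225, 0.201, 0.236, 0.238, 0.243, 0.214, 0.000],
]


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(m, k, iters=500):
    """Power iteration with deflation on a symmetric matrix.

    Returns the k eigenpairs of largest magnitude as (value, vector) tuples.
    """
    n = len(m)
    m = [row[:] for row in m]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, non-constant start
        for _ in range(iters):
            w = matvec(m, v)
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(dist, dims=2):
    """Embed a symmetric distance matrix in `dims` dimensions."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to get a Gram matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(b, dims)
    return [[(max(lam, 0.0) ** 0.5) * vec[i] for lam, vec in pairs]
            for i in range(n)]


coords = classical_mds(FST)
for label, (x, y) in zip(LABELS, coords):
    print(f"{label:16s} {x:+.3f} {y:+.3f}")
```

In the resulting coordinates the Papuans sit farthest from the centroid and the two East Asian groups sit close together, which matches the table. The sign of each axis is arbitrary.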


Culture differences matter (even within Islam)

I’ve been keeping track of events in the Arab world only from a distance. There’s been a lot of excitement on Twitter and Facebook. Since I’m not an unalloyed enthusiast for democracy I’ve not joined in the exultation. But I’m very concerned at what I perceive are unrealistic assumptions and false correspondences. This is a big issue because the public is very ignorant of world history and geography. For example, I was listening to a radio show where Roger Cohen was a guest. Cohen covers the Middle East, so he is familiar with many of the issues to a much greater depth than is feasible for the “Average Joe.” In response to a caller who was an ethnic Egyptian American and a Coptic Christian who was concerned about possible persecution of religious minorities, Cohen pointed to Turkey, which is ruled by Islamists, and has “many” Christians. His tone was of dismissal and frustration. And that was that.

Let’s look more closely. About 5-10% of Egyptians are Christian, with most estimates being closer to 10 than 5. In contrast, the non-Muslim minority in Turkey numbers at most a few percent, with ~1% often given as a “round number.” This low fraction of non-Muslims in modern Turkey is a product of 20th century events. First, the genocide against Armenians cleared out eastern Anatolia. Second, the population exchange between Greece and Turkey in the 1920s resulted in each nation removing most of its religious minorities. The religious minorities which remain in Turkey have been subject to sporadic attacks from radicals (often Turkish nationalists, not Islamists). But from a cultural-historical perspective one of the most revealing issues has been the long-running strangulation of the institution of the Ecumenical Patriarch of the Eastern Orthodox Church by the Turkish republic.

But that’s not the big issue. Rather, it may be that Turkey is a particularly tolerant society in the Muslim Middle East when it comes to religious freedom, and so not a good model for what might play out in Egypt (and has played out in Iraq). This matters because people regularly speak of “secular Egyptians,” “secular Turks,” “Turkish Islamists,” and “Egyptian Islamists,” as if there’s a common currency in the modifiers. That is, a secular Egyptian is equivalent to a secular Turk, and Islamists in Egypt are equivalent to Islamists in Turkey (who have been in power via democratic means for much of the past 10 years). Let’s look at the Pew Global Attitudes report, which I’ve referenced before. In particular, three questions which are clear and specific. Should adulterers be stoned? Should robbers be whipped, or their hands amputated? Should apostates from Islam be subject to the death penalty?


Who are those Houston Gujus?

The figure to the left is a three dimensional representation of principal components 1, 2, and 3, generated from a sample of Gujaratis from Houston, and Chinese from Denver. When these two populations are pooled together the Chinese form a very homogeneous cluster. They don’t vary much across the three top explanatory dimensions of genetic variance. In contrast, the Gujaratis do vary. This is not surprising. In the supplements of Reconstructing Indian population history it was notable that the Gujaratis did tend to shake out into two distinct clusters in the PCAs. This is a finding you see over and over when you manipulate the HapMap Gujarati data set. In reality, there aren’t two equivalent clusters. Rather, there’s one “tight” cluster, which I will label “Gujarati_B” from now on in my data set, and another cluster, “Gujarati_A,” which really just consists of all the individuals who are outside of the Gujarati_B cluster. Even when compared to other South Asian populations these two distinct categories persist in the HapMap Gujaratis.
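
The "one tight cluster versus everyone else" split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the centre of the tight cluster is taken to be the per-axis median of the first two PCs (which only works if the tight cluster holds the majority of individuals), the `radius` cutoff is a hypothetical parameter, and the coordinates are invented. The post does not say what rule was actually used.

```python
from statistics import median


def assign_gujarati_clusters(pcs, radius):
    """Label each (pc1, pc2) point as Gujarati_B (tight cluster) or Gujarati_A.

    Assumption: the tight cluster contains most individuals, so the per-axis
    median lands inside it. `radius` is a hypothetical, hand-picked cutoff.
    """
    cx = median(p[0] for p in pcs)
    cy = median(p[1] for p in pcs)
    labels = []
    for x, y in pcs:
        distance = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
        labels.append("Gujarati_B" if distance <= radius else "Gujarati_A")
    return labels


# Invented PC coordinates: five individuals in a tight knot, two spread out.
pcs = [(0.01, 0.00), (0.00, 0.01), (-0.01, 0.00), (0.00, -0.01), (0.00, 0.00),
       (0.08, 0.03), (0.12, 0.05)]
labels = assign_gujarati_clusters(pcs, radius=0.03)
print(labels)
```

Because Gujarati_A is defined only as "not in Gujarati_B", it is a residual category rather than a cluster in its own right, which is the point made in the paragraph above.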

Zack has already identified a major difference between the two clusters: Gujarati_A has some individuals with much more “West Eurasian” ancestry. To be more formal about this in the future I simply assigned individuals in my merged data set to one of the two Gujarati clusters based on their position in the first two PCs. Yesterday night I ran ADMIXTURE K = 2 to 10, with 75,000 SNPs. I also removed the Native American groups, and added more European and East Asian samples from the HapMap. Below are some populations at K = 4:


Social Class and Smoking

The New York Times highlights the issue of hospitals opting not to hire smokers. It’s not clear how many places of employment are really banning smoking (or even how strictly such regulations will be enforced), but certainly there have at least been some high-profile cases (e.g., the Cleveland Clinic). One question that comes immediately to mind is: who still smokes? The NYT article includes this bit:

But the American Legacy Foundation, an antismoking nonprofit group, has warned that refusing to hire smokers who are otherwise qualified essentially punishes an addiction that is far more likely to afflict a janitor than a surgeon. (Indeed, of the first 14 applicants rejected since the policy went into effect in October at the University Medical Center in El Paso, Tex., one was applying to be a nurse and the rest for support positions.)

I had the impression that the remaining prevalence of smoking is strongly stratified by social class, geography, and education, and found that this study from the CDC confirms as much:

One of the biggest predictors of smoking is education. Interestingly, the least educated (<8 years) have a lower smoking rate than average, particularly if female. This rises with more education, peaking with GED holders (42%) and falling to a low of 7.2% for graduate degree holders. This confirms the pattern, seen elsewhere, that credentials matter as well as years of education. That is, even among the group with 12 years of education, there is a large variance between those without a diploma (31%), those with a GED (42%), and those with a diploma only (25%). Similarly, there is a large difference between some college (23%, similar to HS grads) and getting a college degree (12%).

There is also a strong gender disparity among Asians: the smoking rate for Asian men is not much lower than the average (19%), yet among women, being Asian has an effect comparable to having a graduate degree (6.5% v. 6.4%). Asian countries also have these stark gender differences when it comes to smoking rates.

Income is another big factor: a 24% smoking rate above the poverty line, 33% below. I checked this out a little more in the GSS. Here, you sometimes see the "inverse U" pattern as with education. The smoking rate for 1991 stays under 33% for the first few thousand dollars, goes up to the 40s-50s for the next few thousand, and then falls to 17% for the $75k+ crowd.

Here's political affiliation:

I’ve seen this pattern a few other places as well — Independents differ on some criteria from both Republicans and Democrats. Their lack of a coherent political ideology is indicative of other traits.

Anyway, it does seem that the class concerns of a smoking ban are somewhat warranted. This is a policy unlikely to affect the doctors, surgeons, or administrators at hospitals — while it will act as a much stronger burden on less educated support staff (who are of course facing substantially higher unemployment rates now anyway).

D.I.Y. population structure inference, part 1 of many

If you’ve been reading this weblog for a while you’ve seen many images like the one above. It comes from the 2008 paper Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. The data set is from the Human Genome Diversity Project. It consists of 52 groups from around the world, curated for representativeness, but also ethnic distinctiveness. They utilized the FRAPPE program, which like STRUCTURE and ADMIXTURE estimates the ancestry of individuals (and in the aggregate populations) from a combination of components, the number of which you specify with the parameter K. In other words, this is model-based. It works out really well when you have an intuition of the model you’re looking for. Imagine African Americans, who you can presume are a two-way admixture between two distinct ancestral populations. It works less well in other cases. For example, South Asians are modeled by 23andMe as a two-way admixture between Europeans and East Asians. Why this occurs is totally comprehensible; they have three (Chinese + Japanese = one) reference populations which are very different from South Asians. So the computer, being dumb but fast, simply slaps together the best inference possible from the weird constraints placed upon it. Garbage in, garbage out.
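
To make the two-way admixture case concrete, here is a toy sketch of the underlying idea in Python. It is not how FRAPPE, STRUCTURE or ADMIXTURE are implemented: those programs estimate the ancestral allele frequencies and the individual fractions together for K components, whereas this sketch assumes the two ancestral frequency vectors are already known and only estimates one individual's mixing fraction `q` by a grid search over the binomial likelihood. All data are simulated, and the function names are invented.

```python
import math
import random


def log_likelihood(q, genotypes, freq_a, freq_b):
    """Log-likelihood of genotypes (0/1/2 copies of an allele) given that a
    fraction q of ancestry comes from population A and 1 - q from B."""
    total = 0.0
    for g, pa, pb in zip(genotypes, freq_a, freq_b):
        f = q * pa + (1.0 - q) * pb  # expected allele frequency at this marker
        # Binomial(2, f) log-probability, dropping the constant term.
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def estimate_q(genotypes, freq_a, freq_b, steps=100):
    """Grid-search maximum-likelihood estimate of q on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freq_a, freq_b))


# Simulated example: 3000 markers, one individual with 80% ancestry from A.
rng = random.Random(0)
n_markers = 3000
freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
true_q = 0.8
genotypes = []
for pa, pb in zip(freq_a, freq_b):
    f = true_q * pa + (1.0 - true_q) * pb
    genotypes.append(sum(rng.random() < f for _ in range(2)))

q_hat = estimate_q(genotypes, freq_a, freq_b)
print("estimated q:", q_hat)
```

The same machinery will happily return a best-fitting `q` even when neither reference population is a sensible ancestor of the individual, which is the "garbage in, garbage out" problem described above.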

But along with PCA these sorts of algorithms which allow one to visualize variance across hundreds of thousands of markers across hundreds of individuals are very useful (though perhaps there are mentats amongst us who have no need of such techniques). You just have to use them with caution. Information may be free, but it can be misinterpreted!

Over the years of my blogging people have regularly asked questions of the form “are East Africans more closely related to West or South Africans?” They are easily answered, I would just look in the literature. But, it did take time, and I’d have to pick the right figure, look for Fst, and so forth. But that is changing.

The Nature piece “The rise of the genome bloggers” covered the change. Since last fall BGA and Dodecad have been dumping lots of bar plots and PCAs on the web. Instead of looking for a paper, I have now begun to use those sites as my resource of first choice (since they’re well indexed by Google). Now with HAP you have another source of information. It’s gotten to the point that technically capable commenters are now submitting their own results!

We’ve come a long way. Academics are not miserly with information, and some of my best friends have been the gatekeepers of the data and results. But now you can find the data on the web easily. You can reprocess the data by yourself. And, you can do the analysis yourself.

I’ve been sitting back for a while, letting Dienekes, Zack, etc., do their thing. There are so many technically fluent people out there, I’ve enjoyed just consuming the raw information yield. But that ends today. Over the past week I’ve been slapping together some R functions to make it easier for me to generate bar plots at various K’s, as well as PCA’s. My goal is this: a reader asks a question, and I quickly constrain my data set appropriately and do the analyses, take the screenshots, and upload them to the servers here, and point them to the images in the comments. The main constraint should be the computational resources (ADMIXTURE can take hours). Yes, that’s where we’re at.
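
The R functions themselves are not shown here, so what follows is only a rough Python sketch of one step such helpers typically need: collapsing per-individual ancestry fractions into population averages before drawing a bar plot. It assumes the fractions arrive as whitespace-separated text with one row per individual and one column per component (which is how I understand ADMIXTURE's Q output to be laid out), and that a parallel list of population labels is available. The function name and the example numbers are invented.

```python
from collections import defaultdict


def population_means(q_text, populations):
    """Average per-individual ancestry fractions by population label.

    q_text: whitespace-separated rows, one per individual, K columns each.
    populations: one label per row, in the same order as the rows.
    """
    rows = [[float(x) for x in line.split()]
            for line in q_text.strip().splitlines()]
    if len(rows) != len(populations):
        raise ValueError("need exactly one population label per row")
    sums = {}
    counts = defaultdict(int)
    for row, pop in zip(rows, populations):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        sums[pop] = [s + v for s, v in zip(sums[pop], row)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}


# Invented K = 2 example with three individuals from two populations.
example_q = """
0.9 0.1
0.7 0.3
0.2 0.8
"""
means = population_means(example_q, ["PopA", "PopA", "PopB"])
print(means)
```

Population averages are easier to read at a glance, but they hide within-group spread, so the individual-level bars are still worth looking at.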

Every now and then I’m going to put up a post of ADMIXTURE bar plots or MDS/PCA’s. Part of the reason is that it will be useful for my later reference. Second, I think the slide show display view is probably pretty useful to get a gestalt sense of what’s going on. That’s what we’re going for: human comprehension. Below is my first slide show, from K = 2 to K = 16. That is, the models assumed two to sixteen ancestral populations. I also excluded Sub-Saharan Africans from the data set since they’re so varied. Here are the details:


A problem of aggregation of information

In a post below I regenerated the HGDP PCA plot you’ve probably seen around, except that I added my parents (and a few HapMap populations) into the plot. The PCA below was basically a visualization of the two largest independent dimensions of genetic variance in the data set. It wasn’t to scale, as the vertical African vs. non-African dimension is somewhat greater than the horizontal west vs. east dimension in magnitude. But, I argued that the positioning of my parents was deceptive as to their heritage. In the comments John Emerson offers a hypothesis to salvage the possibility that the PCA is telling us something informative about my parents’ relationship to the Uyghurs:

Was there ever an endogenous Mughal group in South Asia? If both your parents distantly came from that group, even though assimilated to the local populations for a few generations, the Uighur connection would be unremarkable.

Combining the plot I generated with the historical information this is an eminently plausible model. But we need to consider what the PCA is showing us. The position of my parents is a reflection of their average genetic variance in relation to two reference points. It doesn’t tell us necessarily about the constituents of that variation. By analogy, consider that the average of 2 and 4 is 3. But the average of 1 and 5 is also 3. This is basically what’s going on with my parents’ position on the plot. They, and the Uyghurs, converged on the same position through different routes.
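
The averaging analogy can be written out as a small worked example in Python. Everything in it is hypothetical: the positions of the three source groups along a single west-to-east axis are invented, and treating a mixed individual's coordinate as the weighted mean of the source coordinates is a simplification, not a description of what Eigensoft computes.

```python
import math


def pc_position(fractions, source_positions):
    """Weighted mean of source positions along one axis (a simplification)."""
    return sum(share * source_positions[name] for name, share in fractions.items())


# Invented coordinates on a single west-to-east axis.
source_positions = {"west": 0.0, "south": 0.375, "east": 1.0}

# Two very different compositions...
half_west_half_east = {"west": 0.5, "east": 0.5}
mostly_south_some_east = {"south": 0.8, "east": 0.2}

pos_one = pc_position(half_west_half_east, source_positions)
pos_two = pc_position(mostly_south_some_east, source_positions)

# ...that land on (numerically) the same spot.
print(pos_one, pos_two, math.isclose(pos_one, pos_two))
```

A single coordinate cannot distinguish the two histories; it takes additional dimensions, or a different set of reference populations, to pull them apart.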

But let’s dig deeper into the data. First, I generated a PCA plot with just Eurasian populations. That means that the largest dimension, which separates Africans from non-Africans, no longer exists with this data set. So you have a free dimension to work with. Here’s what we get:


"Inadvertent" incest detection?

Ruchira and Randall Parker point me to a new story about routine genomics screens detecting first degree incest:

Beaudet wrote in the letter that “clinicians uncovering a likely incestuous relationship may be legally required to report it to child protection services and, potentially, law enforcement officials” since the pregnancy might have occurred “in the setting of sexual abuse.”

The letter was prompted by a Baylor laboratory’s discoveries that developmental disorders in a number of pediatric patients were caused by incestuous relations not previously disclosed to doctors.

The testing is done to find the disorder’s genetic basis, typically involving mutations, deletions or duplications. But large blocks of identical DNA are evidence the child’s parentage involved first-degree relatives.

If you’re curious if you are the product of incest, David Pike’s runs of homozygosity (ROH) detector (Mozilla only) might be useful if you have a genotype file. Sometimes I wonder if mass technology is going to come to fruition far earlier than it takes to write up editorials and publish them.

Of course even people who are not the product of first-degree incest can have very long ROH. Look at Zack (though this is in part a legacy of the consanguineous marriage practices common in his parents’ community). And there will be many people who are going to get their siblings typed (as I did), so amongst the intelligent set cuckoldry, rare as it is, will be exposed.
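
For readers wondering what a runs-of-homozygosity scan does at its simplest, here is a minimal Python sketch. It is not David Pike's detector and makes strong simplifying assumptions: genotypes are given as an ordered list of two-letter calls along one chromosome, a run is any stretch of consecutive homozygous calls, and length is counted in markers. Real tools generally work with physical or genetic distance and tolerate the occasional miscalled or missing site.

```python
def runs_of_homozygosity(genotypes, min_len):
    """Return (start, end) index pairs, end exclusive, for every stretch of
    at least `min_len` consecutive homozygous calls.

    genotypes: ordered two-letter calls such as "AA" or "AG"; anything that
    is not a pair of identical bases (including no-calls) breaks a run.
    """
    runs = []
    start = None
    for i, call in enumerate(genotypes):
        homozygous = len(call) == 2 and call[0] == call[1] and call[0] in "ACGT"
        if homozygous:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the list.
    if start is not None and len(genotypes) - start >= min_len:
        runs.append((start, len(genotypes)))
    return runs


calls = ["AA", "GG", "AG", "CC", "CC", "TT", "AA", "CT", "GG"]
roh = runs_of_homozygosity(calls, min_len=3)
print(roh)  # [(3, 7)]
```

On real chip data the threshold would be far larger than three markers, since short homozygous stretches occur by chance in everyone.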

D.I.Y. PCA

Long time readers know that I have a fixation on people not taking PCA too literally as something concrete. Tonight I finally merged the HGDP data set with some of the HapMap ones I’ve been playing with, and tacked my parents onto the sample. I took the ~50 HGDP populations, added the Tuscans, the two Kenyan groups, and the Gujaratis, and merged them. I thinned the marker set to 105,000 SNPs (I had to flip the HGDP strand too). Then I just let Eigensoft do its magic, and 2 hours on I produced my own plot. I’m still getting the hang of the labeling issues, but first let’s look at what 23andMe produces (I’m green):

Now let’s see what I outputted:

I suspect that the gap between my parents and the main South Asian cluster is just an artifact of the lack of South and East Indians in the sample. Additionally, things would look different if I removed the Africans, since the first principal component would be freed up. More on that later. All in all, still pretty awesome that circa 2011 this sort of thing is just an evening’s concentration.
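
The strand flip and marker thinning mentioned above are the two fiddly parts of merging data sets, so here is a hedged Python sketch of what each involves. The post does not say which tools or thinning rule were used; keeping every n-th marker is just one simple option, and the function names are invented for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Report a SNP's alleles on the opposite DNA strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(alleles):
    """A/T and C/G SNPs look the same on both strands, so flipping cannot
    tell you which strand a data set used for them."""
    first, second = alleles
    return COMPLEMENT[first] == second


def thin_markers(snp_ids, keep_every):
    """Keep every `keep_every`-th marker (assumed rule, not the post's)."""
    return snp_ids[::keep_every]


print(flip_strand(("A", "G")))          # ('T', 'C')
print(is_strand_ambiguous(("A", "T")))  # True
print(thin_markers(["rs1", "rs2", "rs3", "rs4", "rs5"], 2))  # ['rs1', 'rs3', 'rs5']
```

Strand-ambiguous SNPs are often simply dropped before a merge, since a silent strand mismatch would show up as spurious differences between the data sets.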