A new paper in Nature Genetics, Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery, is both interesting and important. But, as with the paper on the Andaman Islander genomes, it starts out with a naive and misleading use of model-based clustering to frame the later results. Here’s a major offending section:
The least admixed samples were found in the NWA, AP, and PP subregions, suggesting that populations in these regions are derived from founder populations, but there was evidence of inter-regional variation in GME-specific components, suggesting the occurrence of local admixture (Fig. 1b) and potentially supporting historical events. The NWA component was found in regions from west to east across North Africa, likely representing the Berber genetic background…The AP component likely represents ancestral Arab populations and was observed in nearly all regions, possibly as a result of the Arab conquests of the seventh century coincident with the expansion of the Arabic language now spoken over much of the GME. Similarly, the Persian expansion into the TP and SD regions and parts of NEA in the fifth century was the most likely contributor of the PP signal.
Patterns of human migration and drift were recapitulated using TreeMix for GME subregions, on the basis of 1000 Genomes Project control populations…The inferred tree with no migration showed tight clusters for European and Asian populations but much greater apparent divergence among subjects from GME regions. The ordering of the GME subregional populations from the root corroborated much of the ‘out-of-Africa’ ordering of subsequent founder populations…For GME populations, distance from the root emulated the west-to-east organization of GME samples, with the PP population showing the largest inferred drift parameter, supporting a west-to-east trajectory of human migrations.
First, you can’t assume that a population which is nearly fixed for a cluster, K, is actually not admixed. If you don’t have enough variation within your data set, then the ‘least admixed’ populations will come out looking like the reference for that cluster, even though they themselves are admixed.
Second, I am quite open to the idea that the Arab conquests of the 7th century were demographically significant, but these results don’t show that. The Tuscan population is not 25% Arab due to the Arab conquests. Additionally, Arabs did not permanently alter the interior of Anatolia. Their raids went rather far to the west, such as the one on Amorium, but the high water mark of Arab rule in relation to the Byzantines, arguably in the decades around ~800 A.D., simply resulted in a “no man’s land” along the borders (though some Semitic peoples of Christian background, some of them Arabic speaking, did migrate into Byzantium). Similarly, the Persian-Pakistani modal cluster really has nothing to do with the Persian Empire.
This is not a big deal, but these passages are just silly. They’re wrong on the face of it. But the “peer reviewers” that Nature Genetics assigned to this paper were probably not well versed in human historical phylogenomics. Probably they saw that the methods were sound in the broadest sense (e.g., ADMIXTURE, TreeMix, PCA, etc., are all fine methods), and were unaware that the inferences made were totally wrong. Anyone who had read Lazaridis et al.’s The genetic structure of the world’s first farmers would see that these passages needed to be revised. The clusters in the admixture analysis above are to a great extent artifacts (useful ones for GWAS, but still artifacts). The historical inferences made have little basis in reality.
Third, the genetic pattern of variation above has nothing to do with the “out-of-Africa” migration. Rather, it has to do with the fact that there is cryptic Sub-Saharan African admixture even in the “pure” samples from some regions, because Sub-Saharan admixture is rather well mixed in some groups (e.g., in Northwest Africa). The cline is less about “out-of-Africa,” and more about a cline of African ancestry. These patterns of variation have literally nothing to say about the “out-of-Africa” migration. The whole passage should have been excised.
It’s a shame that there’s all this wrong stuff in the paper. I’m a big fan of Jean-Laurent Casanova because his medical genetics is going to make a difference in lives, and his hairdo is awesome. Andy Clark is on the paper, and he’s my St. Jerome for having co-authored Principles of Population Genetics. I feel a little ridiculous making these criticisms, but I think I’m right, and it’s a shame that the authors didn’t have anyone who knew enough human population genomics to fix this portion of the paper, and it’s a shame that Nature Genetics couldn’t find peer reviewers to steer them in the right direction.
Aside from the random wrong historical inference stuff, the paper is kind of a big deal (I think Nature Genetics worthy, but I don’t know anything about this stuff in regards to publications). It confirms in the broadest outlines a lot of what we knew. The further you go from Africa, the less genetically diverse populations get when you look at polymorphism diversity. Native Americans have fewer segregating polymorphisms than Eurasian populations, for example. One way to model this is as serial bottlenecks out of Africa. I think that’s too simple a picture, as there has been a lot of gene flow and admixture over the last 10,000 years, but on the coarsest of all scales it’s not totally misleading.
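To make the serial bottleneck idea concrete, here’s a toy sketch using the textbook expectation that heterozygosity decays by a factor of (1 - 1/(2N)) per generation in a population of effective size N. The function names and the founder sizes are my own made-up illustration, not anything estimated in the paper.

```python
def heterozygosity_after(h0, n_eff, generations):
    """Expected heterozygosity after drift at constant effective size n_eff."""
    return h0 * (1.0 - 1.0 / (2.0 * n_eff)) ** generations


def serial_bottlenecks(h0, bottlenecks):
    """Apply a sequence of (effective size, generations) founder events in order.

    Returns the heterozygosity before the first event and after each one.
    """
    trajectory = [h0]
    for n_eff, generations in bottlenecks:
        trajectory.append(heterozygosity_after(trajectory[-1], n_eff, generations))
    return trajectory


# Three hypothetical founder events: 100 individuals for 20 generations each.
steps = serial_bottlenecks(1.0, [(100, 20)] * 3)
print(steps)
```

Each successive founder event knocks diversity down a bit more, which is the coarse pattern of declining polymorphism with distance from Africa. Gene flow and admixture, which this sketch ignores entirely, are why the real picture is messier.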
But a peculiar aspect of these dynamics is that when you look at runs of homozygosity in the genome, which usually measure more recent inbreeding, the Middle East and South Asia tend to have lower genetic diversity. To get a sense of South Asian populations, you can read The promise of disease gene discovery in South Asia. Because of caste/jati endogamy, a lot of the South Asian groups have less genetic diversity than you might expect. This has disease implications.
Middle Eastern, North African, and Pakistani populations are even more extreme. You can see it in the figure above. Across short runs of homozygosity the results converge onto what you’d expect, roughly. But Middle Eastern populations are a huge anomaly at long runs. That’s because of this:
From 20–50% of all marriages in the GME are consanguineous (as compared with <0.2% in the Americas and Western Europe)1, 2, 3, with the majority between first cousins. This roughly 100-fold higher rate of consanguinity has correlated with roughly a doubling of the rate of recessive Mendelian disease19, 20. European, African, and East Asian 1000 Genomes Project populations all had medians for the estimated inbreeding coefficient (F) of ~0.005, whereas GME F values ranged from 0.059 to 0.098, with high variance within each population (Fig. 2c). Thus, measured F values were approximately 10- to 20-fold higher in GME populations, reflecting the shared genomic blocks common to all human populations. F values were dominated by structure from the immediate family rather than historical or population-wide data trends (Supplementary Fig. 8). Examination of the larger set of 1,794 exomes that included many parent–child trios also showed an overwhelming influence of structure from the immediate family, with offspring from first-cousin marriages displaying higher F values than those from non-consanguineous marriages (Fig. 2d).
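As a back-of-the-envelope check on those F values, here’s a minimal sketch of the standard pedigree path-counting calculation. The function name and the way I encode the pedigree are my own illustration, not the authors’ method, and it assumes the common ancestors are themselves not inbred.

```python
def inbreeding_coefficient(paths):
    """Wright's path-counting formula, assuming non-inbred common ancestors.

    `paths` is a list of integers, one per loop through a common ancestor.
    Each integer counts the individuals on the loop connecting the two
    parents, with both parents and the common ancestor included.
    """
    return sum(0.5 ** n for n in paths)


# First cousins share two grandparents. Each loop passes through
# cousin -> cousin's parent -> shared grandparent -> other cousin's parent
# -> other cousin, which is 5 individuals.
first_cousin_f = inbreeding_coefficient([5, 5])

baseline_f = 0.005  # the ~0.005 median F quoted for 1000 Genomes populations
fold = first_cousin_f / baseline_f
print(first_cousin_f, fold)
```

The expected F for the child of first cousins is 1/16, or 0.0625, which sits inside the 0.059 to 0.098 range reported for the GME and is about 12.5-fold the ~0.005 baseline. That’s consistent with immediate family structure driving the signal.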
The authors masked the alleles which were part of the reason that individuals were included in the data set in the first place (to prevent ascertainment bias). Instead, they focused on genome-wide patterns of loss-of-function and derived alleles. Because they were looking at many low-frequency variants, they naturally found a lot of new variation, totally unobserved in European-dominated genetic data sets. This is why bringing genomics to the world is kind of a big deal.
For me this was the most interesting, and sad, result:
Despite millennia of elevated rates of consanguinity in the GME, we detected no evidence for purging of recessive alleles. Instead, we detected large, rare homozygous blocks, distinct from the small homozygous blocks found in other populations, supporting the occurrence of recent consanguineous matings and allowing the identification of genes harboring putatively high-impact homozygous variants in healthy humans from this population. Applying the GME Variome to future sequencing projects for subjects originating from the GME could aid in the identification of causative genes with recessive variants across all classes of disease. The GME Variome is a publicly accessible resource that will facilitate a broad range of genomic studies in the GME and globally.
The theory is simple. If you have inbreeding, you bring together deleterious recessive alleles, and so they get exposed to selection. In this way you can purge the segregating genetic load. It works with plants. But humans, and complex animals in general, are not plants. More precisely, the authors “compared the distributions of derived allele frequencies (DAFs) in GME and 1000 Genomes Project populations.” If the load were being purged, the frequency of deleterious alleles should be lower in the inbreeding populations. It wasn’t.
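The “exposed to selection” step can be sketched with the textbook single-locus result: with inbreeding coefficient F, recessive homozygotes occur at frequency q^2 + F*q*(1 - q) rather than q^2. The allele frequency below is a made-up illustrative value, not one from the paper, and the function name is mine.

```python
def recessive_homozygote_freq(q, f):
    """Frequency of recessive homozygotes at allele frequency q and inbreeding f."""
    return q * q + f * q * (1.0 - q)


q = 0.01  # illustrative allele frequency, not a value from the paper
outbred = recessive_homozygote_freq(q, 0.0)
cousins = recessive_homozygote_freq(q, 1.0 / 16.0)  # offspring of first cousins
print(outbred, cousins, cousins / outbred)
```

With these made-up numbers the affected genotype is roughly seven times more common among children of first cousins. That extra exposure is what purging relies on, and it is also why the disease burden goes up in the meantime.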
Middle Easterners should stop marrying cousins to reduce the disease load. But that’s just a recommendation. Some of these nations, like Qatar, have a lot of money to throw at Mendelian diseases. Perhaps they’ll use preimplantation genetic diagnosis? I don’t know.