It’s been a while since I updated the South Asian Genotype Project. Well, I updated almost everyone (‘projectmembers v2’ tab is one what you want). A few people had strangely formatted text files, so I’ll go add them tomorrow. Thanks to everyone who has submitted so far!
One of the main things that I’ve been curious about is undersampled groups. I finally got an Uttar Pradesh Kayastha in the data set (well, technically my second…but the first is a friend). I also got a submission of a Bengali Brahmin with origins in the west, and another in the east (in fact, from Comilla, which is where my own family is from). And, I got the submission of another West Bengali Kayastha.
Finally, I got another Maharashtra Kayastha.
If you click the image above you see some obvious things:
Bengali Brahmins don’t seem to be geographically structured. The eastern and western individuals are near each other on the PCA. Additionally, they are very close to Uttar Pradesh Pradesh Brahmins. Not the main Bangladesh cluster.
In contrast, the West Bengal Kayastha is positioned close to the Bangladeshis, though outside of that particular cluster.
In other words, to some extent Bengal’s landscape reflects both aspects of the South Asian genetic variation: it is strongly structured by caste, and, geography also plays a role. People from western Bengal have less East Asian ancestry and more affinity with peoples to the west on the Gangetic plain. But Bengali Brahmins are genetically entirely dissimilar from other Bengalis.
The dissimilar position of Kayastha groups across South Asia is in contrast to Brahmins. Though Brahmin groups in Bengal and South India seem to have mixed with local groups (they are always somewhat shifted to the regional substrate), overall their genetic character indicates shared common ancestry. In contrast, the different Kayastha groups seem much more likely to be a case of local populations who arose to fill a particular occupational niche that emerged with polities which required a bureaucratic class.
I’ve been looking at the data from the recent Munda paper. Standard stuff, admixture, treemix, and f-statistics.The northern Munda samples were collected in Bangladesh. So I thought: I can test the hypothesis that the East Asian ancestry in Bangladesh is to a large part Santhal. After looking at it every which way, I think that in fact, the Munda may not have ever been very populous in much of northeast India. The Santhal is just not a good donor population to Bengalis, at least not when comparing mixes such as Dai + Tamil.
Additionally, the Santhal are really not that well modeled by mixing South Asians with any particular Southeast Asian group, though it works. I think that’s suggestive of the possibility that the Austro-Asiatic group which gave rise to the Munda don’t exist in their current form anywhere in Southeast Asia. Additionally, the Lao samples that are provided in the new paper I think may have Indian ancestry via admixture from Austro-Asiatic Mon or Khmer groups.
Basically, there is so much bidirectional gene flow that I think it’s really hard to get a grip on what’s going on. Additionally, the Burmese and northeast Indian populations (e.g., the Mizos) clearly have a strand of ancestry that derives from relatively recent migrants that came down from the region of eastern Tibet, and perhaps Sichuan or even further north. And this component shows up in Bengalis as well.
On top of this, there is the “Australo-Melanesian” substrate that is present all across Southeast Asia, and probably was present in modern southern China in the early Holocene, which has distant affinities with the “Ancient Ancestral South Indians” (AASI).
At this point, I keep my own counsel. But there may be an interesting story to tell related to how efficient and effective different forms of agriculture were, and how that interplayed with genes and language.
Reading Indonesia: Peoples and Histories. I selected it because unlike many books it wasn’t incredibly skewed to the early modern and postcolonial period. The author makes the interesting point that the Islamicization of western Indonesia and the rise of the great Javanese Hindu kingdom of Majapahit occurred around the same time. This, in contrast to the skein of Indic civilization which had been layered over maritime Southeast Asia for hundreds of years before the medieval period, starting around 500 AD with polities such as that of Kalingga.
As is usual in these sorts of books, it is emphasized that Indian civilization spread through cultural diffusion (in contrast to the fact that though Chinese trade was evident and present early on, the cultural impact was minimal). Any migrations are dismissed as legends, with the possible exception of a few elite religious functionaries.
I now believe this is wrong. I’ve discussed this extensively in the past, but the Singapore Genome Variation Project (SGVP) data set along with more Southeast Asians allows me to illustrate rather clearly the issues. The short of it is that it is highly likely that substantial South Asian ancestry exists within Southeast Asia, and that that ancestry is not just a function of colonial contact (e.g., as certainly occurred in Malaysia).
I got a question about endogamy and Bangladeshis on of my other weblogs, as well as their relatedness to western (e.g., Iranian) and eastern (e.g., Southeast Asian) populations. Instead of talking, what do the data say? Most of you have probably seen me write about this before, but I think it might be useful to post again for Google (or Quora, as Quora seems to like my blog posts as references).
The 1000 Genomes project collected samples a whole lot of Bangladeshis in Dhaka. The figure at the top shows that the Bangladeshis overwhelmingly form a relatively tight cluster that is strongly shifted toward East Asians. There is one exception: about five individuals, several of which were collected right after each other (their sample IDs are sequential) who show almost no East Asian shift.
In the 1000 Genomes, there is a Punjabi dataset. Here is the description:
These cell lines and DNA samples were prepared from blood samples collected in Lahore, Pakistan. The samples are from a mix of parent- adult child trios and unrelated individuals who identified themselves and their parents as Punjabi.
A few years ago I did an analysis of the population structure in the 1000 Genomes dataset. In the Chinese data, there seemed to be some curious structure (there were two clusters of South Chinese). But the biggest issues predictably were in the South Asians. To give concrete examples, there were a few Brahmins in the Telugu data. A subset of Tamils and Telugus were highly ASI shifted. The Gujurati were highly heterogeneous, and one subcluster were almost certainly Patels (the samples were collected in Houston). The ASI shifted groups were almost certainly Scheduled Castes (Dalits) because I could see that they clustered with those samples from Estonian Biocentre dataset.
There was something curious about the samples from Pakistan and Bangladesh. Aside from a small number of individuals, whose samples were collected at the same time judging by their IDs (these individuals cluster with Scheduled Castes), the Bangladeshi sample didn’t have much South Asian style structure. That is, there wasn’t a cline or lots substructure within the ethnicity.
As noted by some commenters, the Punjabi samples were very different. Like the Gujurati samples, there was a huge variance along the ANI-ASI cline. To me, this was somewhat surprising. To make the 1000 Genomes more useful I used PCA and divided both Gujuratis and Punjabis into groups based on their position on the ANI-ASI cline. So that ANI_1 is the subpopulation with the most ANI and ANI_4 the least.
Using Treemix produced some weird results. As you can see above Punjabi_ANI_1 looks like an Iranian population with gene flow from Punjabi_ANI_3. Punjabi ANI_2 looks like a North Indian population with Iranian gene flow (so it is more ASI). Punjabi_ANI_3 are less ANI shifted than Uttar Pradesh Brahmins, but more than Uttar Pradesh Kshatriya. Finally, Punjabi_ANI_4 actually is very similar to Punjabi_ANI_2, except it has gene flow from a Dalit-like population.
I don’t know what’s going on here. Is this really caste-like structure in Punjab? Or are we see lots of admixture of people who are called “Punjabi” today? For example, the gene flow edges suggest lots of mixing between quite South Asian types of groups and an Iranian sort. Perhaps this is the absorption of Pathans into South Asian groups? Could it be Muhajir people who mixed with local Punjabis and identified as such?
I was curious to see if I could find something similar in relation to the three Jatts. As you can see with Treemix, no. Jatts are just very ANI-shifted. I added Lithuanians and Georgians, and you can see that Uttar Pradesh Brahmins get gene flow from a Lithuanian shifted group, while South Indian Brahmins have a more Georgian gene flow. This is just an artifact I suspect of the fact that South Indian Brahmins have a lot of admixture from non-Brahmin South Indians, who are more Georgian than Lithuanian (Iran_N as opposed to Yamnaya).
Finally, going back to the Bengali (Bangladeshi) vs. Punjabi contrast, it is really interesting. If Punjab has such deep caste-like structures it really goes to show how within South Asia caste is a very very powerful institution, and ~1,000 years of Muslim rule and in western Punjab a majority Muslim population did not break down the institution. In contrast, in Bangladesh, there doesn’t seem to be much caste structure. I am routinely the most East Asian shifted Bengali in datasets, but my family is also from the eastern edge of eastern Bengal. Why the difference?
in The Rise of Islam and the Bengal Frontier the author posits that the Islamicization of eastern Bengal was to a great extent the function of the opening up of lands for cultivation under the supervision of Muslim elites under the rule of Afghans and later Mughals. This would explain the lack of caste structure because presumably, caste structure would be difficult to maintain in a frontier landscape, where the cultural elite does not promote or accept caste (though the elite West Asian Muslims were racially exclusive, they were also a very small minority).
In contrast, the Punjab has long been settled by Indo-Aryan peoples, and despite its long history of Islam, it was not recently a frontier society.
Anyway, that’s all I got to say for that. I’m sure readers will have more insight on this pattern than I do….
I’ve been working on the South Asian Genotype Project. Again, if you are interested: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com.
In the subject please put:
“South Asian Genotype Project”
The state/province your family is from
If applicable, caste
I changed the reference populations because the earlier ones were too complicated. You can see the population averages from public data sets for some groups. The results for project members are here. I re-ran everyone who has sent data in so far. I’ll leave commentary for later.
At this point, I think the easiest way to update project members is to create a mailing list. If you are have submitted genotypes, please join:
Just a quick update. I know I haven’t been responsive, but I’ve been traveling and spending time with the family and working a lot for the past few weeks. I’m going to make some revisions to my pipeline as well. I will get back to generating results soon (as in a week or so). So please keep sending data to [email protected].