
Over the past ten years, high-density SNP-chip arrays have changed these plots: instead of representing only the relationships among populations, they can now often show the position of each individual. The methods differ (principal components analysis, principal coordinate analysis, multi-dimensional scaling), but the outcomes are broadly the same.

This is clear when you move from a manageable number of populations (e.g., Europeans) to the whole world. In these cases you have to color in specific regions, or else you’d get lost rather quickly. I can illustrate this easily enough. I have a data set I’m running right now with ~3,000 individuals and 250,000 SNPs. It’s a merge of HGDP, Behar et al., HapMap, etc. I decided to use PLINK to generate an MDS plot.

Most of the text is basically illegible. This is where a centroid method would do well: in lieu of a scatter of individuals, you just label each population once. Or you could let points in various colors represent populations, but put the labels at the centroids only. This still runs into the problem that populations are not equidistant, so you can still get crowding.
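To make the centroid idea concrete, here is a minimal sketch in base R. The toy data and the column names C1, C2, and Group are assumptions for illustration, not the actual merged data set.

```r
# Toy MDS-style data; column names C1, C2 and Group are assumed
set.seed(1)
mds <- data.frame(
  C1 = c(rnorm(20, 0, 0.01), rnorm(20, 0.05, 0.01)),
  C2 = c(rnorm(20, 0, 0.01), rnorm(20, 0.03, 0.01)),
  Group = rep(c("PopA", "PopB"), each = 20)
)

# One centroid (mean position) per population
centroids <- aggregate(cbind(C1, C2) ~ Group, data = mds, FUN = mean)

# Individuals as small colored points, labels only at the centroids
plot(mds$C1, mds$C2, col = as.integer(factor(mds$Group)), pch = 16, cex = 0.5,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(centroids$C1, centroids$C2, labels = centroids$Group, font = 2)
```

With many populations the centroid labels can still overlap, which is the crowding problem mentioned above.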
Recently, to address these issues, I decided to use a ‘utilization distribution’ method which I saw in one of the ‘genetic map of Europe’ papers. The logic here is simple.
1) First, take the density distribution of the points on the plot by category and ‘smooth’ it. Basically this creates a continuous distribution where there was a discontinuous one.
2) Then demarcate the central ~90% area as the bounds of the population distribution. Color these bounding lines differently for each population.
Below you see the results:
Obviously there are some kinks to be worked out. But you see two things. First, some groups are clearly subsets of other groups in their distribution. This is very hard to discern in the other visualization methods above. Second, these plots take density into account, so you aren’t distracted by outliers (which may reflect mislabeling by the analyst or by the original collector of the samples).
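For anyone who wants to see the two steps (smooth, then bound the central ~90%) without a habitat package, here is a rough sketch using kde2d from MASS on a single toy cluster. This is a stand-in technique, not what adehabitat does internally, and it only accounts for density inside the grid, which by default spans the range of the data.

```r
library(MASS)  # provides kde2d; ships with standard R installations

set.seed(1)
x <- rnorm(200)  # toy cluster standing in for one population
y <- rnorm(200)

# Step 1: smooth the points into a continuous density on a regular grid
dens <- kde2d(x, y, n = 100)

# Step 2: find the density level whose contour encloses ~90% of the grid mass
z <- sort(as.vector(dens$z), decreasing = TRUE)
cum <- cumsum(z) / sum(z)
level90 <- z[which(cum >= 0.90)[1]]

plot(x, y, pch = 16, cex = 0.4, col = "grey")
contour(dens, levels = level90, drawlabels = FALSE, add = TRUE,
        lwd = 2, col = "red")
```

Repeating this per population, with a different contour color for each, gives roughly the kind of picture described above.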
My ultimate aim is to develop a script which will place the text near the suitable distribution zone, without crowding out other text. I have some ideas of how to do this “on the fly,” but it will take time to implement. Until then some of you may want to know a bit about the packages used for the above.
First, install the adehabitat package in R. Actually, you may want to install the various Tcl development packages first, because adehabitat won’t install without them. Once you have that, you need data. I assume you can generate the MDS results from PLINK as above. From those results you need three columns:
1) x
2) y
3) the identification
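As a sketch of how those three columns might be assembled: PLINK 1.x writes a whitespace-delimited .mds file with a header of FID, IID, SOL, C1, C2, and so on. The small file written below is only a stand-in for real PLINK output, and popMap is a hypothetical lookup from individual ID to population label that you would have to supply yourself.

```r
# Stand-in for a PLINK .mds file; real files come from PLINK's MDS output
mdsFile <- tempfile(fileext = ".mds")
writeLines(c(
  "FID IID SOL C1 C2",
  "F1 I1 0 0.012 -0.003",
  "F2 I2 0 0.015 -0.001",
  "F3 I3 0 -0.020 0.004"
), mdsFile)

# Hypothetical lookup from individual ID to population label
popMap <- data.frame(IID = c("I1", "I2", "I3"),
                     Group = c("PopA", "PopA", "PopB"))

# Read the coordinates, attach the labels, keep x, y and the identification
mds <- read.table(mdsFile, header = TRUE, stringsAsFactors = FALSE)
MDSData <- merge(mds, popMap, by = "IID")[, c("C1", "C2", "Group")]
```

The column names C1, C2, and Group match what the script that follows expects.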
Here’s some R that might help:
# MDSData is the data frame with the MDS data (columns C1, C2, Group)
library(adehabitat)
# cex of 0 draws invisible points; this plot only sets up the axes
cexValue=0
par(mar=c(0,0,0,0))
plot(MDSData$C1,MDSData$C2,cex=cexValue,xlab="Coordinate 1",ylab="Coordinate 2")
# process the data: keep only groups with at least 5 individuals
loc=subset(MDSData,Group %in% names(which(table(Group) >= 5)))
# use the same two coordinates that were plotted above
loc$X = loc$C1
loc$Y = loc$C2
# load ids
id = factor(loc$Group)
# create first parameter, two columns
loc=subset(loc,select=c(X,Y))
# kernel utilization distribution, one per group
vud=kernelUD(loc,id)
# 90% utilization contours
kVert=getverticeshr(vud, 90)
# I'm removing one of the populations
kVert[21]=NULL
kVertLength=length(attr(kVert,"names"))
plot(kVert, add=TRUE, lwd=2,colpol=NA,colborder=rainbow(kVertLength) )
groups=attr(kVert,"names")
legend('topright',groups,cex=.55,lty=1,lwd=3,col=rainbow(kVertLength) )

