A picture is worth a thousand words, part n



The caption:

The first column shows the theoretical expected PC maps for a class of models in which genetic similarity decays with geographic distance (see text for details). The second column shows PC maps for population genetic data simulated with no range expansions, but constant homogeneous migration rate, in a two-dimensional habitat. The columns marked Asia, Europe and Africa are redrawn from the originals of ref. 3 [this reference is to Cavalli-Sforza’s The History and Geography of Human Genes]. Each map is marked by which PC it represents. The order of maps in each of the last three columns was chosen to correspond with the shapes in the first two columns.

What does this mean? The authors say it best in the abstract:

Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events. These interpretations have been controversial, but influential, and the use of PCA has become widespread in analysis of population genetics data. However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.’s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events.


34 Comments

  1. Tx dog.

  2. @razib : That link doesn’t work. I get “Document Not Found”.

  3. link only works in firefox. if you don’t want to use firefox, just go to  
    http://tech.groups.yahoo.com/group/gnxpforum/ 
    under files check “razib_PCA”

  4. Why are the standards of expertise, scepticism, transparency and propriety so much higher in Genetics than in “Climate Science”? It’s rather dismaying for anyone (= me) whose background is in physical science.

  5. Razib: doesn’t seem to work with Firefox on Linux, either (I got it from the journal website though). 
     
    Generally, am I correct in interpreting the paper as follows: “If you apply PC maps to ‘random local diffusion’ models, you will get strong gradients and sinusoidal patterns, even if no real large-scale migration pattern exists” ? 
     
    Why are the standards of expertise, scepticism, transparency and propriety so much higher in Genetics than in “Climate Science”?  
     
    IMHO they’re not, it’s just that there are fewer loonies to confuse the issue in the eyes of the general public.

  6. Bob Sokal wrote a paper pointing out that lots of Luca’s PCs were likely edge effects and showing that completely random data, white noise, on continent maps generated the same patterns. This was in the late 80s. 
     
    And contra the abstract of the article, folks in human genetics have been using PCs since the 1960s; Luca came up with looking at one at a time, which I am not sure is an advance.
     
    Henry

  7. C-S’s patterns were plausible because most of them mapped already-suspected migrations / diffusions. At worst he was just dressing up a theory that was already there with statistical justifications. Is there any evidence that he tweaked his statistics until he got the maps he wanted?  
     
    Ruhlen’s linguistics is also a little shaky. A test of C-S’s hypothesis would be a genetic / linguistic look at the Burushaski, the Caucasians, and the Basques, because these three groups seem to be taken as survivors of pre-IE pre-Turkish pre-Sinitic peoples, probably the people who built Stonehenge (Gimbutas’s “Old Europeans”). On the other hand, at the distance of 5000 years plus, it’s unlikely that much trace remains linguistically or genetically.

  8. Generally, am I correct in interpreting the paper as follows: “If you apply PC maps to ‘random local diffusion’ models, you will get strong gradients and sinusoidal patterns, even if no real large-scale migration pattern exists” ? 
     
    yes.  
     
    Bob Sokal wrote a paper pointing out that lots of Luca’s PCs were likely edge effects and showing that completely random data, white noise, on continent maps generated the same patterns. This was in the late 80s. 
     
    right, yes, that’s cited in the abstract. but I don’t think these patterns are “edge effects”. this paper tells you mathematically why these artefacts arise when migration patterns are constant.

  9. right, yes, that’s cited in the abstract. but I don’t think these patterns are “edge effects”. this paper tells you mathematically why these artefacts arise when migration patterns are constant. 
     
    If not, what happens when your space is a circle or a torus? When the space gets larger and larger? 
     
    Henry

  10. from the paper: 
     
    if the similarity between two populations depends only on the geographic distance between them, and PCA is applied to populations that are regularly spaced within a linear, circular or two-dimensional habitat, then the resulting covariance matrices have very particular structures (known as Toeplitz, Circulant and Block Toeplitz with Toeplitz Blocks, respectively; Supplementary Fig. 3b), with eigenvectors that are sinusoidal functions (columns of the Discrete Cosine, Discrete Fourier and Two-Dimensional Discrete Cosine Transform matrices, respectively)
     
    so the shape of the habitat (in these ideal cases) determines the class of matrix, which in turn determines the structure of the eigenvectors. if I understand correctly, this should hold true regardless of the size of the matrix.
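
The claim quoted in comment 10 can be checked numerically. The sketch below is not from the paper: it assumes 50 equally spaced populations on a line and an exponential decay of covariance with distance (the decay rate 0.2 and all function names are illustrative). It builds the resulting Toeplitz matrix and counts how often each leading eigenvector changes sign along the line.

```python
import numpy as np

def toeplitz_covariance(n, decay=0.2):
    """Covariance between n equally spaced populations on a line,
    decaying exponentially with distance (an assumed decay law)."""
    idx = np.arange(n)
    return np.exp(-decay * np.abs(idx[:, None] - idx[None, :]))

def leading_eigenvectors(cov, k):
    """Return the k eigenvectors with the largest eigenvalues, as columns."""
    vals, vecs = np.linalg.eigh(cov)        # eigh sorts eigenvalues ascending
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest
    return vecs[:, order]

def sign_changes(v, tol=1e-9):
    """Count sign changes along a vector, ignoring near-zero entries."""
    s = np.sign(v[np.abs(v) > tol])
    return int(np.sum(s[1:] != s[:-1]))

n = 50
pcs = leading_eigenvectors(toeplitz_covariance(n), 4)
crossings = [sign_changes(pcs[:, j]) for j in range(4)]
print(crossings)
```

If the eigenvectors behave like the cosine modes described in the quote, the first should not change sign at all and each later one should cross zero once more than the one before it, which is the gradient-then-wave sequence seen in the PC maps. Changing `n` lets you see whether the pattern depends on the size of the matrix.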

  11. toto says: 
     
    IMHO they’re not, it’s just that there are fewer loonies to confuse the issue in the eyes of the general public. 
     
    Heh, I know what you mean. Seems that plenty in the “climate science” brigade have problems with principal components analysis and keep finding hockey sticks wherever they look. 
     
    Perhaps in another 10 years (or maybe even after solar cycle 24 really starts or if the trend of the earth’s rotation rate speeding up–however small the speedup is–continues) we will truly be able to state who the loonies are. 
     
    Personally, I prefer it warm anyway! As Leif Svalgaard says, these bones are getting older every year, and cold does them no good.

  12. so the shape of the habitat (in these ideal cases) determines the class of matrix, which in turn determines the structure of the eigenvectors. if I understand correctly, this should hold true regardless of the size of the matrix. 
     
    Yes, of course, I have been thinking about the eigenvalue spectrum. I ought to wait until later in the day, when I’m awake, to comment. 
     
    HCH

  13. STOP TALKING ABOUT CLIMATE SKEPTICISM please….

  14. anyone want to explain in plain terms what exactly this means? is it saying that human genetic variation is bigger and more abrupt than Cavalli-Sforza made it out to be, or smaller and less abrupt?

  15. Burushaski, the Caucasians, and the Basques, because these three groups seem to be taken as survivors of pre-IE pre-Turkish pre-Sinitic peoples, probably the people who built Stonehenge 
     
    Don’t think so, player. Why would you expect language isolates with no particular geographic proximity to Britain be any closer to ancient British languages than “winning” language families? No doubt dozens if not hundreds or thousands of language families lived and died in prehistoric Eurasia. Basque and Caucasian are not special except by virtue of being not quite dead yet. 
     
    (Gimbutas’s “Old Europeans”) 
     
    Gimbutas’s “Old Europe” = Neolithic southeastern Europe. Not Neolithic and Bronze Age Britain.

  16. is it saying that human genetic variation is bigger and more abrupt than Cavalli-Sforza made it out to be, or smaller and less abrupt? 
     
    i think this is most easily interpreted as orthogonal; it’s talking about a totally different question. consider the specific case of a SE -> NW PCA map in europe. cavalli-sforza et al. famously interpreted this as a signal of a neolithic demic diffusion which manifested as a range expansion/wave of advance. these results imply that it might simply be continuous & constant migration, as opposed to a signature of a discrete pulse 10-5 K BP. but in both cases the shape of genetic variation might be about the same, the “abruptness” is more of an issue over time than space.

  17. i think this is most easily interpreted as orthogonal; it’s talking about a totally different question. consider the specific case of a SE -> NW PCA map in europe. cavalli-sforza et al. famously interpreted this as a signal of a neolithic demic diffusion which manifested as a range expansion/wave of advance. 
     
    If I remember correctly the second European PC is NE->SW, and an edited volume from the UK was focused on interpreting it. The editors did not appreciate that it was simply at right angles to the first one and likely meant nothing. 
     
    Henry

  18. …ancient DNA extraction techniques came at just the right time. makes PCA maps for these ends less critical.

  19. I was reporting Cavalli-Sforza and Gimbutas, buddy, and as far as I know, you’re wrong about Gimbutas.  
     
    The theory would be that before the Indo-European expansion the earlier population was from a single earlier expansion, and that they survived in refuges in the Caucasus, the Basque country, and an area in Pakistan. These languages are isolates and the argument has been made that they’re all related. C-S seems to argue that the peoples are or should be genetically related too, which does seem like a stretch but which is testable. 
     
    There was recently something up on GNXP reporting that British ancestry is primarily from the pre-Celtic, pre-Anglo-Saxon, and probably pre-Indo-European inhabitants of the British Isles (the Stonehenge people). A relationship to the Basques was suggested. 
     
    All this is speculative, but C-S’s theory is extremely ambitious and if you abhor speculation you might as well not even look at it at all.

  20. There was recently something up on GNXP reporting that British ancestry is primarily from the pre-Celtic, pre-Anglo-Saxon, and probably pre-Indo-European inhabitants of the British Isles (the Stonehenge people). A relationship to the Basques was suggested. 
     
    1) yes, the probability is that it is pre-anglo-saxon (as in, most british ancestry is closer to the welsh & irish than it is to the dutch or danish, with the most affinity with the dutch & danish in east anglia). 
     
    2) there are a group of lineages (e.g., R1b?) associated with “atlantic” peoples. these sorts of data suggest a connection between iberian peoples and the british. there is also the documented association of celtiberian dialects with goidelic dialects in the british isles, as well as irish legends of emigration from the north coast of spain which attests to long term connections between the two areas along the maritime fringe. 
     
    3) since most indo-european speakers exhibit more genetic affinity with their non-indo-european neighbors than with other indo-european speakers it seems plausible to assume a prior that the spread of indo-european was not characterized, usually, by total genetic replacement.

  21. [offer you critiques in a non-asshole manner and i'll leave the comments. otherwise, bye-bye! regulars get to be jerks, newbies do not]

  22. @razib : When I tried that link, I was using Firefox 2.0.0.14 on Windows. (And it wasn’t working.)

  23. @razib : When I tried that link, I was using Firefox 2.0.0.14 on Windows. (And it wasn’t working.) 
     
    yeah, the link doesn’t work for me anymore either. don’t know. guess i can’t load them on the forum. will prolly just load them on my scienceblogs from now on….

  24. Hi all, 
     
    It’s nice to see discussion of the paper. I mainly wanted to step in to clarify two points brought up by Henry Harpending.  
     
    HH: Bob Sokal wrote a paper pointing out that lots of Luca’s PCs were likely edge effects and showing that completely random data, white noise, on continent maps generated the same patterns.  
     
    In his 1989 paper, Sokal shows that white noise will produce patterns like those in Cavalli-Sforza et al’s maps by mimicking their original approach including, crucially, interpolating the allele frequency data before PCA is applied. If you apply PCA directly to white noise you will actually -not- see these patterns. The main focus of his critique is the effects of interpolation, and the paper is very nice for the clever empirical ways it uses to show how interpolation can drive these patterns.  
     
    Our paper explains the mathematical basis for why these potentially misleading patterns appear – with or without interpolation. The key feature of the data is the spatial correlation pattern. If the allele frequency data are spatially correlated (whether it is present in the raw data or induced, so to speak, by spatially smoothing your data), these patterns can emerge.  
     
    We address the implications of the interpolation step in CS et al’s analysis briefly in the paper: 
     
    Furthermore, because Cavalli-Sforza et al. used spatial interpolation to estimate allele frequencies, their data could satisfy this condition [that genetic similarity decays with distance] even if the condition were absent in the underlying allele frequencies (Sokal et al 1999). (Use of interpolation may partly explain the similarity between Cavalli-Sforza et al.’s PC maps and those predicted by theory, particularly in Asia where their analysis was based on fewer samples. That said, recent analyses of European data without interpolation [Bauchet et al 2007] show perpendicular gradients in PC1 and PC2.)
     
     
    HH: And contra the abstract of the article, folks in human genetics have been using PCs since the 1960s; Luca came up with looking at one at a time, which I am not sure is an advance. 
     
     
    We hadn’t seen previous PCA references and had read he was the first to apply PCA to gene frequency data, so we may simply have not dug deep enough on this. However, in the History and Geography of Human Genes he describes developing the “synthetic maps” approach, and I think our statement that “he pioneered the use of PCA in population genetics” is still fair. 
     
    Thanks to HH and others for the comments.
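
The contrast drawn in comment 24 between raw white noise and spatially smoothed data can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reconstruction of either Sokal's or the authors' analysis: populations sit on a line, the "allele frequencies" are Gaussian noise, and a 9-point moving average stands in for interpolation. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pops, n_loci, window = 40, 2000, 9

def pc1_scores(freqs):
    """PC1 score of each population; freqs has shape (n_pops, n_loci)."""
    centered = freqs - freqs.mean(axis=0)            # centre each locus
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def neighbour_correlation(v):
    """Correlation between each population's score and its neighbour's."""
    return float(np.corrcoef(v[:-1], v[1:])[0, 1])

# White noise: every population/locus value is drawn independently.
noise = rng.normal(size=(n_pops + window - 1, n_loci))

# Moving average along the spatial axis (assumed stand-in for interpolation).
# mode="valid" returns exactly n_pops rows with no padding at the edges.
kernel = np.ones(window) / window
smoothed = np.apply_along_axis(
    lambda col: np.convolve(col, kernel, mode="valid"), 0, noise)

raw_r = neighbour_correlation(pc1_scores(noise[:n_pops]))
smooth_r = neighbour_correlation(pc1_scores(smoothed))
print(f"white noise PC1 neighbour correlation: {raw_r:.2f}")
print(f"smoothed    PC1 neighbour correlation: {smooth_r:.2f}")
```

A neighbour correlation near zero means PC1 jumps around from one population to the next; a value near one means PC1 varies smoothly across space, as a gradient would. Nothing in the simulated data migrates anywhere, so any smooth pattern in the second case comes from the spatial correlation alone.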

  25. consider the specific case of a SE -> NW PCA map in europe. cavalli-sforza et al. famously interpreted this as a signal of a neolithic demic diffusion which manifested as a range expansion/wave of advance. these results imply that it might simply be continuous & constant migration, as opposed to a signature of a discrete pulse 10-5 K BP. but in both cases the shape of genetic variation might be about the same, the “abruptness” is more of an issue over time than space. 
     
    I think it’s worse than that. Apparently the same results would appear even if there is no large-scale migration pattern at all, regardless of whether it’s continuous or discrete in time. Random local diffusion, completely symmetric, and thus creating no “real” pattern, would still create a (sinusoidal) gradient on the PC maps, as long as neighbouring people have similar genomes.  
     
    Or in short: C-S’s maps, by themselves, are no evidence that people (or alleles) moved in one direction (say, SE->NW) more than the other, independently of timescales. In fact it’s no evidence that people moved at all, beyond their immediate neighbourhood. 
     
    Or am I hopelessly confused?

  26. We hadn’t seen previous PCA references and had read he was the first to apply PCA to gene frequency data, so we may simply have not dug deep enough on this. However, in the History and Geography of Human Genes he describes developing the “synthetic maps” approach, and I think our statement that “he pioneered the use of PCA in population genetics” is still fair. 
     
    Here is an old paper that uses PCs: mine because the reference is handy. There was a cottage industry of this stuff in the 1970s. Luca did come up with the idea of plotting one pc at a time as a synthetic map. 
     
    Henry 
     
    Harpending, H.C. and T. Jenkins. 1973. Genetic distance among southern African populations, in Method and Theory in Anthropological Genetics, edited by M. Crawford and P. Workman, Albuquerque: University of New Mexico Press.

  27. there is also the documented association of celtiberian dialects with goidelic dialects in the british isles, as well as irish legends of emigration from the north coast of spain which attests to long term connections between the two areas along the maritime fringe. 
     
    While the rest of your post is correct, Razib, this part is totally speculative:  
     
    1. Iberian Celtic is very poorly known and only two languages have left some (few) short texts:  
     
    1.1. Celtiberian: Q-Celtic (i.e. like Goidelic or rather closer to proto-Celtic). 
     
    1.2. Lusitanian, which may not even be Celtic after all. Lusitanian is P-something. But P-Celtic languages are believed to be of much later evolution, so some speculate it’s some other Indo-European, not exactly Celtic, or such an old proto-Celtic that had not yet lost its P into Q. 
     
    The issue is that linguists believe that original Indo-European P evolved into Q (written as K or C) in proto-Celtic and early Celtic but later evolved again into P in Brythonic and Gaulish (but not in Goidelic or Celtiberian). Lusitanian would be too old to be P-Celtic, but the coincidence of Q sound between Celtiberian and Goidelic is probably just a remnant of the original Celtic pronunciation.  
     
    2. Iberian Celts had no direct connection with the British Islands (at least no connection that makes any sense archaeologically). Celts arrived in NE Iberia long before (c. 1300 BCE) their cousins arrived in Britain and were eventually cut off from the continent in the 6th century BCE by some Iberian “reconquista”. Isolated in Central and Western Iberia they never obtained the La Tène culture nor Druidism.  
     
    Instead there’s a lot of archaeological evidence for connections between Iberia and the British Islands (and other Atlantic areas) before Indo-European conquest: Megalithism and Bell Beaker phenomena, certainly, but also as recently as the Late Bronze Age (Atlantic Bronze). Once the Celts invaded Western Iberia c. 700 BCE, these contacts were broken, except for the occasional Phoenician sailor (tin traders).  
     
    If the Irish legends mean something, they must refer to events prior to Celtization (i.e. Indo-Europeization, as Celts were the western vanguard of Indo-European penetration) of the island.

  28. On the issue of PC-mapping.  
     
    I find that when comparing Cavalli-Sforza’s PC1 with the much more recent work of Bauchet et al., you get the same component (more or less). Yet when you dig into the alternative/parallel Bayesian K-means clustering, you realize that the PC1 is quite the equivalent of the red Eastern Med component in the K=2 level. But this component is not equally present in all groups that show (apparent) high apportion of it at K=2 level. When you look at K=5 or K=6, you realize that the relatively high apparent PC1/red component is actually extremely low for some of the samples.  
     
    So it’s not just that PC1 is what you see (either in the map or the PC graph) but something very shallow, that only considers the main elements continent-wise, ignoring the specific main elements regionally, at population or ethnic level. The most striking case is surely the Basque sample, where the red component (equivalent to PC1) and the blue one (equivalent, together with the red, to PC2) are virtually absent. What is “major” for all Europe, happens to be “trivial” for the Basques (and pretty minor for other western groups).  
     
    So, while PCs may reflect something, that something is way too diffuse and hard to interpret. In comparison, K-means clustering, especially when deep enough, seems to give much clearer results. Yes or yes?

  29. I’m still working on this. C-S’s argument coordinates four types of data: genetic distributions, language distributions, the archeological record, and (after about 500 BC) the historical record. The hypothesis of migrations rather than diffusion is not entirely dependent on genetic data; genetic data has been added to already existing data (for example, about the Greek, Germanic, Celtic, and original Indo-European migrations).  
     
    In other words, if the genetic data are consistent either with migration or with gradual diffusion, in some cases there are other ways of deciding the question. On the other hand, if it is being said that C-S’s data are purely artefacts consistent with gradual diffusion, but not with migration, then C-S’s theory is hurt. 
     
    Is it primarily a question of time-frame? Because in some of the best-studied events (e.g. the Gothic movement South) the process definitely went on over a period of centuries: it was a series of events rather than a single invasion (though some events in the series might have been invasions). The Goths were expanding, and after they saturated the north-flowing river valleys they hopped into the south-flowing river valleys. In the South they came into conflict with the Huns and Alans (steppe peoples) and the Roman empire and formed a new political entity which was of mixed descent from the beginning. (At the same time, the Gothic areas on the Baltic tended to become Slavic, Baltic (Lithuanian), etc., if I’m not mistaken).  
     
    In short, there was not one big event when all the Goths moved south; it was more amoeba-like, with an exploratory arm going South and the rest following a bit at a time, with the laggards probably just absorbed by the Slavs and Lithuanians.  
     
    Another people whose history is known is the Ossetes in the Caucasus. They trace back to the Scythians and Sarmatians, the first nomads. They dominated the steppe from about Rumania to about the Caspian starting around 500 BC and were a real threat to the Persians and Greeks. After 100 AD or so the Goths and Huns appeared from the north and east. Some Scythians, now called Alans and Aas, went into Roman service and left many traces in France, England, and Italy. Others took refuge in the Caucasus. Centuries later they were still there when the Mongols showed up. Their state was crushed and many went into Mongol service, and the Catholic missionaries found a considerable group of them in China / Central Asia around 1300.  
     
    Often “migrations” consist of the violent replacement of leadership of a hybrid politico-military group, with the defeated people being absorbed into the victorious group in subordinate positions.  
     
    I’m not sure how this fits with the argument, but “migration events” and “invasions” are not all that different from diffusions, in that they can take place over centuries rather than in a short period; the distinction is that the form of political organization is changed, usually violently.

  30. I just realized that I have no idea what Luca’s maps actually portray: I dropped out of this business for a few decades and never paid much attention to this stuff. 
     
    If we have some kind of data matrix we can compute eigenvectors of covariances among genes and another set among populations. The first gives synthetic alleles, the second synthetic populations. We could then smooth and interpolate either one. Which one is Luca plotting in his book? I do know that one is called principal components, the other principal coordinates, but I don’t know which is which. 
     
    And what happens when one makes a map of the other one, the one Luca didn’t use? We used to do both because the axes are the same. 
     
    Thanks, Henry
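
The remark in comment 30 that "the axes are the same" for the two analyses can be made concrete. The sketch below uses a small random matrix in place of real gene-frequency data (the sizes, the seed and all variable names are assumptions) and compares the eigen-decomposition of the covariances among populations with that among loci. It does not settle which of the two is plotted in the book.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pops, n_loci = 6, 15

# Rows are populations, columns are loci; centre each locus across populations.
X = rng.random((n_pops, n_loci))
X = X - X.mean(axis=0)

pop_vals, pop_vecs = np.linalg.eigh(X @ X.T)   # among populations (6 x 6)
loc_vals, loc_vecs = np.linalg.eigh(X.T @ X)   # among loci (15 x 15)

# Centring removes one dimension, so at most n_pops - 1 eigenvalues are nonzero.
k = n_pops - 1
same_spectrum = np.allclose(pop_vals[::-1][:k], loc_vals[::-1][:k])

# Projecting the populations onto the leading locus eigenvector recovers the
# leading population eigenvector, up to sign, after rescaling.
v1 = loc_vecs[:, -1]
u1 = pop_vecs[:, -1]
projected = X @ v1 / np.sqrt(loc_vals[-1])
same_axis = np.allclose(abs(projected @ u1), 1.0)

print(same_spectrum, same_axis)
```

Both analyses are two views of the same singular value decomposition of the centred data matrix, which is why either set of eigenvectors can be obtained from the other.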

  31. Or in short: C-S’s maps, by themselves, are no evidence that people (or alleles) moved in one direction (say, SE->NW) more than the other, independently of timescales. In fact it’s no evidence that people moved at all, beyond their immediate neighbourhood. 
     
    yes, that was my impression.

  32. John Emerson, 
    migration is 1-way; diffusion is 2-way. That’s what I would say is the difference. I’m not sure how I’d detect that from PCA.  
     
    Also, migration is faster than diffusion. That should show up. An east-west migration would make the east-west genetic difference smaller, hence decreasing the weight of that component, making the north-south gradient show up first. I bet a lot of people would give the exact opposite interpretation.

  33. It’s pretty well established that some migrations were very gradual, and also that “migration” might just mean the establishment of political control of an area by outsiders (without much population substitution).
