One K to rule them all?

A friend recently emailed to ask about the best way to pick a proper “K” value when inferring structure. K is just the parameter which defines how many putative ancestral populations you have in your model to explain some data on genetic variation. Obviously some values of K are more informative about population history than others.

For example, if you had 100 Swedes and 100 Yoruba Nigerians, to model the population structure you could select K = 2 or K = 50. The algorithm would produce results in the latter case, but you “know” a priori that K = 2 is a really good model of the population history in a straightforward interpretable sense. There’s just not that much more juice to squeeze out of this sort of data with many clustering methods.

But it’s harder when you have population structure in organisms which we don’t know much about aside from the genetic data. How does one “objectively” select a K? The most common method is outlined in a 2005 paper, Detecting the number of clusters of individuals using the software structure: a simulation study:

The identification of genetically homogeneous groups of individuals is a long standing issue in population genetics. A recent Bayesian algorithm implemented in the software structure allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that structure accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.
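To make the ΔK idea in that abstract concrete, here is a minimal sketch. It assumes the form in which the statistic is commonly implemented: the absolute second difference of the mean log probability across neighboring K values, divided by the standard deviation of the log probability across replicate runs at that K. The function name and the numbers are made up for illustration; this is not code from Structure or from the paper.

```python
from statistics import mean, stdev


def delta_k(log_probs):
    """Return {K: delta K} given {K: [log probability from each replicate run]}."""
    result = {}
    for k in sorted(log_probs):
        # The statistic needs both neighbors, so it is undefined at the
        # smallest and largest K that were run.
        if k - 1 not in log_probs or k + 1 not in log_probs:
            continue
        below = mean(log_probs[k - 1])
        here = mean(log_probs[k])
        above = mean(log_probs[k + 1])
        # Needs at least two runs per K; a spread of zero would divide by zero.
        spread = stdev(log_probs[k])
        result[k] = abs(above - 2 * here + below) / spread
    return result


# Hypothetical log probabilities, three replicate runs at each K.
example_runs = {
    1: [-1000, -1002, -998],
    2: [-800, -801, -799],
    3: [-790, -792, -788],
    4: [-785, -790, -780],
}

scores = delta_k(example_runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy input the log probability keeps creeping upward past K = 2, but the rate of change collapses there, which is what the statistic picks up.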

There’s an old saying, “garbage in, garbage out.” The method of ΔK is useful as far as it goes, but as inputs it takes the log likelihoods from the Structure program. For Admixture you can look at cross-validation. But these statistics are subject to various assumptions and approximations (in addition, some of the priors within the clustering algorithms are gross simplifications).

This is one reason I was excited about Estimating the Number of Subpopulations (K) in Structured Populations:

A key quantity in the analysis of structured populations is the parameter K, which describes the number of subpopulations that make up the total population. Inference of K ideally proceeds via the model evidence, which is equivalent to the likelihood of the model. However, the evidence in favor of a particular value of K cannot usually be computed exactly, and instead programs such as Structure make use of heuristic estimators to approximate this quantity. We show—using simulated data sets small enough that the true evidence can be computed exactly—that these heuristics often fail to estimate the true evidence and that this can lead to incorrect conclusions about K. Our proposed solution is to use thermodynamic integration (TI) to estimate the model evidence. After outlining the TI methodology we demonstrate the effectiveness of this approach, using a range of simulated data sets. We find that TI can be used to obtain estimates of the model evidence that are more accurate and precise than those based on heuristics. Furthermore, estimates of K based on these values are found to be more reliable than those based on a suite of model comparison statistics. Finally, we test our solution in a reanalysis of a white-footed mouse data set. The TI methodology is implemented for models both with and without admixture in the software MavericK1.0.

The website for MavericK 1.0 is informative if you don’t have academic access.

Unfortunately, and probably not surprisingly, this method is not scalable to genomic data sets. E.g., they’re looking at 10, 20 or 50 loci. A “modest” human genotyping array will provide you with tens of thousands of loci (SNPs). A “standard” array will provide you with on the order of 500,000 SNPs.

But the conclusion of the paper is worth keeping in mind:

Finally, it is important to keep in mind that when thinking about population structure, we should not place too much emphasis on any single value of K. The simple models used by programs such as Structure and MavericK are highly idealized cartoons of real life, and so we cannot expect the results of model-based inference to be a perfect reflection of true population structure (see discussion in Waples and Gaggiotti 2006). Thus, while TI can help ensure that our results are statistically valid conditional on a particular evolutionary model, it can do nothing to ensure that the evolutionary model is appropriate for the data. Similarly—in spite of the results in Table 2—we do not advocate using the model evidence (estimated by TI or any other method) as a way of choosing the single “best” value of K. The chief advantage of the evidence in this context is that it can be used to obtain the complete posterior distribution of K, which is far more informative than any single point estimate. For example, by averaging over the distribution of K, weighted by the evidence, we can obtain estimates of parameters of biological interest (such as the admixture parameter a) without conditioning on a single population structure. Although one value of K may be most likely a posteriori, in general a range of values will be plausible, and we should entertain all of these possibilities when drawing conclusions.

Amen!
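The point about averaging over K rather than picking one value can be sketched in a few lines. This assumes you already have a log evidence estimate for each K (from TI or anything else) and, unless told otherwise, a uniform prior over K. The function names, the log evidence values, and the per-K parameter estimates are all hypothetical.

```python
import math


def posterior_over_k(log_evidence, log_prior=None):
    """Turn {K: log evidence} into a normalized posterior {K: probability}."""
    ks = sorted(log_evidence)
    if log_prior is None:
        log_prior = {k: 0.0 for k in ks}  # assumption: uniform prior over K
    scores = [log_evidence[k] + log_prior[k] for k in ks]
    top = max(scores)  # subtract the max before exponentiating to avoid underflow
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {k: w / total for k, w in zip(ks, weights)}


def model_average(posterior, estimates):
    """Average a per-K parameter estimate, weighted by the posterior over K."""
    return sum(posterior[k] * estimates[k] for k in posterior)


# Hypothetical log evidence values and per-K estimates of some parameter.
log_evidence = {1: -510.0, 2: -500.0, 3: -501.0, 4: -503.0}
estimates = {1: 0.10, 2: 0.20, 3: 0.25, 4: 0.30}

posterior = posterior_over_k(log_evidence)
averaged = model_average(posterior, estimates)
print(posterior, averaged)
```

Even in this made-up case K = 2 is the most probable value, but K = 3 keeps a sizable share of the posterior, which is the sense in which a range of values stays plausible.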

Inheritances in conflict?

Evolutionary theory famously predated the emergence of genetics by decades. Initially there was some conflict between the heirs of Charles Darwin and the first geneticists in terms of their mechanistic understanding of how the evolutionary process occurs. Within a few decades though genetics and evolutionary biology were synthesized so that the former came to be integral toward understanding the processes and parameters which shape the character of the latter (see The Genetical Theory of Natural Selection). E.g., imagine attempting to understand the origins and maintenance of sexual reproduction without any genetic understanding of the determination of sex and its implications for transmission.

But obviously genes are not everything when it comes to phenotypes. In particular with humans, there are complex behaviors and social interactions which seem to be persistent, and perhaps adaptive, which may not be directly contingent upon any simple genotype-phenotype map. This is not to say that cultural and behavioral traits have no genetic basis. To give an example, religion is a complex phenomenon which is both universal and does not seem directly encoded in one’s genes. The search for a “god gene” is futile, because religion as a phenotype is mediated by innumerable other phenotypes, which themselves have complex genetic bases.

Though culture is contingent upon genes, it exhibits a character which is separable from genetic evolution. In particular, dual inheritance theory explicitly acknowledges that human cultural variation over time and space is a function of the interaction between both cultural and genetic evolution. Though there are similarities between the two, and in fact the field of cultural evolution consciously utilizes much of the same formalism as population and quantitative genetics, the modes of inheritance and nature of the origination and perpetuation of variation of the two differ a great deal.

As a rule of thumb you can posit that genetic evolution is relatively slow and torpid in relation to cultural evolution, which is protean and quicksilver. Consider that lactase persistence and high altitude adaptations are the two fastest cases we know of in human genetics, and they occur on 1,000 year time scales. A 1,000 year time scale takes you from Julius Caesar to Otto the Great. It takes you from the first of the Mycenaeans to the Athens of Pericles.

The differences between culture and genes are important to keep in mind when one is making predictions. I’m a big fan of the Eric Kaufmann book, Shall the Religious Inherit the Earth?: Demography and Politics in the Twenty-First Century. The model outlined within the book, higher fertility for religious people, ergo, the reemergence of religion, is logically plausible. But I must always remind people that the same concerns were prevalent in France before 1850, with the arrival of more traditional Roman Catholics into a milieu which had notably secularized and undergone early demographic transition. Why is France today not a uniformly Catholic republic? First, there is history: the migration of Muslims from North Africa. But even more important is cultural evolution, as the descendants of Spaniards, Poles, and Italians secularized.

There is though a difference between description and formal modeling. The field of cultural evolution attempts to do the latter. There are several lay and specialist introductions to the field (just click some of the book links and you’ll find them all). It’s worth attempting to grapple with the domain in a more systematic way, because that’s the only way you can make predictions which make sense of the diversity we see around us.

A new preprint is an interesting addition to the literature, Gene-culture co-inheritance of a behavioral trait:

Human behavioral traits are complex phenotypes that result from both genetic and cultural transmission. But different inheritance systems need not favor the same phenotypic outcome. What happens when there are conflicting selection forces in the two domains? To address this question, we derive a Price equation that incorporates both cultural and genetic inheritance of a phenotype where the effects of genes and culture are additive. We then use this equation to investigate whether a genetically maladaptive phenotype can evolve under dual transmission. We examine the special case of altruism using an illustrative model, and show that cultural selection can overcome genetic selection when the variance in culture is sufficiently high with respect to genes. Finally, we show how our basic result can be extended to nonadditive effects models. We discuss the implications of our results for understanding the evolution of maladaptive behaviors.

The most relevant section is probably 3.2 Model 2: Cultural prisoner’s dilemma. If you don’t know what the Price Equation is, read the original paper. It will induce some clarity.
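For anyone who wants to see the Price equation do its bookkeeping before reading the paper, here is a small numeric check of the general form: the change in the mean trait equals a selection term, Cov(w, z) / mean(w), plus a transmission term, E(w Δz) / mean(w). This is the textbook equation, not the gene-culture version derived in the preprint, and the trait values, fitnesses, and function names are hypothetical.

```python
def price_terms(z, w, z_offspring):
    """Return (selection term, transmission term) of the Price equation.

    z           -- trait value of each parent
    w           -- fitness (number of offspring) of each parent
    z_offspring -- mean trait value among each parent's offspring
    """
    n = len(z)
    w_bar = sum(w) / n
    z_bar = sum(z) / n
    # Population covariance between fitness and trait value.
    cov_wz = sum((wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z)) / n
    # Fitness-weighted expected change between parent and offspring.
    exp_w_dz = sum(wi * (zo - zi) for wi, zi, zo in zip(w, z, z_offspring)) / n
    return cov_wz / w_bar, exp_w_dz / w_bar


def direct_change(z, w, z_offspring):
    """Change in the mean trait computed directly from the offspring generation."""
    new_mean = sum(wi * zo for wi, zo in zip(w, z_offspring)) / sum(w)
    return new_mean - sum(z) / len(z)


# Hypothetical population: z = 1 marks altruists, who have lower fitness,
# while biased transmission nudges the offspring of non-altruists upward.
z = [0.0, 0.0, 1.0, 1.0]
w = [1.5, 1.5, 0.5, 0.5]
z_offspring = [0.2, 0.2, 1.0, 1.0]

selection, transmission = price_terms(z, w, z_offspring)
change = direct_change(z, w, z_offspring)
print(selection, transmission, change)
```

Here selection works against the trait (the covariance is negative) while transmission works for it, and the two terms add up exactly to the change computed the long way.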

The fact that more variance in culture in relation to genes allows for selection to act more powerfully on culture, and arguably in a maladaptive manner from the gene-centric perspective, is no surprise. This preprint adds more precision and clarity. For adaptation to occur there needs to be heritable variation. One reason that cultural group selection is more plausible than genetic group selection is that genetic variation across demes is often very low. The Fst between racial groups may be 0.10 to 0.30, but it is not very common for such Fst values to be realized between two groups genuinely in competition. More often neighboring populations have much lower Fst values, though ancient DNA is suggesting that 0.05 to 0.10 values were maintained in some areas 5 to 10 thousand years ago. A simple population genetic rule of thumb is that one needs to have less than one migrant between two populations per generation for the genetic variation between them to increase, rather than decrease. In other words, minimal gene flow on a general scale quickly reduces between group genetic variance.

In contrast, cultural variation can be maintained because migrants can switch cultures, or, their genetic progeny can adopt the culture of one of the parents in totality. In this way the later Ottoman Sultans and Umayyad rulers of Al-Andalus had been genetically transformed by generations of mixing with concubines derived from Europeans or Caucasians (i.e., those from the Caucasus), while remaining culturally very Turk and Arab respectively.
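The one-migrant rule of thumb mentioned above is easier to appreciate with numbers. A minimal sketch, assuming Wright's classic island-model approximation Fst ≈ 1 / (1 + 4Nm), where Nm is the number of migrants per generation. That formula is my choice of illustration rather than anything from the preprint, and it rests on idealized assumptions (many equal-sized demes, neutrality, equilibrium). The function name is hypothetical.

```python
def island_model_fst(migrants_per_generation):
    """Equilibrium Fst under Wright's island-model approximation, 1 / (1 + 4Nm)."""
    if migrants_per_generation < 0:
        raise ValueError("number of migrants per generation cannot be negative")
    return 1.0 / (1.0 + 4.0 * migrants_per_generation)


# Expected between-group differentiation for a few migration rates.
migration_rates = [0.1, 0.5, 1.0, 5.0, 10.0]
fst_by_rate = {nm: island_model_fst(nm) for nm in migration_rates}
for nm, fst in fst_by_rate.items():
    print(f"Nm = {nm:>4}: Fst ~ {fst:.3f}")
```

Under these assumptions a single migrant per generation already pulls the expected Fst down to 0.2, and ten migrants pull it below 0.03, which is why between-group genetic variance is so hard to maintain among neighbors.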

As noted in the preprint, this formal/theoretical avenue of research will allow for the development of a robust empirical research program. The data is out there.

Genetic structure matters

Recently Daniel Falush’s group came out with a preprint, A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. If you read the science posts on this weblog (basically, if you read this weblog), and you haven’t read it, read it now.

At his weblog, Paint My Chromosomes, Falush has talked about both the production of the preprint (I had a minor stimulatory role), and the attempt to get it published somewhere. This reaction is strange to me:

We also had our first journal rejection, from eLife. It has not been my habit to live-tweet journal rejections and am not intending to start now. I am a journal editor myself and do not think the process would benefit from being turned into a public performance. I was disappointed because eLife claims to hold itself to higher standards, trying to change publication by judging papers on their true worth rather than on simple measures of impact and also because the reason given was silly:

“..but feel that the target audience is a rather specialised one.”

Of course I’m biased. But this strikes me as crazy. The third most cited paper in the history of the journal Genetics is Jonathan Pritchard’s Inference of Population Structure Using Multilocus Genotype Data. Take a look at the list, and note the papers that it is more cited than (e.g., a Sewall Wright paper from 1931, and Tajima’s 1989 paper!).

To be sure, the number of times that a paper is cited is not a good measure of how often it is read and understood. And that’s kind of the point of Falush’s preprint, to actually give some guidance to people who use model based clustering in a turnkey fashion without any deep comprehension of its limitations and biases. The nuts & bolts of the inferences of population structure may be specialized, but analysis of structure is a routine part of many different types of papers, in particular in medical genetics where variants may have different effects in different genetic backgrounds.

The ancient of days

Probably the most incredible science story of the week, Eye lens radiocarbon reveals centuries of longevity in the Greenland shark (Somniosus microcephalus):

The Greenland shark (Somniosus microcephalus), an iconic species of the Arctic Seas, grows slowly and reaches >500 centimeters (cm) in total length, suggesting a life span well beyond those of other vertebrates. Radiocarbon dating of eye lens nuclei from 28 female Greenland sharks (81 to 502 cm in total length) revealed a life span of at least 272 years. Only the smallest sharks (220 cm or less) showed signs of the radiocarbon bomb pulse, a time marker of the early 1960s. The age ranges of prebomb sharks (reported as midpoint and extent of the 95.4% probability range) revealed the age at sexual maturity to be at least 156 ± 22 years, and the largest animal (502 cm) to be 392 ± 120 years old. Our results show that the Greenland shark is the longest-lived vertebrate known, and they raise concerns about species conservation.

Elizabeth Pennisi has a nice write-up, Greenland shark may live 400 years, smashing longevity record:

…Using this technique, the researchers concluded that two of their sharks—both less than 2.2 meters long—were born after the 1960s. One other small shark was born right around 1963.

The team used these well-dated sharks as starting points for a growth curve that could estimate the ages of the other sharks based on their sizes. To do this, they started with the fact that newborn Greenland sharks are 42 centimeters long. They also relied on a technique researchers have long used to calculate the ages of sediments—say in an archaeological dig—based on both their radiocarbon dates and how far below the surface they happen to be. In this case, researchers correlated radiocarbon dates with shark length to calculate the age of their sharks. The oldest was 392 plus or minus 120 years, they report today in Science. That makes Greenland sharks the longest lived vertebrates on record by a huge margin; the next oldest is the bowhead whale, at 211 years old. And given the size of most pregnant females—close to 4 meters—they are at least 150 years old before they have young, the group estimates.

How science is done

A follow up on the Ancient Archaic Admixture Into the Andamanese story, No evidence for unknown archaic ancestry in South Asia:

Genomic studies have documented a contribution of archaic Neanderthals and Denisovans to non-Africans. Recently, Mondal et al. 2016 (Nature Genetics, doi:10.1038/ng.3621) published a major dataset–the largest whole genome sequencing study of diverse South Asians to date–including 60 mainland groups and 10 indigenous Andamanese. They reported analyses claiming that nearly all South Asians harbor ancestry from an unknown archaic human population that is neither Neanderthal nor Denisovan. However, the statistics cited in support of this conclusion do not replicate in other data sets, and in fact contradict the conclusion.

Last I heard they hadn’t released the bam files. Mistakes are made, that’s how science is done, and other people help in the process of correction. But, it is starting to get worrisome to me to see papers with bioinformatic errors being published in high impact journals.

Open Thread, 8/10/2016

Sorry about the light posting. I’ll get back into gear in a few days. Very busy professionally and personally the past week or so.

I’ve been getting into writing Python code, as opposed to reading it. It’s a different beast altogether, obviously. I’m a lot slower than I would be in Perl, but I’m getting stuff done, so that’s something. I would highly recommend Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, if you have a background in R and another scripting language.

I went to my high school reunion. It was fun and interesting. Apparently people change in a few decades…

Population genetics in the post-molecular age

An excellent open access review of population genetics history from 1966 to the present in Heredity, Population genetics from 1966 to 2016. From the abstract:

We describe the astonishing changes and progress that have occurred in the field of population genetics over the past 50 years, slightly longer than the time since the first Population Genetics Group (PGG) meeting in January 1968. We review the major questions and controversies that have preoccupied population geneticists during this time (and were often hotly debated at PGG meetings). We show how theoretical and empirical work has combined to generate a highly productive interaction involving successive developments in the ability to characterise variability at the molecular level, to apply mathematical models to the interpretation of the data and to use the results to answer biologically important questions, even in nonmodel organisms. We also describe the changes from a field that was largely dominated by UK and North American biologists to a much more international one (with the PGG meetings having made important contributions to the increased number of population geneticists in several European countries). Although we concentrate on the earlier history of the field, because developments in recent years are more familiar to most contemporary researchers, we end with a brief outline of topics in which new understanding is still actively developing.

Charlesworth & Charlesworth are giants in the field, and they’ve seen a lot of changes over the past few decades. If you are inclined toward a deeper exploration of population genetics with an evolutionary focus, then Elements of Evolutionary Genetics is the book for you.

A different kind of pandas

For the past few days I’ve been using the Python data analysis library, or “pandas.” Most of the time I work with Perl, R, and shell scripting. But the Perl/R combination has gotten to be pretty unwieldy recently, and some of my coworkers swear by pandas. So in the interest of firm cohesion I converted to their sect. Honestly it wasn’t hard, as I’ve always enjoyed reading Python code, though I haven’t written much before.

So far I’ve mostly been relying on Wes McKinney’s Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. If Python aficionados have other recommendations, I am open to them.

If I need to do text processing fast & in a pinch, I can still see falling back on Perl. And R is still a great environment to work in; I won’t be uninstalling RStudio anytime soon.

Cheaters prosper, Bayesians don’t


I don’t follow cycling closely, but I once praised Lance Armstrong, who I had read about in the media, to a friend who had been a journeyman professional in the sport in the late 1990s. My friend expressed some irritation, shrugged, and told me that everyone in the sport knew that Armstrong doped. He didn’t seem to want to talk about it in any depth, as he’d left the sport anyhow, and I didn’t pursue the conversation any further. Honestly I wasn’t sure at the time whether my friend was correct, or, whether he was jealous. I assumed the former, but I didn’t totally discount the latter. How could I truly know at the time?

This was in the early 2000s. Obviously if my friend, who was very far down the rankings of competitors, knew this, many more did so. The media almost certainly suspected, but Armstrong was a great story, and most people didn’t have definitive proof. I thought of this when reading this piece in The New York Times, Clean Athletes, and Olympic Glory Lost in the Doping Era:

Babashoff arrived at the Montreal Olympics in 1976 with a chance to match the performance in 1972 of Mark Spitz, whose seven golds sealed his status as an American icon and propelled him into a career as a product pitchman. Babashoff, a teammate of Spitz’s at those Munich Olympics, swam significantly faster four years later only to settle for four Olympic silver medals and one relay gold. Her career path as a high-profile endorser and motivational speaker was blockaded by broad-backed, husky-voiced East Germans later found to have been unwitting victims of a government-sponsored doping program.

Shamed by the news media and shunned by swimming officials for pointing out her competitors’ cartoonish musculature and suggesting they were cheating, Babashoff retreated into a self-imposed, decades-long exile. She raised her son, Adam, now 30, as a single mother well out of the spotlight while working as a postal carrier in Huntington Beach, Calif.

“People knew what was going on at the time, they just didn’t know what to do about it,” Babashoff said. “It just seems so weird in this day and age that they can’t right the wrongs. It just seems like such an easy fix.”

“Well, except for their deep voices and mustaches, I think they’ll probably do fine,” she said. Her remarks were the beater that churned Cold War politics. Apologetic United States Olympic Committee officials sent the East German women flower arrangements, Babashoff wrote. In her book, Babashoff includes an open letter to Bach requesting that the female swimmers from the 1976 Olympics who finished behind the East Germans be awarded duplicate medals.

At the 1996 Olympics in Atlanta, the swimmer Allison Wagner finished second in the 400 individual medley to Michelle Smith, 26, of Ireland, whose winning time was 19.76 seconds faster than her 26th-place effort four years earlier at the Barcelona Olympics. Smith’s remarkable improvement at a relatively advanced age made her competitors suspicious.

…she had left the pool deck, panting from exhaustion, after the 400 I.M. final and had been cut off by Hungary’s Krisztina Egerszegi, the defending champion, whom she had defeated by five-tenths of a second. Wagner said: “She came right up to me and said: ‘Congratulations. You’re the true winner. I just want you to know that.’ I had never talked to her before in my life, and she said that to me.”

But when Wagner met with the news media shortly thereafter, she refrained from denigrating Smith or questioning her performance.

“I didn’t say anything because people in our swimming federation used to say to me, ‘You don’t want to be Surly Shirley, do you?’” she said, referring to Babashoff.

Depressing. But in sports where differentiation at the highest echelons can be split second, resulting in huge variation in monetary outcomes, I do wonder if there is a lot of subtle cheating which we can never even hope to detect.

The two Americas

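[Figure: bar plot of employment, labor force participation, and births to unmarried women by education level]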

The employment data above are from Randall Parker (seasonally adjusted for what it’s worth), and originally from the Labor Department. Randall had it as a tabular display, but I think a simple bar plot is more illustrative. The percentage of unmarried births is from the Census.

It looks like Americans with university degrees or higher are basically at full employment. Additionally, the substantial majority of Americans with university degrees or higher are in the labor force. In contrast, only a minority of Americans without high school diplomas, and only a simple majority of Americans with high school diplomas, are in the labor force.

Labor force participation is pretty straightforward. If you are looking for a job, or have a job, you are part of the labor force. Everyone else is part of the whole population (e.g., those who are homemakers, etc.).
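Since the two rates are easy to conflate, here is the arithmetic spelled out. This is a minimal sketch with made-up counts, not the Labor Department figures behind the chart. It assumes the usual definitions: the participation rate is the labor force over the whole population, and the unemployment rate is job seekers over the labor force only. The function name is hypothetical.

```python
def labor_force_stats(employed, looking_for_work, not_in_labor_force):
    """Return the participation and unemployment rates for one group."""
    labor_force = employed + looking_for_work
    population = labor_force + not_in_labor_force
    return {
        # Share of the whole population that has a job or is looking for one.
        "participation_rate": labor_force / population,
        # Share of the labor force (not the population) that lacks a job.
        "unemployment_rate": looking_for_work / labor_force,
    }


# Hypothetical group of 100 people: 60 employed, 5 looking, 35 doing neither.
stats = labor_force_stats(employed=60, looking_for_work=5, not_in_labor_force=35)
print(stats)
```

A group can therefore post a low unemployment rate while most of its members are outside the labor force entirely, which is why both numbers matter when comparing education levels.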

As for births to unmarried women, those with university degrees basically live in a different universe. I didn’t want to clutter the above chart any more, so I didn’t mention divorce. But you can see from the data to the left that college educated Americans tend to have very long marriages. In contrast, when the non-college do get married, divorce is rather common.

I’m pretty bullish on America, and the world. But that’s easy for me to say, since I am the sort of person who has more work than time, and my work is very fulfilling. Also, I’m married, with beautiful healthy children. I’m a lucky person, and the world seems charmed. It’s simply not in my interest to rock the boat.

But for those for whom only desperation stretches out before them, desperate acts can seem quite rational. Those with nothing to lose have nothing to lose.