Perhaps the Chinese government is not covering up the number of Covid-19 cases?

A big debate on the internet is whether China is covering up the number of cases of Covid-19 in Hubei, and more specifically Wuhan. Right now JHU says that China has 82,000 confirmed cases, as opposed to 300,000 in the USA. Both are underestimates, but there are those who believe that the Chinese death toll is not 3,000, but in the millions! I think a more sober take is that they could be underreporting by an order of magnitude. That being said, many epidemiologists believe that China’s numbers are roughly correct. And certainly, some demographic patterns seem to be robust and holding up (e.g., the proportion of the aged who die).

But there’s another way to estimate how many people were infected: look at the variation in the genome sequences of SARS-CoV-2 itself. The patterns of genetic variation in viruses that underwent a massive, rapid demographic expansion will differ from those in viruses maintained at a constant population size.
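This intuition can be made concrete with a toy coalescent simulation (an illustrative sketch of my own, not the Bedford group's model; the sample size, population size, and growth rate below are arbitrary). Looking backward in time, an expanding population forces coalescent events to pile up near the root, producing a star-like genealogy, whereas under a constant population size the deepest two-lineage interval tends to dominate the tree:

```python
import math
import random

def coalescent_intervals(n, n0, growth_rate, rng):
    """Simulate intercoalescent times for n lineages, backward in time.
    Population size is N(t) = n0 * exp(-growth_rate * t) looking backward,
    so growth_rate > 0 means the population was expanding forward in time."""
    times = []
    t = 0.0
    for k in range(n, 1, -1):
        lam = k * (k - 1) / (4.0 * n0)  # coalescent rate at backward time 0
        u = rng.expovariate(1.0)        # unit-rate exponential hazard
        if growth_rate == 0.0:
            dt = u / lam
        else:
            # invert the cumulative hazard of the time-inhomogeneous rate
            r = growth_rate
            dt = math.log(1.0 + r * u / (lam * math.exp(r * t))) / r
        times.append(dt)
        t += dt
    return times

def star_likeness(n, n0, growth_rate, reps=2000, seed=1):
    """Mean fraction of total tree depth taken by the final (2-lineage)
    interval: large for a constant-size population, small under growth."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ivals = coalescent_intervals(n, n0, growth_rate, rng)
        total += ivals[-1] / sum(ivals)
    return total / reps

ratio_constant = star_likeness(10, n0=1000, growth_rate=0.0)
ratio_growth = star_likeness(10, n0=1000, growth_rate=0.2)
```

Under the constant-size model the last interval accounts for roughly half the tree depth; under rapid growth it shrinks dramatically. That difference in tree shape is the signal phylodynamic methods exploit.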

From what I can see, Trevor Bedford and his group at UW have done the best and most thorough estimate of the number infected from the SARS-CoV-2 genomes, Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time.

Here is a part of the abstract and methods:

Here, we use a phylodynamic approach incorporating 53 publicly available novel coronavirus (nCoV) genomes to estimate the underlying incidence and prevalence of the epidemic. This approach uses estimates of the rate of coalescence through time to infer underlying viral population size and then uses assumptions of serial interval and heterogeneity of transmission to provide estimates of incidence and prevalence. We estimate an exponential doubling time of 7.2 (95% CI 5.0-12.9) days. We arrive at a median estimate of the total cumulative number of worldwide infections as of Feb 8, 2020, of 55,800 with a 95% uncertainty interval of 17,500 to 194,400. Importantly, this approach uses genome data from local and international cases and does not rely on case reporting within China.

…. We began by running the Nextstrain nCov pipeline to align sequences and mask spurious SNPs. We took the output file masked.fasta as the starting point for this analysis. We loaded this alignment into BEAST and specified an evolutionary model to estimate:

* strict molecular clock (CTMC rate reference prior)
* exponential growth rate (Laplace prior with scale 100)
* effective population size at time of most recent sampled tip (uniform prior between 0 and 10)

We followed Andrew in using a gamma distributed HKY nucleotide substitution model. MCMC was run for 50M steps, discarding the first 10M as burnin and sampling every 30,000 steps after this to give a dataset of 1335 MCMC samples.

The file ncov.xml contains the entire BEAST model specification. To run it will require filling in sequence data; we are not allowed to reshare this data according to GISAID Terms and Conditions. The Mathematica notebook ncov-phylodynamics.nb contains code to analyze resulting BEAST output in ncov.log and plot figures.
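The headline numbers in the abstract are easy to sanity-check with a bit of arithmetic (this back-of-envelope extrapolation is mine, not the paper's method; their prevalence estimate comes from coalescent-based effective population size, not from projecting a single index case forward):

```python
import math

doubling_time = 7.2                          # days (median estimate, 95% CI 5.0-12.9)
growth_rate = math.log(2) / doubling_time    # implied per-day exponential growth rate

# doublings needed to reach the median cumulative estimate of 55,800
# infections starting from a single index case
doublings = math.log2(55_800)
days_from_index_case = doublings * doubling_time
```

A 7.2-day doubling time implies a per-day growth rate of about 0.096, and reaching ~55,800 cumulative infections from one infection takes roughly 16 doublings, or a bit under four months of uninterrupted exponential growth.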

It’s been many years since I used BEAST, but it’s a complicated piece of software with a lot of options and parameters. I’m very curious about how robust the estimate is when considering sentences such as “assume that variance of secondary cases is at most like SARS with superspreading dynamics with k=0.15.” Bedford and his colleagues know 1,000 times more about this than I do, but I am really curious to see other groups looking at the data and running their own models.
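For a sense of what k=0.15 implies, here is a quick simulation of the standard gamma-Poisson (negative binomial) model of secondary cases (my own sketch; the R0 of 2.0 is an assumed value for illustration, not a number from the paper):

```python
import math
import random

def poisson(lam, rng):
    """Knuth's method; adequate for the modest rates drawn here."""
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    n, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return n
        n += 1

def secondary_cases(r0, k, n, seed=7):
    """Draw secondary-case counts for n infectors: each individual's
    reproductive number is gamma-distributed (shape k, mean r0), and
    actual transmissions are Poisson around it -- i.e., negative binomial
    with mean r0 and dispersion k."""
    rng = random.Random(seed)
    return [poisson(rng.gammavariate(k, r0 / k), rng) for _ in range(n)]

counts = secondary_cases(r0=2.0, k=0.15, n=20_000)
frac_zero = sum(c == 0 for c in counts) / len(counts)
top20_share = sum(sorted(counts)[-len(counts) // 5:]) / sum(counts)
```

With these parameters roughly two-thirds of infectors transmit to no one at all (the analytic value is (1 + R0/k)^-k ≈ 0.67), while the top 20% of infectors account for the great majority of all transmission. That is what "superspreading dynamics" means operationally, and why the assumed variance of secondary cases feeds directly into the prevalence estimate.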

If all of the results fall within the same order of magnitude as the above, I think some of us really have to update our priors about how much misreporting the Chinese are engaging in…

Update: Lots of sequences here. I may try to brush up on my BEAST skills…

This is a trial run (I hope!)

Early Pleistocene enamel proteome sequences from Dmanisi resolve Stephanorhinus phylogeny:

Ancient DNA (aDNA) sequencing has enabled unprecedented reconstruction of speciation, migration, and admixture events for extinct taxa. Outside the permafrost, however, irreversible aDNA post-mortem degradation has so far limited aDNA recovery within the ~0.5 million years (Ma) time range. Tandem mass spectrometry (MS)-based collagen type I (COL1) sequencing provides direct access to older genetic information, though with limited phylogenetic use. In the absence of molecular evidence, the speciation of several Early and Middle Pleistocene extinct species remains contentious. In this study, we address the phylogenetic relationships of the Eurasian Pleistocene Rhinocerotidae using ~1.77 million years (Ma) old dental enamel proteome sequences of a Stephanorhinus specimen from the Dmanisi archaeological site in Georgia (South Caucasus). Molecular phylogenetic analyses place the Dmanisi Stephanorhinus as a sister group to the woolly (Coelodonta antiquitatis) and Merck’s rhinoceros (S. kirchbergensis) clade. We show that Coelodonta evolved from an early Stephanorhinus lineage and that this genus includes at least two distinct evolutionary lines. As such, the genus Stephanorhinus is currently paraphyletic and its systematic revision is therefore needed. We demonstrate that Early Pleistocene dental enamel proteome sequencing overcomes the limits of ancient collagen- and aDNA-based phylogenetic inference, and also provides additional information about the sex and the taxonomic assignment of the specimens analysed. Dental enamel, the hardest tissue in vertebrates, is highly abundant in the fossil record. Our findings reveal that palaeoproteomic investigation of this material can push biomolecular investigation further back into the Early Pleistocene.

Dmanisi. If that doesn’t mean something, look it up!

General laws of macroevolution from phylogenetics


I’ve been following the Evolution 2018 Meeting in Montpellier on Twitter. A lot of the stuff is interesting, though over my head. In biology, I began with a fascination with natural history, what we might term macroevolution today. I was that kid carrying around The Dinosaur Heresies when I was nine. But aside from the specifics, broad patterns of diversification and extinction over geological time periods were clear. Over the years I wended my way through biochemistry, molecular evolution, and finally population genetics. My professional interests, therefore, have generally focused on patterns of variation on a microevolutionary scale. Stuff within species, not across species.

But I’m still quite interested in big picture evolutionary processes. I’m not a fan of Stephen Jay Gould, but I did read the highly repetitive and prolix The Structure of Evolutionary Theory. And, I have a decent course background in phylogenetics because of where I studied (well, at least in Bayesian and ML computational methods and the big picture theory). So I followed very closely the reports of Luke Harmon’s Presidential address at the meeting this year.

After the talk came the preprint, Macroevolutionary diversification rates show time-dependency:

For centuries, biologists have been captivated by the vast disparity in species richness between different groups of organisms. Variation in diversity is widely attributed to differences between groups in how fast they speciate or go extinct. Such macroevolutionary rates have been estimated for thousands of groups and have been correlated with an incredible variety of organismal traits. Here we analyze a large collection of phylogenetic trees and fossil time series and report a hidden generality amongst these seemingly idiosyncratic results: speciation and extinction rates follow a scaling law where both depend strongly on the age of the group in which they are measured. This time-scaling has profound implications for the interpretation of rate estimates and suggests there might be general laws governing macroevolutionary dynamics.

The primary text is pretty lucid. The major figure is at the top of the post. The authors tried to check if the pattern that they saw was a statistical artifact, and they don’t think it is. Rather, they believe that the pattern is a reflection of some genuine material or dynamic processes latent in the origin and extinction of species. They conclude that “This scenario, consistent [with] our results, would imply that the scaling of rates of sedimentation, phenotypic divergence, molecular evolution, and diversification with time all might share a common cause.” More concretely, earlier in the paper they note that the K-T boundary resulted in an extinction and speciation event (for dinosaurs and mammals respectively). But these massive catastrophic shocks don’t seem to happen regularly as much as randomly.
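One toy mechanism that produces exactly this kind of time-scaling (my own illustration, with made-up parameters, and not necessarily the authors' preferred explanation): if clades tend to begin with a rapid radiation and then diversify at a slower background rate, the naive estimator ln(N)/t falls off hyperbolically with clade age and only converges to the background rate for very old clades.

```python
import math

def apparent_rate(age, burst_size=10, background_rate=0.05):
    """Naive diversification rate ln(N)/t for a clade that radiated to
    `burst_size` species early on and then grew slowly thereafter:
    N(t) = burst_size * exp(background_rate * t)."""
    n = burst_size * math.exp(background_rate * age)
    return math.log(n) / age

# apparent rate measured for young, middle-aged, and old clades
rates = {t: apparent_rate(t) for t in (1, 10, 100)}
```

The same underlying process yields an apparent rate that is tens of times higher for a clade measured over one time unit than for one measured over a hundred, which is the shape of the empirical pattern the preprint reports.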

There’s a lot to chew on in the preprint. I can’t judge the technical details, which are in the supplements. For example, I have heard of BAMM before and know some of its general principles because of my coursework, but I’ve never done this sort of analysis extensively, and so I’ve never developed a good intuition about what passes the smell test and what doesn’t. But it strikes me that this field of phylogenetics is nevertheless very accessible to those who are non-scientists but genuinely interested in evolutionary biology. I recommend you read the preprint closely if you have an interest in this field.

Addendum: Note that I still think that evolution is scale independent on a deep level. Should I change my mind based on this? I don’t see why.

Carving nature at its joints more realistically

If you are working on phylogenetic questions on a coarse evolutionary scale (that is, “macroevolutionary,” though I know some evolutionary geneticists will shoot me the evil eye for using that word), generating a tree of relationships is quite informative and relatively straightforward, since it has a comprehensible mapping onto what really occurred in nature. When your samples are different enough that the biological species concept works well and gene flow doesn’t occur between nodes, then a tree is a tree (one reason Y and mtDNA results are so easy to communicate to the general public in personal genomics).

Everything becomes more problematic when you are working on a finer phylogenetic scale (or in taxa where inter-species gene flow is common, as is often the case with plants). And I’m using problematic here in the way that denotes a genuine substantive analytic issue, as opposed to connoting something that one has moral or ethical objections to.

It is intuitively clear that there is often genetic population structure within species, but how to summarize and represent that variation is not a straightforward task.

In 2000 the paper Inference of Population Structure Using Multilocus Genotype Data in Genetics introduced the sort of model-based clustering most famously implemented in Structure. The paper illustrates the limitations of the neighbor-joining tree methods that were in vogue at the time, and contrasts them with a method that defines a finite set of populations and assigns proportions of each putative group to various individuals.
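The simplest piece of this family of models is easy to write down (a minimal sketch of my own: assignment between two predefined populations with known allele frequencies; the real Structure model jointly infers the frequencies and admixture proportions via MCMC). Under Hardy-Weinberg equilibrium within each population, an individual's genotypes are binomial draws from that population's allele frequencies, and Bayes' rule gives the assignment posterior:

```python
import math

def assignment_posterior(genotype, freqs_a, freqs_b):
    """Posterior probability that a diploid individual derives from
    population A rather than B, given per-locus alternate-allele counts
    (0, 1, or 2), per-locus allele frequencies in each population, a
    flat prior, and Hardy-Weinberg equilibrium within populations."""
    def loglik(freqs):
        ll = 0.0
        for g, p in zip(genotype, freqs):
            comb = {0: 1, 1: 2, 2: 1}[g]  # binomial coefficient for 2 draws
            ll += math.log(comb * p**g * (1 - p) ** (2 - g))
        return ll
    la, lb = loglik(freqs_a), loglik(freqs_b)
    m = max(la, lb)  # subtract the max before exponentiating, for stability
    wa, wb = math.exp(la - m), math.exp(lb - m)
    return wa / (wa + wb)
```

With even ten loci of strongly differentiated frequencies, an individual drawn from one population is assigned to it with near certainty, which is why these methods produce such crisp output.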

The model-based methods were implemented in numerous packages over the 2000s, and today they’re pretty standard parts of the phylogenetic and population genetic toolkits. The reason for their popularity is obvious: they quite often give clear and unambiguous results. This may be one reason they emerged to complement visualization methods with fewer a priori assumptions, like PCA and MDS.

But of course, crisp clarity is not always reality. Sometimes nature is fuzzy and messy. The model-based methods take inputs and will produce crisp results, even if those results are not biologically realistic. They can’t be utilized in a robotic manner without attention to the assumptions and limitations (see A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots).

This is why it is exciting to see a new preprint which addresses many of these issues, Inferring Continuous and Discrete Population Genetic Structure Across Space*:

A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure….

The whole preprint should be read for anyone interested in phylogenomic inference, as there is extensive discussion and attention to many problems and missteps that occur when researchers attempt to analyze variation and relationships across a species’ range. Basically, the sort of thing that might be mentioned in peer review feedback, but isn’t likely to be included in any final write-ups.

As noted in the abstract, the major issue being addressed here is that many clustering methods do not include within their model the reality that genetic variation within a species may be present due to continuous gene flow governed by isolation-by-distance dynamics. This goes back to the old “clines vs. clusters” debates. Many of the model-based methods assume pulse admixtures between randomly mating population clusters. This is not a terrible assumption when you consider what occurred in the New World when Europeans came into contact with the native populations and introduced Africans. But it is not so realistic when it comes to the North European plain, which seems to have become genetically differentiated only within the last ~5,000 years, and has likely seen extensive gene flow since.
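A toy example (mine, with arbitrary numbers) of how discontinuous sampling turns a cline into apparent clusters: if allele frequencies change smoothly along a transect with no barrier anywhere, sampling only the two endpoints still yields a large frequency differential, and hence a high F_ST that a discrete clustering model will happily attribute to two populations.

```python
def cline_freq(x):
    """Allele frequency along a 0-to-1 transect: a smooth cline, no barriers."""
    return 0.1 + 0.8 * x

def fst(p1, p2):
    """Wright's F_ST for two demes from their allele frequencies:
    variance of p across demes over pbar * (1 - pbar)."""
    pbar = (p1 + p2) / 2
    if pbar in (0.0, 1.0):
        return 0.0
    var = ((p1 - pbar) ** 2 + (p2 - pbar) ** 2) / 2
    return var / (pbar * (1 - pbar))

# sampling only the endpoints of the transect vs. two nearby points
fst_endpoints = fst(cline_freq(0.0), cline_freq(1.0))
fst_neighbors = fst(cline_freq(0.5), cline_freq(0.6))
```

Nothing discrete generated the large endpoint-to-endpoint differentiation; it is pure isolation by distance, which is exactly the failure mode the preprint's joint continuous-plus-discrete framework is designed to catch.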

The figure below shows the results from the conStruct method (left), and the more traditional fastStructure (right):

There are limitations to the spatial model they use (e.g., ring species), but that’s true of any model. The key is that it’s a good first step to account for continuous gene flow, and not shoehorning all variation into pulse admixtures.

Though in beta, the R package is already available on GitHub (easy enough to download and install). I’ll probably have more comments when I test drive it myself….

* I am friendly with the authors of this paper, so I am also aware of their long-held concerns about the limitations and/or abuses of some phylogenetic methods. These concerns are broadly shared within the field.