A big debate on the internet is whether China is covering up the number of cases of Covid-19 in Hubei, and more specifically Wuhan. Right now JHU says that China has 82,000 confirmed cases, as opposed to 300,000 in the USA. Both are underestimates, but there are those who believe that the Chinese death toll is not 3,000, but in the millions! I think a more sober take is that they could be underreporting by an order of magnitude. That being said, many epidemiologists believe that China’s numbers are roughly correct. And certainly, some demographic patterns seem to be robust and holding up (e.g., the proportion of the elderly who die).
But there’s another way to estimate how many people were infected: look at the variation in the genome sequences of SARS-CoV-2 itself. The pattern of genetic variation in a virus that has undergone massive and rapid demographic expansion will differ from that in a virus whose population size has remained constant.
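To make that intuition concrete, here is a toy coalescent simulation (my own sketch, nothing to do with the Bedford analysis): it draws genealogies for a sample of 20 sequences under a constant-sized population and under one growing exponentially, and compares how deep the trees are and how much total branch length they carry (which is what the expected number of mutations scales with). All the parameter values and time units are purely illustrative.

```python
# Toy illustration: backward-in-time coalescent for n sampled lineages under
# (a) constant population size and (b) exponential growth. Under rapid growth
# the genealogy is compressed and star-like, which is the signal phylodynamic
# methods exploit. Parameter values and time units are arbitrary/illustrative.

import math
import random

def coalescent_times(n, N0, r, rng):
    """Coalescence times for n lineages; looking back t units the population
    size is N0 * exp(-r * t), so r = 0 is the constant-size case."""
    t, k, times = 0.0, n, []
    while k > 1:
        pair_rate = k * (k - 1) / 2.0
        e = rng.expovariate(1.0)  # unit-rate exponential draw
        if r == 0:
            wait = e * N0 / pair_rate
        else:
            # time-rescaling: solve the integrated coalescent rate = e for the wait
            wait = math.log(1.0 + r * N0 * e * math.exp(-r * t) / pair_rate) / r
        t += wait
        times.append(t)
        k -= 1
    return times

def summarize(label, n, N0, r, reps=2000, seed=1):
    rng = random.Random(seed)
    tmrca_sum, length_sum = 0.0, 0.0
    for _ in range(reps):
        times = coalescent_times(n, N0, r, rng)
        tmrca_sum += times[-1]
        # total branch length = sum over intervals of (#lineages * interval length)
        prev, k, tot = 0.0, n, 0.0
        for t in times:
            tot += k * (t - prev)
            prev, k = t, k - 1
        length_sum += tot
    print(f"{label}: mean TMRCA = {tmrca_sum/reps:.1f}, "
          f"mean total branch length = {length_sum/reps:.1f}")

summarize("constant size (N0=1000)", n=20, N0=1000.0, r=0.0)
summarize("exponential growth (r=0.1 per time unit)", n=20, N0=1000.0, r=0.1)
```

Under growth the tree is far shallower and the coalescences pile up near the root, so the same amount of sequence data carries information about how fast the epidemic has been expanding.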
From what I can see, Trevor Bedford and his group at UW have produced the best and most thorough estimate of the number infected from the SARS-CoV-2 genomes, Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time.
Here is a part of the abstract and methods:
Here, we use a phylodynamic approach incorporating 53 publicly available novel coronavirus (nCoV) genomes to estimate the underlying incidence and prevalence of the epidemic. This approach uses estimates of the rate of coalescence through time to infer underlying viral population size and then uses assumptions of serial interval and heterogeneity of transmission to provide estimates of incidence and prevalence. We estimate an exponential doubling time of 7.2 (95% CI 5.0-12.9) days. We arrive at a median estimate of the total cumulative number of worldwide infections as of Feb 8, 2020, of 55,800 with a 95% uncertainty interval of 17,500 to 194,400. Importantly, this approach uses genome data from local and international cases and does not rely on case reporting within China.
…
…. We began by running the Nextstrain nCov pipeline to align sequences and mask spurious SNPs. We took the output file masked.fasta as the starting point for this analysis. We loaded this alignment into BEAST and specified an evolutionary model to estimate:
* strict molecular clock (CTMC rate reference prior)
* exponential growth rate (Laplace prior with scale 100)
* effective population size at time of most recent sampled tip (uniform prior between 0 and 10)

We followed Andrew in using a gamma distributed HKY nucleotide substitution model. MCMC was run for 50M steps, discarding the first 10M as burnin and sampling every 30,000 steps after this to give a dataset of 1335 MCMC samples.
The file ncov.xml contains the entire BEAST model specification. To run it will require filling in sequence data; we are not allowed to reshare this data according to GISAID Terms and Conditions. The Mathematica notebook ncov-phylodynamics.nb contains code to analyze resulting BEAST output in ncov.log and plot figures.
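For what it’s worth, here is a minimal Python sketch of the sort of post-processing their Mathematica notebook presumably does on the BEAST trace: drop the burn-in, then summarize the exponential growth rate and convert it to a doubling time. The column names and the per-year rate convention are my guesses about the log’s layout, not something I have checked against their XML, so adjust to match the actual file.

```python
# Minimal sketch of summarizing a BEAST trace like ncov.log (not Bedford's
# notebook). Column names and the per-year rate units are assumptions.

import pandas as pd
import numpy as np

LOG_FILE = "ncov.log"   # tab-delimited BEAST trace; '#' lines are comments
BURNIN_FRACTION = 0.2   # the post discards the first 10M of 50M steps;
                        # set to 0 if the log already excludes burn-in

trace = pd.read_csv(LOG_FILE, sep="\t", comment="#")
trace = trace.iloc[int(len(trace) * BURNIN_FRACTION):]   # drop burn-in samples

growth = trace["exponential.growthRate"]        # assumed column name, per year
doubling_days = np.log(2) / growth * 365.0      # growth rate -> doubling time in days

def summarize(name, series):
    lo, med, hi = np.percentile(series, [2.5, 50, 97.5])
    print(f"{name}: median {med:.2f} (95% interval {lo:.2f}-{hi:.2f})")

summarize("growth rate (/year)", growth)
summarize("doubling time (days)", doubling_days)
summarize("effective population size (Ne*tau)", trace["exponential.popSize"])
```

Note the intervals here are plain quantile-based credible intervals, not the HPD intervals BEAST tools usually report; for a sanity check that is close enough.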
It’s been many years since I used BEAST, but it’s a complicated piece of software with a lot of options and parameters. I’m very curious about how robust the estimate is when considering sentences such as “assume that variance of secondary cases is at most like SARS with superspreading dynamics with k=0.15.” Bedford and his colleagues know 1,000 times more about this than I do, but I would really like to see other groups look at the data and run their own models.
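To get a feel for what that assumption means, here is a quick back-of-envelope (mine, not theirs): if secondary cases are drawn from a negative binomial with mean R0 and dispersion k, a SARS-like k = 0.15 implies the variance of secondary cases is an order of magnitude larger than the mean, and most infected people transmit to no one. Roughly speaking, the more heterogeneous transmission is, the faster lineages coalesce for a given prevalence, so the assumed k feeds directly into how many infections a given amount of genetic diversity implies. R0 = 2 below is an assumption purely for illustration.

```python
# Back-of-envelope (not the paper's calculation): what "superspreading with
# k = 0.15" means if secondary cases are negative binomial with mean R0 and
# dispersion k. Smaller k = more heterogeneity. R0 = 2 is illustrative only.

R0 = 2.0
for k in (0.15, 0.5, 1.0, 10.0):
    variance = R0 * (1 + R0 / k)        # variance of secondary case counts
    p_zero = (k / (k + R0)) ** k        # NB probability of zero secondary cases
    print(f"k = {k:>5}: variance = {variance:5.1f} "
          f"(variance/mean = {variance / R0:4.1f}), "
          f"P(no secondary cases) = {p_zero:.2f}")
```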
If all of the results fall in the same order of magnitude as the above, I think some of us really have to update our priors about how much misreporting the Chinese are engaging in…
Update: Lots of sequences here. I may try to brush up on my BEAST skills…