We report on the sequencing of 10,545 human genomes at 30-40x coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single nucleotide variants in the coding and non-coding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries in average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
The 30x means that they’re hitting each base an average of 30 times, so they can be very confident of their calls. This matters a lot for rare variants, which might be useful when it comes to idiopathic diseases. The 10,000 number is obviously meant to take it a step beyond the “1,000” genomes, which went well above 1,000 genomes in any case. But the coverage means that these are very confident calls for any given individual.
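To get a feel for why depth buys confidence, here is a minimal sketch that treats per-base read depth as Poisson-distributed around the mean coverage. That is an idealization I'm assuming for illustration: real depth is more variable than Poisson because of GC content and mappability, and the threshold of 10 reads is a hypothetical cutoff, not one taken from the paper.

```python
import math


def poisson_cdf(k, mean):
    """Return P(X <= k) for X ~ Poisson(mean)."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def prob_depth_below(min_depth, mean_depth):
    """Probability that a base is covered by fewer than min_depth reads.

    Assumes idealized Poisson-distributed depth (an assumption, see above).
    """
    if min_depth <= 0:
        return 0.0
    return poisson_cdf(min_depth - 1, mean_depth)


if __name__ == "__main__":
    MIN_DEPTH = 10  # hypothetical minimum depth for a confident call
    for mean_depth in (5, 10, 30, 40):
        p = prob_depth_below(MIN_DEPTH, mean_depth)
        print(f"mean {mean_depth:>2}x: P(depth < {MIN_DEPTH}) = {p:.2e}")
```

Under this simplified model, a base that falls short of the cutoff is common at low coverage and becomes very rare at 30x, which is the intuition behind trusting individual calls at this depth.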
The distribution of variants shows that their panel of unrelated individuals (~8,000) yields ~150,000,000 single nucleotide variants (out of a genome of ~3,000,000,000 bases). Half of these 150 million are found at a count of one across the whole sample set. In contrast, there are ~5 million variants present at allele frequencies of about 5% or more, a bit more than 10 million at 1% or more, and ~20 million at 0.1% or more. Remember that the 1000 Genomes paper reported that each individual in its data set has ~5 million variants in comparison to the human reference genome.
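The frequency thresholds are easier to picture as allele counts. The sketch below converts them using the rounded figures quoted above, assuming ~8,000 diploid individuals and therefore ~16,000 chromosomes at autosomal sites. The constant and function names are mine, chosen for illustration.

```python
N_INDIVIDUALS = 8_000             # approximate panel size quoted above
N_CHROMOSOMES = 2 * N_INDIVIDUALS # assumes diploid (autosomal) sites
TOTAL_SNVS = 150_000_000          # approximate catalog size
GENOME_BASES = 3_000_000_000      # approximate genome length


def allele_count_at(freq, n_chromosomes=N_CHROMOSOMES):
    """Number of allele copies corresponding to a given allele frequency."""
    return round(freq * n_chromosomes)


def singleton_frequency(n_chromosomes=N_CHROMOSOMES):
    """Allele frequency of a variant observed exactly once in the panel."""
    return 1 / n_chromosomes


if __name__ == "__main__":
    for freq in (0.05, 0.01, 0.001):
        print(f"{freq:.1%} frequency = {allele_count_at(freq)} copies")
    print(f"singleton frequency = {singleton_frequency():.4%}")
    print(f"variable sites in genome = {TOTAL_SNVS / GENOME_BASES:.0%}")
```

So a variant seen once sits far below even the 0.1% threshold, which is why half the catalog can be singletons while only a small slice of it is common.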
I reiterate these dull numbers to give people a sense of what it means to have 100,000 to 1 million marker SNP-chips in humans. It is true that without imputation these chips aren’t capturing a lot of functional variants (though they’re typically designed to target a lot of the most important disease markers in particular). But when it comes to capturing the shape of genetic variation they’re a very good sampling indeed. Consider, for example, the proportion and number of voters who are part of the sample for exit polls or pre-election surveys. For standard PCA or genotypic model based clustering (e.g., ADMIXTURE/STRUCTURE) anything more than 1 million markers adds little from what I’ve seen, and the 100,000 to 500,000 interval is sufficient for pretty much everything. And haplotype based methods that generally use phasing, like fineSTRUCTURE, seem to do fine in the ~250,000 marker range.
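To make the polling analogy concrete, here is a rough sketch of what fraction of the variation a chip samples, using the rounded counts from above. Comparing chips against the ~5 million common variants is my own simplification: chip markers are not a random draw from that set, so treat the percentages as orders of magnitude only.

```python
CATALOG_SNVS = 150_000_000  # approximate catalog size quoted above
COMMON_SNVS = 5_000_000     # approximate count at ~5% frequency or more


def sampling_fraction(n_markers, n_variants):
    """Fraction of a variant set covered by a chip with n_markers markers."""
    return n_markers / n_variants


if __name__ == "__main__":
    for n_markers in (100_000, 250_000, 500_000, 1_000_000):
        of_common = sampling_fraction(n_markers, COMMON_SNVS)
        of_catalog = sampling_fraction(n_markers, CATALOG_SNVS)
        print(f"{n_markers:>9,} markers: {of_common:.1%} of common variants, "
              f"{of_catalog:.3%} of the full catalog")
```

Even the smallest chip samples a far larger share of common variation than a typical poll samples of the electorate.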