This is an important preprint, "Legacy Data Confounds Genomics Studies":
Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus contaminates modern studies in hidden ways, it may be time to remove or upgrade such legacy sequencing data from reference databases.
Although hundreds of thousands of high-quality whole genomes have now been sequenced, the 1000 Genomes Project still looms very large as a diverse reference set. As the preprint notes, the sequencing was done a while back, so some of the data is of relatively low coverage, which means the variant calls may not be confident. I think many of us have seen publications over the years that left us scratching our heads a little, so this preprint really comes at the right time.