Abstract

The bacterial sequence data publicly available in the global DNA archives is a vast source of information on the evolution of bacteria and their mobile elements. However, most of it is either unassembled or inconsistently assembled and quality-controlled, which makes it unsuitable for large-scale analyses and inaccessible to most researchers. In 2021, Blackwell et al. therefore released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole-genome-sequenced bacterial isolate data as of November 2018, along with various search indexes. In this study we extend that dataset up to August 2024, more than tripling the number of genomes. We also expand the scope, beginning a global collaborative project to generate annotations for different species as desired by different research communities. Here we describe the project as of release 2024-08, which comprises 2,440,377 assemblies (including the 661k dataset). All 2.4 million have been uniformly reprocessed to assess quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. We also provide antimicrobial resistance (AMR) gene and mutation annotation via AMRFinderPlus. Using an evolution-informed compression approach, the full set of genomes is just 130Gb in batched xz archives. We also provide multiple search indexes and a method for alignment to the full dataset. Finally, we outline plans for future annotations to be provided in further releases.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

* The paper itself has not changed. The abstract shown on bioRxiv was not updated with the last version, so this revision exists purely to update the abstract shown at https://www.biorxiv.org/content/10.1101/2024.03.08.584059v2

* https://osf.io/xv7q9/

* https://allthebacteria.readthedocs.io/en/latest/overview.html#current-status

Details

Title
AllTheBacteria - all bacterial genomes assembled, available and searchable
Author
Hunt, Martin; Lima, Leandro; Anderson, Daniel; Hawkey, Jane; Shen, Wei; Lees, John; Iqbal, Zamin
University/institution
Cold Spring Harbor Laboratory Press
Section
New Results
Publication year
2024
Publication date
Nov 15, 2024
Publisher
Cold Spring Harbor Laboratory Press
ISSN
2692-8205
Source type
Working Paper
Language of publication
English
ProQuest document ID
3128428102
Copyright
© 2024. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.