Introduction
Genomic epidemiology uses pathogen genomes to study the spread of infectious diseases through populations1. This approach has become increasingly popular as the cost of genomic sequencing has fallen and computational power has grown. During the COVID-19 pandemic, an increasing number of countries began generating genomic data to inform public health responses2. The Global Initiative on Sharing All Influenza Data (GISAID)3 expanded to accommodate these data and now maintains the world’s largest database of SARS-CoV-2 sequences. As of December 2023, over 16 million sequences, sampled from over 200 countries/regions, have been submitted and archived. Such a vast and diverse dataset enables researchers and public health officials to identify key mutations4,5 and track the emergence of variants of interest (VOIs) or variants of concern (VOCs). Additionally, this wealth of genomic information creates opportunities to uncover hidden characteristics of local-scale outbreaks, such as the spatial dispersal of transmission and the demographic characteristics contributing to transmission patterns. However, effectively handling the complexity of the SARS-CoV-2 genomic dataset requires addressing key challenges, such as establishing robust sampling frameworks to draw reliable conclusions and developing efficient computational algorithms/pipelines.
In genomic epidemiology, analyzing sampling biases and developing an appropriate sampling strategy are crucial steps6. Recent studies have shown that differences in epidemiology and sampling can impact our ability to identify genomic clusters7. Sampling biases can also impact phylogeographic analyses. When investigating diffusion in discrete space, if a specific area is overrepresented in the dataset, that area may also be overrepresented at the inferred internal nodes1. Similarly, when investigating diffusion in continuous space, extreme sampling bias might cause the posterior distribution to exclude the true location of the root8.
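As an illustration of the overrepresentation problem described above, one simple mitigation is to cap the number of sequences retained per location before a phylogeographic analysis. The following is a minimal sketch under stated assumptions, not the sampling strategy used in this study: the function name `subsample_by_location`, the `(sequence_id, location)` input format, and the `max_per_location` parameter are all hypothetical and introduced only for illustration.

```python
import random
from collections import defaultdict


def subsample_by_location(records, max_per_location, seed=0):
    """Cap the number of sequences kept for each location.

    records: iterable of (sequence_id, location) pairs (hypothetical format).
    max_per_location: maximum number of sequences retained per location.
    seed: seed for the random generator, so the subsample is reproducible.
    """
    rng = random.Random(seed)

    # Group sequence identifiers by their sampling location.
    by_location = defaultdict(list)
    for seq_id, location in records:
        by_location[location].append(seq_id)

    kept = []
    # Iterate in sorted order so the result does not depend on input order
    # of the locations.
    for location in sorted(by_location):
        ids = by_location[location]
        # Randomly downsample only locations that exceed the cap;
        # sparsely sampled locations are kept in full.
        if len(ids) > max_per_location:
            ids = rng.sample(ids, max_per_location)
        kept.extend((seq_id, location) for seq_id in ids)
    return kept


# Hypothetical example: location "X" is heavily oversampled relative to "Y".
records = [(f"x{i}", "X") for i in range(100)] + [(f"y{i}", "Y") for i in range(5)]
balanced = subsample_by_location(records, max_per_location=10)
```

A fixed cap only limits the most heavily sampled areas. It does not correct for locations that are undersampled relative to their true case counts, so a real analysis would likely combine it with epidemiological data when choosing the cap.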
Viral transmission happens at different spatial scales, encompassing international pandemics, domestic dispersal, and local outbreaks such as those in jails, nursing homes, hospitals, or schools. By mapping how pathogens spread through space and time, evidence-based interventions can be better developed and applied across various scales9. The well-established software package, Bayesian Evolutionary Analysis Sampling Trees (BEAST)10, implements discrete11 and continuous12 phylogeographic models. Previous studies have used the discrete model to identify the transmission clusters...