Early detection of emerging SARS-CoV-2 Variants

Introduction

Public health testing and the sequencing of SARS-CoV-2 genomes have been pivotal in advancing the development of precise vaccine therapeutics and facilitating comprehensive surveillance of circulating variants^1,2. However, clinical surveillance of COVID-19 transmission often imposes substantial demands on laboratory resources and relies on individuals actively seeking testing^3,4. These challenges underscore the need for complementary, proactive, and cost-effective methods to monitor the emergence and spread of novel SARS-CoV-2 variants.

In the United States, there has been substantial spatial and temporal variation reported for the dynamics of the COVID-19 pandemic in urban and rural counties^5,6. Studies have described disease incidence being initially high in urban locations, followed by a rapid surge in infections from rural areas^5,6. Notably, the overall disease incidence is reported to be lower in rural compared to urban regions^5,7, especially during the early days of the pandemic⁸. Given that rural communities have fewer healthcare resources and an established reluctance to seek medical care, an overarching infection prevention and control effort could benefit significantly from new interventions designed to track disease incidence.

Wastewater-based epidemiology (WBE) has emerged as a valuable alternative for tracking changes in SARS-CoV-2 viral levels and variants within a community^{9–13}. SARS-CoV-2 is a single-stranded RNA virus that can be shed into wastewater through human waste such as feces, saliva, and urine^14,15. In comparison to clinical testing, wastewater analyses provide a less-biased approach to viral surveillance, particularly in areas with limited healthcare resources and unclear testing hesitancy rates^{16–20}. Throughout the COVID-19 pandemic, WBE served as a pivotal and cost-effective tool to monitor and characterize the emergence and spread of SARS-CoV-2 variants of concern (VoCs), offering early detection of potential outbreaks^{9,19,21–24}.

The complexity of wastewater matrices presents significant challenges in obtaining high-quality nucleic acid sequences and detecting SARS-CoV-2 variants. To overcome this shortfall, targeted hybridization and amplicon-based sequencing methods have been implemented to characterize the viral composition within a sample^{25–32}. Complementing these sequencing approaches, bioinformatic pipelines have been developed with unique computational considerations that support the identification and quantification of SARS-CoV-2 variants^{19,22,33–37}. For example, the COJAC pipeline³⁴ utilizes a variant-specific mutation pattern and counts aligned read pairs to detect the presence of a VoC in wastewater samples, even with a relatively low viral load. In addition, the V-pipe³⁸, LoFreq³⁹, or iVar⁴⁰ pipelines can be used to call single nucleotide polymorphisms (SNPs) and their alternative allele frequencies within the SARS-CoV-2 genome. To determine the presence and abundance of VoCs, these pooled SNP alternative allele frequencies are generally modeled as a linear combination of predefined VoCs using GISAID or UShER-curated reference barcodes^{21,22,35–37}. Among the various linear regression models and optimization methods used to estimate the abundance of variants, a pipeline called Freyja³⁵ is frequently employed due to its simplicity in modeling and interpretation. However, despite the widespread use of these pipelines, they share a bias toward pre-defined reference barcodes, potentially leading to incorrect variant predictions, particularly in the presence of metadata errors. This bias can be exacerbated when either 1) the variants included in the reference barcodes do not match the VoCs currently circulating in the communities, or 2) a new VoC is circulating within the community but has not yet been identified through clinical sequencing. Furthermore, since wastewater samples represent a composite of multiple clinical genomes, there may be limited statistical power to detect emerging VoCs with low abundance in a single sample.
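The barcode-based modeling described above can be illustrated with a small sketch. It treats the observed alternative allele frequencies as a non-negative linear combination of binary reference barcodes and solves for the abundances with projected gradient descent. This is a simplified stand-in for illustration only: the function name `estimate_abundances`, the toy barcodes, and the least-squares objective are assumptions, and Freyja's own solver and constraints differ.

```python
import numpy as np

def estimate_abundances(barcodes, freqs, n_iter=5000):
    """Hypothetical helper: non-negative least-squares de-mixing.

    barcodes: (n_variants, n_sites) 0/1 matrix of lineage-defining mutations.
    freqs:    (n_sites,) observed alternative allele frequencies.
    Returns a non-negative abundance estimate per variant.
    """
    B = np.asarray(barcodes, dtype=float).T        # sites x variants
    f = np.asarray(freqs, dtype=float)
    x = np.full(B.shape[1], 1.0 / B.shape[1])      # uniform starting guess
    step = 1.0 / np.linalg.norm(B, 2) ** 2         # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = B.T @ (B @ x - f)                   # gradient of 0.5*||Bx - f||^2
        x = np.maximum(x - step * grad, 0.0)       # project onto x >= 0
    return x

# Toy example (made-up barcodes): three variants over six sites.
toy_barcodes = np.array([
    [1, 1, 1, 0, 0, 0],   # variant A
    [0, 0, 1, 1, 1, 0],   # variant B
    [0, 0, 0, 0, 1, 1],   # variant C (absent from the sample)
])
toy_freqs = 0.7 * toy_barcodes[0] + 0.3 * toy_barcodes[1]
toy_abundances = estimate_abundances(toy_barcodes, toy_freqs)
```

In this toy case the estimate recovers abundances close to 0.7, 0.3, and 0 for variants A, B, and C. A variant missing from the barcode matrix cannot be recovered at all, which is the reference bias discussed above.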

Here, we introduce a multivariate method designed to analyze SARS-CoV-2 wastewater sequencing data and identify circulating variants. We hypothesize that our independent component analysis (ICA)-based pipeline called ICA-Var (Independent Component Analysis of Variants) can leverage multiple sequencing datasets to amplify statistical power and thereby enable early and precise detection of variants within the community. To validate our approach, we compare the results obtained using our pipeline with those generated by the state-of-the-art tool Freyja³⁵. Our approach also identifies emerging co-varying mutation patterns, which may belong to more recent VoCs or have not been reported. Collectively, our findings demonstrate the effectiveness of this proposed pipeline, even in the absence of clinical data. These results underscore the potential for ICA-Var to identify mutation patterns within the SARS-CoV-2 genome that could give rise to novel circulating variants.

Results

Large-scale genome sequencing and the development of a computational pipeline

Analyzing wastewater SARS-CoV-2 genomes presents inherent complexities resulting from various factors. For example, the use of short sequencing reads introduces challenges in accurately phasing genomes and the degradation of viral genomes in environmental samples contributes to uneven genome coverage and sequencing read depth. To address these challenges, we sequenced 3659 wastewater samples using an amplicon-targeted approach and employed stringent quality control measures. After removing duplicated samples and positive/negative controls, 2684 samples remained for quality control (QC) analysis (Supplementary Fig. 1). Using a minimum threshold of 80% genome coverage at more than 50x sequencing depth, we selected 1385 of these samples, covering 59,422 locations/mutations on the genome, for further analysis (Supplementary Figs. 1 and 2). Leveraging this extensive dataset, we developed a data-driven approach named ICA-Var. This method transforms mutation frequencies in wastewater samples into independent sources with co-varying mutation patterns and utilizes a dual-regression method to re-associate the independent sources back to the original samples (Fig. 1A and Methods). We hypothesized that the ICA sources could effectively capture the evolving SARS-CoV-2 VoCs over time, with each being characterized by distinct dominant mutation patterns or sites (Methods). To evaluate the performance of our tool, we conducted a comparative analysis of variant detection against the state-of-the-art tool known as Freyja (Fig. 1A).

Fig. 1

ICA-Var pipeline and comparisons with Freyja.

A Proposed independent component analysis (ICA) pipeline. Two matrices are reported: weekly SARS-CoV-2 lineage detection (bottom row) and potential novel mutations (top row). B Hierarchical structure of 18 variants of concern (VoCs). Dominant (i.e., lineage-defining) mutation sites for each VoC were obtained from clinical data summarized at covspectrum.org, and the number of dominant mutations is listed in brackets. Criteria for calling a detection in the proposed pipeline are listed in shaded boxes. Abbreviations: $\rho$: Spearman’s correlation coefficient; FDR: false discovery rate. C Detection of emerging VoCs in wastewater from Southern Nevada from August 2021 to November 2023 by the proposed method (first reporting matrix in A) and the state-of-the-art tool Freyja. Earlier detections by the proposed method were observed for the emerging variants EG.5, HV.1, and BA.2.86 (red triangle boxes). The yellow triangle box indicates the week without wastewater sampling due to technical issues. Source data are provided as a Source Data file.

In late 2021, both Freyja and ICA-Var reliably identified B.1.617.2 (Delta) and BA.1 (Omicron) VoCs in wastewater samples (Fig. 1C, yellow in first two rows), reflecting the prevalence of both variants during this period. In 2022, ICA-Var detected the BA.2, BA.4, BA.5, BF.7, BQ.1, XBB.1, and XBB.1.5 variants one or several weeks before Freyja (Fig. 1C, green). Freyja and ICA-Var consistently detected Omicron variants throughout 2022 (Fig. 1C, yellow). In 2023, both Freyja and ICA-Var identified XBB.1.16 in late March (Fig. 1C, yellow), two weeks prior to the first sequenced clinical sample in Southern Nevada (Fig. 3C). Among the more recently emerging VoCs in 2023, ICA-Var detected EG.5 in early June (first red box in Fig. 1C). In contrast, Freyja reliably identified the EG.5 signal only once the VoC became more prevalent in early July. Similarly, for the HV.1 and BA.2.86 VoCs, ICA-Var detected the presence of these variants in wastewater several weeks before Freyja (Fig. 1C, red boxes).

To explore the earlier detection of the emerging VoCs EG.5, HV.1, and BA.2.86 by ICA-Var compared to Freyja, we generated heatmaps illustrating alternative allele frequencies (Fig. 2) or numbers of reads (Supplementary Fig. 3) at the dominant mutation sites for these variants. These heatmaps represent samples from the earliest week of detection by each method. Specifically, ICA-Var first identified EG.5 during the week of 06/05/2023, when two wastewater samples (Urban04-060523 and Urban08-060523) exhibited reliable mutation frequencies and reads at three out of eight EG.5 dominant mutation sites (Fig. 2 and Supplementary Fig. 3, top row in panel EG.5). Freyja reported abundances of 0.66% and 0.53% for EG.5 in these two samples, respectively. An additional wastewater sample from the same week (Rural03-060723) showed reliable mutation frequencies at two EG.5 dominant mutation sites, but Freyja did not identify EG.5 in this sample. As a multivariate method, ICA-Var leveraged all of these samples, each with a reliable yet relatively low prevalence of EG.5 mutation sites, thereby enhancing statistical power and enabling earlier detection of EG.5 in this week. Conversely, Freyja reported an EG.5 abundance of 23.08% in one wastewater sample (Urban04-070323) in the week of 07/03/2023; this sample exhibited reliable alternative allele frequencies at five of eight EG.5 dominant mutation sites (red box in panel EG.5, Fig. 2B). Both the higher number of EG.5 dominant mutation sites and their increased frequencies contributed to Freyja’s detection, and ICA-Var also detected EG.5 during this week. Similarly, for XBB.1 (Fig. 2A, XBB.1 panel), HV.1 (Fig. 2C, HV.1 panel), and BA.2.86 (Fig. 2D, BA.2.86 panel), ICA-Var achieved earlier detection by leveraging multiple samples with reliable but relatively low prevalence of dominant mutation sites, thus enhancing statistical power. In contrast, Freyja required at least one individual sample to exhibit the dominant mutation sites for detection (red boxes in Fig. 2).

Fig. 2

Detection of four variants using ICA-Var and Freyja.

A Earlier detection of XBB.1, B EG.5, C HV.1, and D BA.2.86 made by the proposed method, as compared to Freyja. Each panel shows alternative allele frequencies at dominant mutation sites, with top rows showing ICA-Var’s initial detection and bottom rows showing Freyja’s initial detection. The x-axis displays determinant mutations for each variant, while the y-axis lists individual samples, with frequencies indicated by color intensity. For Freyja detection plots, sample names preceded by a plus sign indicate Freyja abundance >15%. ICA-Var also detected these variants on Freyja’s initial detection dates (as shown in Fig. 1C). Only samples with non-zero alternative allele frequencies are displayed. Source data are provided as a Source Data file.

We further evaluated the earlier detection of VoCs in 2022 for ICA-Var and Freyja (Supplementary Fig. 4). In addition to enhancing statistical power, the inclusion of deletions as additional dominant mutation sites in the proposed pipeline (indicated by orange boxes in Supplementary Fig. 4) played a significant role in the earlier detections of BA.2, BA.4, and BA.5 variants compared to Freyja. This advantage arises from the fact that no deletions were utilized in the inference process within the default settings of Freyja, a result previously discussed in another computational pipeline designed to analyze wastewater sequencing data³⁶.

Detection of VoCs in urban and rural samples

From the beginning of 2022, we sequenced and analyzed wastewater samples from rural areas in Southern Nevada. The numbers of urban (orange curve) and rural (yellow curve) samples for each week are plotted in Supplementary Fig. 2D. We conducted a comprehensive urban-rural epidemiological comparison in our wastewater analyses (Methods), with samples categorized as urban or rural and analyzed separately for each week. We present a summary of the detection of 18 VoCs using both the established Freyja pipeline (Fig. 3A) and our proposed ICA-Var pipeline (Fig. 3B).

Fig. 3

Variant detection in urban and rural samples.

A, B Detection of the emerging variants of concern (VoCs) in urban and rural samples using (A) Freyja and (B) the ICA-Var pipeline. C Earliest date of variant detection in clinical cases from Southern Nevada; initial detection date with all wastewater samples (lightest gray), urban wastewater samples (lighter gray) and rural wastewater samples (darker gray) using Freyja and the proposed pipeline. A red asterisk (*) marks the earlier detection date when Freyja and the proposed method differ. If Freyja and ICA-Var captured the variant on the same earliest date, no asterisk is shown. Earlier detection dates in rural samples than urban samples are highlighted in bold. A † indicates an earlier clinical report than wastewater detection. Source data are provided as a Source Data file.

ICA-Var and Freyja both identified 16 out of the 18 VoCs in urban wastewater samples prior to detecting these VoCs in wastewater samples from rural locations (Fig. 3A, B). These results suggest that SARS-CoV-2 variants typically emerge in urban areas before spreading to rural regions. However, XBB.1 and FL.1.5.1 showed an unusual pattern, with Freyja initially detecting them in rural wastewater samples (black boxes in Fig. 3A, B). Notably, while Freyja initially identified XBB.1 in a rural sample during the week of 11/07/2022 (red box in Fig. 2, panel XBB.1), ICA-Var detected this variant one week earlier in urban samples (Fig. 2, panel XBB.1 and Fig. 3C). Both ICA-Var and Freyja initially detected FL.1.5.1 in rural samples in the week of 07/10/2023 (Fig. 3C). Closer inspection revealed that one rural sample collected on 07/12/2023 had an overwhelming presence of FL.1.5.1 dominant mutations (dashed red box in Supplementary Fig. 4, panel FL.1.5.1), which contributed to this earlier detection in rural areas. In contrast, urban samples demonstrated much lower alternative allele frequencies and prevalence at FL.1.5.1 dominant mutations.

Identification of mutation sites with significant time-evolving contributions

Of the 59,422 mutation sites included, a total of 730 demonstrated significant contributions during the multivariate group ICA, following the analysis pipeline in Fig. 1A (Supplementary Fig. 2B). Among them, a subset of 177 mutations showed a significant time-evolving contribution from August 2021 to November 2023 and were defined as significant contributing mutation sites (Methods). As a proof of concept, we cross-referenced these 177 mutations with dominant mutation sites in the B.1.617.2, BA.1, and XBB.1 variants, and plotted their weekly contributions in Fig. 4.

Fig. 4

Mutations with significant time-evolving contributions in the proposed method.

A Following the proposed method, 16 out of 25 determinant mutations in B.1.617.2 were identified as having significant time-evolving contributions to the group, with major contributions in 2021 and early 2022. B The 25 dominant mutations in BA.1 were identified as having significant time-evolving contributions to the group, with major contributions in early 2022. C We identified 22 out of 25 determinant mutations in XBB.1 as having significant time-evolving contributions to the group, with major contributions after 2022/09. Similar contributing patterns were observed for several determinant mutations (orange boxes). Source data are provided as a Source Data file.

Significant fluctuating contributions were observed in late 2021 for 16 out of 25 dominant mutation sites in B.1.617.2 (Fig. 4, panel B.1.617.2). These contributions gradually declined through 2022 and diminished further in 2023. For the BA.1 variant, there was a noticeable increase in contributions related to the associated mutations in late 2021, peaking in early 2022 (Fig. 4, panel BA.1, orange box). For several BA.1 mutation sites, time-evolving contributions continued to fluctuate in 2023, and their involvement in other Omicron sub-lineages (e.g., XBB.1) was reported at nextstrain.org. In addition, 22 out of 25 dominant mutations in XBB.1 displayed significant time-evolving contributions, with a substantial impact after September 2022. Similar fluctuation patterns were observed for several mutation sites (Fig. 4, panel XBB.1, orange box), indicating that these mutation sites co-vary together and demonstrate a recombinant nature for XBB.1. Collectively, our data (Fig. 4) demonstrate that time-evolving contributions for mutation sites identified by ICA-Var were consistent with the clinical emergence of Delta, Omicron, and XBB.1 variants. A sensitivity analysis confirms that these observations remain robust across different statistical thresholds (mean±1 SD to mean±3 SD) for determining significant contributing mutation sites (Methods), further validating the reliability of ICA-Var’s results. These results further solidify the foundation for the proposed pipeline, indicating its potential to identify novel mutation patterns that may lead to the emergence of new variants.
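The thresholding behind the sensitivity analysis can be sketched as follows. The sketch assumes the threshold is applied to a vector of per-site contribution weights and flags sites lying more than k standard deviations from the mean. The function name `significant_sites` and the toy weights are hypothetical, and the exact quantity thresholded in the study is defined in its Methods.

```python
import statistics

def significant_sites(weights, k=2.0):
    """Hypothetical helper: indices of sites whose weight lies outside mean ± k*SD."""
    mean = statistics.fmean(weights)
    sd = statistics.pstdev(weights)        # population SD of the weights
    return [i for i, w in enumerate(weights) if abs(w - mean) > k * sd]

# Toy weights: twenty background sites and one strongly contributing site.
toy_weights = [0.0] * 20 + [5.0]
selected = {k: significant_sites(toy_weights, k) for k in (1.0, 2.0, 3.0)}
```

In this toy case the same single site is selected at every threshold from 1 SD to 3 SD, which is the kind of stability a sensitivity analysis checks for.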

Discovery of potential novel variants

Upon cross-referencing with dominant mutation sites in 15 VoCs (18 VoCs in Fig. 1B, excluding emerging variants EG.5, HV.1, and BA.2.86), a set of 113 contributing mutation sites emerged as potential novel mutations. Using a hierarchical clustering algorithm with Ward distance, six clusters were obtained at a cut-off Ward distance of 18 (Fig. 5A, Methods). Among these, clusters 2, 3, 4, and 5 showed overlapping mutation sites with emerging variants in late 2023 (bottom table in Fig. 5A). Using cluster 3 as an example, we observed two sets of co-varying patterns after 06/2023 (dashed orange boxes in Fig. 5B), both of which overlapped with dominant mutations of EG.5 and HV.1. In contrast, clusters 1 and 6 shared no mutations with known mutation sites in emerging variants in late 2023 (bottom table in Fig. 5A). Co-varying patterns after 2023/08 were evident for the eight mutation sites in cluster 1 (Fig. 5C). We checked for the presence of these eight mutations in clinical sequencing data from GISAID and found that they had been infrequently reported in clinical samples (Supplementary Fig. 5). Hence, these mutations could potentially lead to the emergence of novel SARS-CoV-2 variants and warrant close monitoring, pending clinical testing.
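The clustering step can be sketched with SciPy's hierarchical clustering routines. The sketch assumes each mutation site is represented by a row of weekly contribution values, which is an assumption about the input. The function name `cluster_sites` and the toy data are hypothetical, and the toy cut-off of 5 is not the cut-off of 18 used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_sites(contributions, cutoff):
    """Hypothetical helper: Ward-linkage clustering cut at a fixed distance.

    contributions: (n_sites, n_weeks) array, one row per mutation site.
    cutoff:        Ward distance at which the dendrogram is cut.
    Returns one integer cluster label per site.
    """
    tree = linkage(contributions, method="ward")
    return fcluster(tree, t=cutoff, criterion="distance")

# Toy data: two groups of sites with clearly different weekly patterns.
toy_contributions = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0],
    [10.1, 10.0, 10.0, 10.0],
    [10.0, 10.1, 10.0, 10.0],
])
toy_labels = cluster_sites(toy_contributions, cutoff=5.0)
```

Sites that co-vary over time end up with the same label, so each cluster can then be compared against the dominant mutation sites of known variants.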

Fig. 5

Potential emerging future mutation patterns.

A Hierarchical clustering yields six clusters among the 113 potential future mutation sites. Clusters 2, 3, 4, and 5 have overlapping mutation sites with emerging variants EG.5, HV.1, and BA.2.86. Clusters 1 and 6 show no overlapping mutation sites with known variants, and therefore are more likely to give rise to novel lineages. B Co-varying patterns of mutation sites in cluster 3 show major fluctuating contributions after June 2023, consistent with EG.5 and HV.1 variants. C Co-varying patterns of mutation sites in cluster 1, with major fluctuating contributions after August 2023. Source data are provided as a Source Data file.

Minimum number of samples and QC metrics required to run ICA-Var

Through simulation, we determined the minimum requirements for running the ICA-Var pipeline. As a multivariate data-driven approach, ICA-Var transforms alternative allele frequencies from multiple sequencing datasets into independent sources with co-varying mutation patterns, requiring multiple sequencing datasets for analysis. These identified patterns serve as both data-driven reference barcodes and candidates for novel mutations. Our simulations (detailed in the Methods section) show that ICA-Var needs approximately eight “real” wastewater samples per variant (those with Freyja-calling abundance >95% for a specific variant) to generate reliable data-driven reference barcodes (Supplementary Fig. 6, part I). Importantly, our simulations with more heterogeneous “mixed” wastewater samples (those with Freyja-calling abundance > 55% for a specific variant) demonstrate that fewer than ten such samples per variant still generate reliable reference barcodes (Supplementary Fig. 6, part II). This robustness persists regardless of geographical or temporal distribution of the samples, and remains consistent across varying sequencing quality metrics (depth: 10x–50x; coverage: 40%–80%), highlighting ICA-Var’s adaptability to real-world surveillance conditions.

For variant detection, ICA-Var compares projected-back weekly sources with clinical sequencing data (Methods), making one detection call per variant per week based on all collected samples. Simulations demonstrate that ICA-Var can detect variants with one “real” sample (>95% Freyja-calling abundance) per week, matching the performance of the univariate method Freyja (Supplementary Fig. 7). This performance remains consistent across different quality control metrics. Compared to univariate methods, ICA-Var enables earlier detection when multiple samples show low but reliable variant signals (few dominant mutation sites with low mutation frequencies). Our simulations indicate that just two to five such samples are sufficient for early variant detection (Supplementary Fig. 8).

Discussion

Wastewater-based epidemiology (WBE) offers a unique opportunity to monitor the emergence and spread of SARS-CoV-2 variants at the population level. Our proposed pipeline demonstrates early detection of the SARS-CoV-2 VoCs in wastewater preceding identification in clinical data for most VoCs. We further show the spatial and temporal dynamics of most emerging SARS-CoV-2 VoCs transitioning from urban to rural areas. Leveraging the data-driven nature of our proposed pipeline, ICA-Var identifies modules of mutations in the SARS-CoV-2 genome that are consistent with parallel time-changing patterns, and consequently gave rise to VoCs from August 2021 to November 2023. The proposed method offers an opportunity to identify mutation sites that are occurring simultaneously and could lead to potential novel variants, even in the absence of clinical data.

Enhanced sensitivity and specificity in SARS-CoV-2 VoC detection

Wastewater samples are a composite of multiple clinical genomes spanning a local community at a given time point⁴¹. COVID-19 clinical testing and reports indicate that certain VoCs were prevalent at specific time points from August 2021 to November 2023^{42–44}. Our method, ICA-Var, enables the separation of multiple genomic signals into independent sources⁴⁵. Utilizing a significant number of wastewater samples spanning this timeframe, retrospective ICA can uncover the original mutation profiles, each representing an individual VoC or a set of similar VoCs spanning communities at different time points (Supplementary Fig. 9). Notably, ICA-Var can handle non-Gaussian and non-linearly mixed signals, operates without the need for prior knowledge, and performs blind source separation⁴⁵. These inherent properties make ICA robust in our real-world application of de-mixing wastewater samples, enabling the identification of clinically relevant VoCs.

Following group ICA, we performed a dual-regression analysis to re-associate the original sources with weekly samples, in order to investigate and characterize the ICA signals within each week. Previous studies have applied the dual-regression method in functional magnetic resonance imaging data analysis to associate group networks (i.e., sources) identified by ICA with individual brain maps^{46–49}. This approach projects group sources onto weekly samples (Supplementary Fig. 9), allowing VoC detection through comparison of projected weekly sources with clinical sequencing data (i.e., reference barcodes). Each variant receives one detection call per week based on all collected samples, enhancing signal specificity and improving mutation pattern localization. Notably, the earliest detection date of ICA-Var using combined urban and rural wastewater samples often precedes detection dates obtained from analyzing either sample subset alone (Fig. 3C). This apparent discrepancy arises directly from the multivariate nature of ICA-Var, which leverages collective signal strength across the entire sample set. By integrating subtle signals that individually fall below detection thresholds, the method achieves enhanced statistical power for early variant detection, a key advantage over analyses limited to isolated geographic subsets.
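The two regression stages can be sketched as a pair of least-squares projections, following the convention used in the neuroimaging literature. Stage 1 regresses the group sources against one week's samples to obtain week-specific mixing weights, and stage 2 regresses those weights against the same samples to obtain week-specific sources. The function name `dual_regression` and the toy matrices are hypothetical, and demeaning and normalization steps are omitted, so this is an illustration of the idea and not the study's implementation.

```python
import numpy as np

def dual_regression(Y_week, S_group):
    """Hypothetical helper: project group ICA sources onto one week's samples.

    Y_week:  (n_week_samples, n_sites) mutation frequencies for one week.
    S_group: (n_ica, n_sites) group-level ICA sources.
    Returns (A_week, S_week): week-specific mixing weights and sources.
    """
    # Stage 1: least-squares solution of Y_week ≈ A_week @ S_group for A_week.
    A_week = Y_week @ np.linalg.pinv(S_group)
    # Stage 2: least-squares solution of Y_week ≈ A_week @ S_week for S_week.
    S_week = np.linalg.pinv(A_week) @ Y_week
    return A_week, S_week

# Toy check: noiseless weekly data built from known sources and weights.
S_toy = np.array([[1.0, 0.0, 0.5, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.5, 1.0]])
A_toy = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 2.0]])
A_hat, S_hat = dual_regression(A_toy @ S_toy, S_toy)
```

In this noiseless toy case both stages recover the inputs exactly. With real data, the week-specific sources are the quantities that would be compared against reference barcodes to make one detection call per variant per week.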

The number of ICA components (n_ica) corresponds directly to the underlying independent sources in the sequencing datasets, primarily reflecting the circulating VoCs during the sample collection period. While this parameter requires optimization for each surveillance scenario, our approach provides robust guidelines for selection. Our simulations demonstrate that the MDL criterion accurately estimates the true number of variants across diverse QC conditions, with particularly strong correlations observed using higher quality samples (50x depth with 80% coverage). Importantly, when applied to real-world data, our sensitivity analysis confirms that early detection results remain stable for the majority of variants (8 out of 12 VoCs reported in Fig. 3C, including BF.1, BQ.1, XBB.1.5, XBB.1.9, XBB.2.3, EG.5, HV.1, and BA.2.86), even when varying n_ica substantially (from 10 to 55 in 5-component increments, Supplementary Table). This robustness to parameter selection further validates ICA-Var’s reliability for real-time surveillance applications.

Our simulation results show that, although this study used many samples with stringent QC metrics, ICA-Var needs only eight “real” samples per variant to generate reliable data-driven reference barcodes, even with relaxed QC standards (Supplementary Fig. 6). While increased sample numbers and higher QC metrics improve accuracy, the method remains reliable with reduced depth and coverage levels. For weekly variant detection, ICA-Var performs comparably to Freyja with high-prevalence variants, even under relaxed QC metrics (Supplementary Fig. 7). The method shows superior performance when analyzing multiple samples with low variant prevalence, demonstrating the advantage of its multivariate approach (Supplementary Fig. 8).

Collectively, compared to the state-of-the-art Freyja tool, which analyzes each individual wastewater sample with a univariate approach, the proposed pipeline boosts statistical power and enables earlier and more accurate detection of each VoC (Fig. 1 and Fig. 3). Incorporating deletion information in our proposed analyses also contributes to enhanced sensitivity and accuracy (Supplementary Fig. 4). Earlier detection of VoCs from wastewater data in turn enables public health authorities to implement timely and targeted interventions to mitigate the spread of the virus⁴¹.

ICA-Var does not require clinical data to identify a novel variant

Wastewater surveillance and clinical sequencing data of SARS-CoV-2 can provide a comprehensive understanding of the emergence, spread, and prevalence of a virus^34,50,51. A time-dependent and accurate reference barcode for circulating VoCs is often required to identify emerging variants^35,36. Therefore, methods such as Freyja and COJAC require a “correct” barcode of circulating VoCs for accurate detection and are potentially restricted from identifying or forecasting novel variants in the absence of clinical sequencing data.

As a data-driven approach, our proposed pipeline (Fig. 1) recognizes mutation sites within the SARS-CoV-2 genome through the identification of co-varying and time-evolving patterns in group sources. This crucial step allows us to identify contributing mutation sites for various VoCs in wastewater at different time points from August 2021 to November 2023, without any prior knowledge of circulating VoCs (Fig. 4). The proposed pipeline additionally identifies co-varying mutation patterns that are more recent and contribute to emerging group sources that have not been reported in clinical sequencing data (Fig. 5). Therefore, these mutation sites could potentially give rise to novel SARS-CoV-2 variants.

VoCs spread from urban to rural areas

Besides methodological developments, our study shows that each SARS-CoV-2 VoC is in general initially detected in wastewater samples from urban areas and later in wastewater samples from rural areas (Fig. 3). This observation is in concordance with a previous report on COVID-19 epidemic dynamics in the United States⁵. In addition to highlighting these dynamic patterns, our results underscore the feasibility of monitoring SARS-CoV-2 VoCs through wastewater samples obtained from rural areas. Given reports that residents in rural locations are at a higher risk for disease and often lack healthcare resources⁷, WBE provides a practical and effective way to monitor the disease emergence and estimate the disease spread and prevalence.

Limitations

As a data-driven method, ICA-Var requires a significant number of samples with high genome coverage and depth to produce stable results. Therefore, the proposed pipeline may not be suitable for scenarios with a limited number of wastewater samples or if sequencing metrics indicate genome coverage below 50% and low sequencing depth (i.e., less than 10 reads per sequenced base). Though ICA-Var could still produce results under these scenarios, user input will be required to carefully evaluate the output. Moreover, one assumption of ICA-Var is that sources are independent and linearly separable. In our application, the independence of underlying signals comes from different dominant mutation sites for various VoCs and different dominance of VoCs at different time points. Both conditions demand a relatively large number of wastewater samples to generate meaningful results. The multivariate nature of the proposed pipeline demonstrates superiority only when detecting the presence of VoCs across multiple wastewater samples; in our case, multiple samples from various wastewater sampling locations in Southern Nevada. Finally, the proposed pipeline cannot estimate the abundance of each VoC from the wastewater sample. This limitation arises from ICA-Var’s inability to distinguish between signal and noise in mixed data, and its treatment of each source with equal weight. While the former may not pose a significant concern in our application, given the bioinformatics processing pipeline’s retention of relevant SARS-CoV-2 signals, the latter impedes our ability to discern the abundance or significance of each identified source.

Methods

Wastewater sample collection, processing, and sequencing

A total of 3659 wastewater samples were collected from urban and rural locations in Southern Nevada from August 2021 to November 2023. After collection, samples were placed on ice in the field and stored under refrigeration until processing (hold time <36 h). Nucleic acids from wastewater samples were isolated using the Promega Wizard Enviro Total Nucleic Acid Kit (Cat #A2991) following the manufacturer’s protocol. In addition, we modified the Promega protocol by lysing wastewater with the protease solution and binding free nucleic acids using NucleoMag Beads from Macherey-Nagel (Cat #744970). Total RNA (>10 ng) was processed for first-strand cDNA synthesis using the LunaScript RT SuperMix Kit (New England BioLabs). Amplicon-based sequencing libraries were constructed using the CleanPlex SARS-CoV-2 FLEX Panel from Paragon Genomics. Libraries were sequenced on an Illumina NextSeq 500 or NextSeq 1000 platform with 300 cycle flow cells.

Wastewater sequence data processing

Processing of sequencing data followed a modification of our previously published pipeline¹⁹. After sequencing, Illumina adapter sequences were trimmed from read pairs using cutadapt version 4.2⁵². Sequencing reads were then mapped to the SARS-CoV-2 reference genome (NC_045512.2) using bwa mem, version 0.7.17-r1188⁵³. Paragon Genomics CleanPlex SARS-CoV-2 FLEX tiled-amplicon primers were trimmed from the aligned reads using fgbio TrimPrimers version 2.1.0 in hard-clip mode. Variants were called with iVar variants v1.4.1⁴⁰, reporting mutation sites and their alternative allele frequencies relative to the initial 2020 SARS-CoV-2 reference genome⁵⁴. Genome coverage and read depth were calculated using samtools v1.16.1⁵⁵. After removing duplicated samples and positive/negative controls, 2,684 samples remained for quality control (QC) analysis (Supplementary Fig. 1). Strict QC was enforced: only wastewater samples with 50x depth covering more than 80% of the SARS-CoV-2 genome were retained for the following analyses. Collectively, a total of 1,385 samples, from August 2021 to November 2023, covering 59,422 mutation sites of SARS-CoV-2 variants were used for the following analyses (Supplementary Figs. 1 and 2C-D).
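The QC rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes per-base depths are already available as a list of integers (for example, parsed from `samtools depth -a` output), and the function names `breadth_at_depth` and `passes_qc` are hypothetical.

```python
def breadth_at_depth(depths, min_depth=50):
    """Fraction of genome positions covered by at least `min_depth` reads."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def passes_qc(depths, min_depth=50, min_breadth=0.80):
    """True when more than `min_breadth` of positions reach `min_depth`."""
    return breadth_at_depth(depths, min_depth) > min_breadth


# Toy example (hypothetical data): 10 positions, 9 of them at >= 50x depth.
toy_depths = [120, 80, 55, 50, 49, 300, 75, 60, 51, 90]
print(breadth_at_depth(toy_depths))
print(passes_qc(toy_depths))
```

In practice the list would hold one depth value per base of NC_045512.2, and the same function could be reused with the relaxed thresholds explored in the simulations.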

Public health sample analyses

We analyzed 8,810 high-coverage clinical SARS-CoV-2 sequences from Nevada, spanning September 2021 to November 2023, downloaded from GISAID (Supplementary Fig. 10).

Retrospective independent component analysis of Variants (ICA-Var)

Mathematically, let $Y \in R^{1{,}385 \times 59{,}422}$ denote the 59,422 mutation frequencies (i.e., the proportion of reads at a site that contain the mutation) from 1,385 wastewater samples. Since wastewater samples are aggregations of genomes from multiple infected individuals with various virus lineages, $Y$ can be considered a multivariate mixed signal of SARS-CoV-2 variants spanning the local community. The data-driven ICA approach separates this multivariate signal into additive subcomponents⁴⁵: $Y = AS$, where $A \in R^{1{,}385 \times n_{ica}}$ denotes the mixing matrix and $S \in R^{n_{ica} \times 59{,}422}$ represents the source matrix (Fig. 1A, shaded gray box and Supplementary Fig. 9). In ICA-Var, the number of ICA components ( $n_{ica}$ ) was determined using the minimum description length (MDL) criterion, and the fastICA algorithm was used to perform ICA. Similar to the Bayesian information criterion (BIC), MDL penalizes model complexity more strongly than the Akaike information criterion (AIC), leading to a preference for simpler models⁵⁶. In our analysis, ICA was repeated 50 times with different initial values, and the components from each run were clustered and visualized using ICASSO⁵⁷, a software package for investigating the reliability of ICA estimates by clustering and visualization. Only reliable estimates corresponding to tight clusters were retained as final sources ( $S$ , Supplementary Fig. 1A).

The original ICA method assumes that all sources (i.e., subcomponents S) are non-Gaussian and that the sources are statistically independent from one another⁴⁵. In analyzing our wastewater samples using ICA-Var, the independence comes mostly from the different mutation patterns and the different circulating time windows of each VoC^{42, 43–44}. We quantitatively validated this independence assumption by calculating pairwise Spearman’s correlation coefficients among the obtained sources (S), which revealed consistently low values (mean absolute correlation: 0.08 ± 0.08). The non-Gaussian nature of these components, another critical ICA assumption, derives primarily from the inherent sparsity in our input data matrix $Y$ and was statistically confirmed through Kolmogorov-Smirnov tests, which rejected the null hypothesis of a normal distribution for each source in S at α = 0.05. These validations ensure that the mathematical foundations of ICA are appropriately satisfied in our application. In this case, each source could represent a co-varying mutation pattern in a different time window, and the source matrix could therefore serve as a set of data-driven reference barcodes for mutation frequencies co-existing in wastewater samples. From this perspective, the ICA-Var pipeline can be considered as running a multivariate regression and determining the design matrix (S, i.e., reference barcodes) from the data under the constraint of independence of the sources. In our study, we conducted a retrospective analysis using ICA-Var on 1,385 wastewater samples (Supplementary Fig. 1) spanning from August 2021 to November 2023. We predicted that the ICA sources could capture the evolving dominant SARS-CoV-2 VoCs over time, each characterized by unique determinant mutation patterns.

Next, we identified a set of contributing mutations (i.e., significant mutations) for each source in $S$ by selecting mutation sites with values beyond mean ± 2 standard deviations (representing 4.55% of all mutation sites) in each row of S⁴⁹. The contribution of each mutation site reflects its overall prevalence across all samples analyzed by ICA-Var. A binary matrix $\hat{S}$ was then computed to retain only these contributing mutations (Supplementary Fig. 2B). The pipeline developed in this manuscript and the data used to generate the results are available at https://github.com/zhuangx15/ICAvar.
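The thresholding step can be illustrated with a short sketch. It is not taken from the published pipeline: the function name `significant_sites` is hypothetical, the two-sided reading of "mean ± 2 standard deviations" is an assumption, and the use of the population standard deviation (rather than the sample one) is likewise an assumption.

```python
import statistics


def significant_sites(source_row, n_sd=2.0):
    """Binarize one source row: 1 where a value lies outside mean +/- n_sd * SD."""
    mu = statistics.fmean(source_row)
    sd = statistics.pstdev(source_row)  # population SD (assumption)
    upper = mu + n_sd * sd
    lower = mu - n_sd * sd
    return [1 if (v > upper or v < lower) else 0 for v in source_row]


# Toy source row (hypothetical): 18 near-zero loadings and two strong ones.
row = [0.0] * 18 + [5.0, -5.0]
print(significant_sites(row))
```

Changing `n_sd` to 1 or 3 corresponds to the lenient and stringent thresholds considered in the sensitivity analysis.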

Dual-regression to back-project source matrix onto weekly wastewater samples

To further determine VoCs for each week, we performed a dual-regression analysis⁴⁶ to project the ICA source matrix (S) back onto weekly wastewater samples (Fig. 1A, shaded gray boxes and Supplementary Fig. 9). The term “dual-regression” stems from the two regression procedures used to estimate the source and de-mixing dynamics for each week against the original data. More specifically, let $y_{i} \in R^{Nsample_{i} \times 59{,}422}$ denote the mutation frequencies for the $Nsample_{i}$ samples in the $i^{th}$ week. We then 1) used the all-sample source matrix (S) as a set of source regressors in a general linear model (GLM) to find the week-specific de-mixing dynamics ( $a_{i} = y_{i} S^{-1}, a_{i} \in R^{Nsample_{i} \times n_{ica}}$ ) associated with the all-sample source matrix (S); and 2) used the week-specific de-mixing dynamics ( $a_{i}$ ) as a set of regressors in a second GLM to find the week-specific source matrix ( $s_{i} = a_{i}^{-1} y_{i}, s_{i} \in R^{n_{ica} \times 59{,}422}$ ) that was still associated with the all-sample source matrix (S). This process yields pairs of estimates forming a dual space, jointly providing the best approximation of the original all-sample ICA source matrix in each weekly sample. In summary, we obtained a dual-regressed week-specific source matrix $s_{i}$ for each of the 113 weeks from August 2021 to November 2023.
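The two regression steps can be written out as ordinary least-squares problems. The following is a sketch under one assumption: since neither $S$ nor $a_{i}$ is square, the inverses above are read as Moore-Penrose pseudo-inverses, which is the usual convention in dual regression. The superscript $+$ and the Frobenius-norm formulation are notation introduced here for illustration, and any intercept or demeaning used in the GLMs is not represented.

```latex
\begin{aligned}
a_{i} &= \arg\min_{a} \lVert y_{i} - a S \rVert_{F}^{2} = y_{i} S^{+},
  & S^{+} &= S^{\top} (S S^{\top})^{-1}, \\
s_{i} &= \arg\min_{s} \lVert y_{i} - a_{i} s \rVert_{F}^{2} = a_{i}^{+} y_{i},
  & a_{i}^{+} &= (a_{i}^{\top} a_{i})^{-1} a_{i}^{\top}.
\end{aligned}
```

The closed forms on the right hold when $S$ has full row rank and $a_{i}$ has full column rank (which requires $Nsample_{i} \geq n_{ica}$); otherwise the pseudo-inverse returns the minimum-norm least-squares solution. The dimensions agree with the text: $y_{i} S^{+}$ is $Nsample_{i} \times n_{ica}$ and $a_{i}^{+} y_{i}$ is $n_{ica} \times 59{,}422$.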

ICA source matrix annotation to SARS-CoV-2 VoCs

To delineate the VoCs each week, we annotated our dual-regressed ICA source matrices $s_{i}$ by comparing them against the known mutations in VoCs from clinical SARS-CoV-2 sequencing data (Fig. 1A, bottom row). We focused on 18 VoCs that circulated in Southern Nevada between 2021 and 2023. Because VoCs may share dominant mutations acquired during evolution, a hierarchical structure was formed from these 18 VoCs based on the phylogenetic tree from www.covspectrum.org (Fig. 1B, top panel). Dominant mutation sites for each VoC were determined as follows: 1) mutations with more than 90% prevalence among clinical sequences reported at www.covspectrum.org were retained; 2) for lineages in levels 2, 3, and 4 of the hierarchical tree, mutations that existed in their higher-level VoCs were excluded to maintain a unique determinant mutation set (Fig. 1A, bottom row); and 3) both substitution and deletion mutations were included in steps 1) and 2). The number of dominant mutations of each VoC is listed in parentheses in Fig. 1B, top panel.

We next binarized each row of the week-specific source matrix ( $s_{i}$ ) by keeping only mutations with values beyond mean ± 2 standard deviations, as these mutations contributed significantly to the source ( $\hat{s_{i}}$ ). We annotated the binarized week-specific source matrix ( $\hat{s_{i}}$ ) using the dominant mutations from the VoCs by computing the following six matrices: 1) Spearman’s rank correlation coefficient ( $ρ$ ); 2) sensitivity; 3) specificity; 4) area under the receiver operating characteristic curve (AUC); 5) F1 score; and 6) Jaccard Index (JI). As shown in the dendrogram of these six matrices (Supplementary Fig. 11C), JI and F1 score were highly similar to each other, as were AUC and sensitivity. The measure of specificity depends on the count of non-dominant mutations within each VoC. This count is arbitrarily determined and is influenced by the number of dominant mutations observed in other VoCs. Therefore, we established our annotation criteria using the F1 score, sensitivity, and Spearman’s correlation values (Supplementary Fig. 11B), further refined by the hierarchical level and number of dominant mutations of each VoC (detailed in Fig. 1B).
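For readers less familiar with these measures, the sketch below computes five of them for one binarized source against one VoC barcode. It is an illustration only: the function name `annotation_metrics` is hypothetical, the VoC's dominant-mutation set is treated as the reference and the binarized source as the prediction (an assumption), AUC is taken as the average of sensitivity and specificity as stated in the simulation section, and Spearman's correlation is omitted for brevity.

```python
def annotation_metrics(source_bits, voc_bits):
    """Compare a binarized source with a binary VoC barcode, site by site."""
    pairs = list(zip(source_bits, voc_bits))
    tp = sum(1 for s, v in pairs if s == 1 and v == 1)
    fp = sum(1 for s, v in pairs if s == 1 and v == 0)
    fn = sum(1 for s, v in pairs if s == 0 and v == 1)
    tn = sum(1 for s, v in pairs if s == 0 and v == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": (sensitivity + specificity) / 2,  # balanced accuracy
        "f1": f1,
        "jaccard": jaccard,
    }


# Toy example (hypothetical): 8 mutation sites, 2 shared between source and VoC.
source_bits = [1, 1, 1, 0, 0, 0, 0, 0]
voc_bits = [1, 1, 0, 1, 0, 0, 0, 0]
m = annotation_metrics(source_bits, voc_bits)
print(m)
```

Because F1 equals 2·JI / (1 + JI), the two measures are monotonically related, which is consistent with their close clustering in the dendrogram.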

We compared VoC annotations of the proposed pipeline against results from the state-of-the-art tool Freyja³⁵ (version 1.4.5, Fig. 1A, dashed boxes). For this comparison, Freyja was retrospectively and independently applied to each of the 1,385 samples, using a barcode comprising 18 VoCs, generated in October 2023. We organized samples into individual weeks, ranging from August 2021 to November 2023. In each week, if the results from Freyja indicated that any wastewater sample contained a VoC with an abundance exceeding 15%, we considered this VoC as detected by Freyja in that specific week (Supplementary Fig. 11A and Fig. 1C).
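The weekly detection rule applied to the Freyja results amounts to a simple "any sample above threshold" check. The sketch below assumes a hypothetical input format (one dictionary of VoC abundances per sample, expressed as fractions), which is not Freyja's native output format; the function name `vocs_detected_in_week` is likewise illustrative.

```python
def vocs_detected_in_week(sample_abundances, threshold=0.15):
    """Return the set of VoCs exceeding `threshold` in at least one sample."""
    detected = set()
    for abundances in sample_abundances:
        for voc, fraction in abundances.items():
            if fraction > threshold:
                detected.add(voc)
    return detected


# Toy week (hypothetical abundances for three samples).
week = [
    {"BA.1": 0.80, "B.1.617.2": 0.10},
    {"BA.1": 0.60, "B.1.617.2": 0.16},
    {"BA.2": 0.15},  # exactly 15% does not exceed the threshold
]
print(sorted(vocs_detected_in_week(week)))
```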

Potential novel mutations

Given that the identification of contributing mutation sites did not necessitate prior knowledge of reference barcodes for VoCs, ICA-Var provides a distinctive approach to discern emerging mutations. This capability extends to contributions that might give rise to novel lineages across local communities, even in the absence of clinical sequencing data. From this perspective, we focused on capturing the time-evolving contributions of significant mutations in each week-specific source matrix (Fig. 1A, top row), using: $|s_{i}^{(j)}| = T\beta + \epsilon$, where $i = 1, \dots, 113$, $s_{i}^{(j)}$ denoted the source value for the $j^{th}$ contributing mutation in the $i^{th}$ week, T denoted a time vector for 113 weeks from August 2021 to November 2023, and $\beta$ represented the time-evolving effect of each contributing mutation. Since flipping the signs of the de-mixing matrix in ICA would flip the sign of the source matrix⁴⁵, we focused on the amplitude (absolute value) of each week-specific source $(s_{i})$.
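A minimal sketch of this trend regression for a single mutation is given below. It is illustrative only: the function name `time_trend` is hypothetical, weeks are coded 1, 2, …, n, an intercept is included in the fit (an assumption, since the model above does not state one), and only the slope and its t-statistic are returned, leaving the significance test and any multiple-comparison correction to the reader.

```python
import math


def time_trend(source_values):
    """OLS fit of |source value| on week index; returns (beta, t_statistic)."""
    y = [abs(v) for v in source_values]  # sign of ICA sources is arbitrary
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 weeks to estimate a trend")
    weeks = list(range(1, n + 1))
    t_mean = sum(weeks) / n
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in weeks)
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in zip(weeks, y))
    beta = sxy / sxx
    intercept = y_mean - beta * t_mean
    ssr = sum((v - (intercept + beta * t)) ** 2 for t, v in zip(weeks, y))
    se = math.sqrt(ssr / (n - 2) / sxx)  # standard error of the slope
    if se == 0:
        return beta, (math.inf if beta != 0 else 0.0)
    return beta, beta / se


# Toy example (hypothetical): four weekly source values with mixed signs.
beta, t_stat = time_trend([-1.0, 2.0, -2.0, 3.0])
print(beta, t_stat)
```

A positive slope with a large t-statistic would correspond to a mutation whose contribution grows over the study period; a negative one to a mutation that fades.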

A significant $β$ indicated a critical time-changing contribution for this mutation from 2021 to 2023. As a proof of concept, we cross-checked mutations with significant $β$ against known dominant mutations in Delta (B.1.617.2), Omicron (BA.1), and more recent XBB.1 variants, and examined their time-evolving contributions. Following clinical reports, dominant mutations in Delta variant (B.1.617.2) should demonstrate significant contributions in 2021, dominant mutations in Omicron variant (BA.1) should demonstrate significant contributions from late 2021 to 2022, and dominant mutations in XBB.1 variants should demonstrate significant contributions after late 2022.

Subsequently, we refined our pool of potential novel mutations by cross-referencing known SARS-CoV-2 variants with mutations exhibiting significant time-evolving contributions. Among the mutations retained, those demonstrating emerging contributions in more recent weeks were identified as candidates with the potential to give rise to novel lineages. To delve deeper, we examined their time-evolving contributions over the most recent three months, from August 2023 to November 2023. We further performed hierarchical clustering on these mutations, using their contributions over the three months as features, to identify co-varying patterns among them. To validate the identified co-varying patterns, we examined the mutations within each cluster and cross-referenced them with dominant mutations from the emerging VoCs EG.5, HV.1, and BA.2.86.

The significant contributing mutation sites identified by ICA-Var did not necessitate prior knowledge of reference barcodes for VoCs, and thereby provided a data-driven candidate pool that could be aligned with both known and emerging variant mutation profiles. To further evaluate how different thresholds affect the identification of significant mutations, we conducted a sensitivity analysis by altering this parameter. Compared to the reported mean ± 2 standard deviations used in thresholding ICA source matrix S, we considered mean ± 1 standard deviation and mean ± 3 standard deviations as lenient and stringent thresholds, respectively. With each threshold, we identified mutations with significant time-evolving contributions and cross-referenced them with known and emerging variant mutation profiles.

Simulation to demonstrate the number of samples and QC metrics required for ICA-Var

For ground truth in our simulations, we designated wastewater samples with Freyja-called variant abundance greater than 95% for an individual variant as “real” samples. Approximately 25% of our wastewater samples qualified as “real” for a single variant. We generated simulated datasets by combining varying numbers of “real” samples ( $[1 : 1 : 10, 12 : 2 : 20, 25 : 5 : 50]$ ) for each SARS-CoV-2 variant under different QC metrics (50x, 20x, or 10x depth at >80% or >40% coverage). Only variants with at least 10 “real” samples under each QC metric were included, causing the number of analyzed variants to vary slightly under different QC metrics.
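The sample-count grid above can be expanded explicitly. The sketch assumes the bracketed expression uses MATLAB-style `start : step : stop` notation with inclusive endpoints; the helper name `matlab_range` is hypothetical.

```python
def matlab_range(start, step, stop):
    """Expand start:step:stop with an inclusive upper bound."""
    return list(range(start, stop + 1, step))


# Numbers of "real" samples combined per variant in the simulations.
sample_counts = (
    matlab_range(1, 1, 10) + matlab_range(12, 2, 20) + matlab_range(25, 5, 50)
)
print(sample_counts)
print(len(sample_counts))
```

Under this reading, the grid is dense at small sample sizes and coarser at larger ones.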

This process was repeated for “mixed” samples, i.e., samples with >55% Freyja-calling abundance for an individual VoC. The corresponding VoC with >55% Freyja-calling abundance was considered the ground truth for each sample in the simulation. This updated criterion significantly increased the number of VoCs in the simulation, thereby enhancing the evaluation of ICA-Var’s performance in generating data-driven reference barcodes for VoCs with both unique and overlapping mutations from mixed datasets.

We processed these simulated datasets through ICA-Var and compared the resulting group independent sources against Freyja’s clinical barcodes to assess their reliability as data-driven reference barcodes. We evaluated performance using sensitivity and area under the ROC curve (AUC, the average of sensitivity and specificity). These measures were plotted against the number of “real” samples used for each variant under each QC metric. We used elbow criteria on sensitivity curves to determine the optimal number of samples needed for reliable barcodes.

After establishing reliable data-driven barcodes, we tested variant detection performance under various conditions. Using barcodes generated from high-quality data (50x depth, 80% coverage) with 20 samples per variant, we simulated detection of B.1.617.2, BA.1, BA.2, BF.7, BQ.1, and XBB.1.5.

We assembled “weekly” datasets in two ways. First, to compare performance with Freyja, we used “real” samples (>95% abundance) under various QC metrics. Second, to demonstrate ICA-Var’s advantage in early detection, we used samples with low variant prevalence (1–10% abundance, with few dominant mutation sites and low mutation frequencies). These low-abundance simulations were conducted separately for each variant to avoid cross-variant interactions.

Weekly sources were generated by projecting data-driven barcodes onto simulated weekly data using dual-regression. These were compared against Freyja’s barcodes and evaluated using our calling pipeline (Fig. 1B) to determine variant presence.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

This work is supported by NIH grants: GM103440 (VV, ECO), MH109706 (VV, ECO), AG071566 (XZ, DC); a CARES Act grant from the Nevada Governor’s Office of Economic Development (VV, ECO); Centers for Disease Control and Prevention grants NH75OT000057-01-00 (VV, CL, DG, HK, and ECO). The project contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention. This work is additionally supported by the Nevada Water Resources Research Institute/USGS under Grant/Cooperative Agreement No. G21AP10578 through the Division of Hydrologic Sciences at Desert Research Institute (DPM). We would like to acknowledge personnel at DRI and the collaborating wastewater agencies for their assistance with sample logistics and data access. Special thanks to Daniel Fischer, Latoya Blanche, LeAnna Risso and Alycia S. Ybarra; to Sean Twomey, James Eason, Bill Coates, Brian Magna, Jose Rodriguez, George Veliz, Deborah Woodland; and to Chad Marchand, Teresa Gomez, and Jeremy Singleton for contributing samples from Wastewater Treatment Plants. We additionally thank Dr. Mira Han for her time reviewing this manuscript.

Author contributions

Conception and Design: X.Z., V.V., C.L., D.G., H.Y.K., E.C.O.; Data Acquisition: V.V., M.M., K.D., N.G., S. Akbar, C.L.C., A.K.Y., E.B., W.B., H.Z., S. Afzal, D.P.M.; Data Analysis: X.Z., V.V., M.M., C.L.C., D.C., E.C.O.; Manuscript Preparation: X.Z., V.V., M.M., E.C.O.

Peer review

Peer review information

Nature Communications thanks David Larsen, and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.

Data availability

The sequencing data used in this manuscript are available at SRA via BioProject accession PRJNA1275998. SARS-CoV-2 clinical multiple sequence alignments and their associated metadata used in the clinical data analysis and database construction are available via GISAID. The dataset used to generate the results is available at https://github.com/zhuangx15/ICAvar. Source data for each figure are provided with this paper.

Code availability

The pipeline developed in this manuscript and the data used to generate the results are available at https://github.com/zhuangx15/ICAvar.

Competing interests

The authors declare no competing interests.

Supplementary information

The online version contains supplementary material available at https://doi.org/10.1038/s41467-025-61280-5.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Espinoza, B. et al. Coupled models of genomic surveillance and evolving pandemics with applications for timely public health interventions. Proc. Natl. Acad. Sci. USA 120, e2305227120 (2023).

2. Li, J; Lai, S; Gao, GF; Shi, W. The emergence, genomic diversity and global spread of SARS-CoV-2. Nature; 2021; 600, pp. 408-418. [DOI: https://dx.doi.org/10.1038/s41586-021-04188-6] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34880490]

3. Ling-Hu, T., Rios-Guzman, E., Lorenzo-Redondo, R., Ozer, E. A. & Hultquist, J. F. Challenges and opportunities for global genomic surveillance strategies in the COVID-19 era. Viruses 14, 2532 (2022).

4. Robishaw, JD et al. Genomic surveillance to combat COVID-19: challenges and opportunities. Lancet Microbe; 2021; 2, pp. e481-e484. [DOI: https://dx.doi.org/10.1016/S2666-5247(21)00121-X] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34337584][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8315763]

5. Cuadros, DF; Branscum, AJ; Mukandavire, Z; Miller, FDW; MacKinnon, N. Dynamics of the COVID-19 epidemic in urban and rural areas in the United States. Ann. Epidemiol.; 2021; 59, pp. 16-20. [DOI: https://dx.doi.org/10.1016/j.annepidem.2021.04.007] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33894385][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8061094]

6. Paul, R; Arif, AA; Adeyemi, O; Ghosh, S; Han, D. Progression of COVID-19 from urban to rural areas in the United States: a spatiotemporal analysis of prevalence rates. J. Rural Health; 2020; 36, pp. 591-601. [DOI: https://dx.doi.org/10.1111/jrh.12486] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32602983][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7361905]

7. Souch, JM; Cossman, JS. A commentary on rural‐urban disparities in COVID‐19 testing rates per 100,000 and risk factors. J. Rural Health; 2021; 37, 188. [DOI: https://dx.doi.org/10.1111/jrh.12450] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32282964]

8. Jones, M., Bhattar, M., Henning, E. & Monnat, S. M. Explaining the U.S. rural disadvantage in COVID-19 case and death rates during the Delta-Omicron surge: the role of politics, vaccinations, population health, and social determinants. Soc. Sci. Med. 335, 116180 (2023).

9. Fontenele, RS et al. High-throughput sequencing of SARS-CoV-2 in wastewater provides insights into circulating variants. Water Res.; 2021; 205, 117710. [DOI: https://dx.doi.org/10.1016/j.watres.2021.117710] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34607084][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8464352]

10. Gerrity, D., Papp, K., Stoker, M., Sims, A. & Frehner, W. Early-pandemic wastewater surveillance of SARS-CoV-2 in Southern Nevada: methodology, occurrence, and incidence/prevalence considerations. Water Res. X 10, 100086 (2021).

11. Medema, G; Been, F; Heijnen, L; Petterson, S. Implementation of environmental surveillance for SARS-CoV-2 virus to support public health decisions: opportunities and challenges. Curr. Opin. Environ. Sci. Health; 2020; 17, pp. 49-71. [DOI: https://dx.doi.org/10.1016/j.coesh.2020.09.006] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33024908][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528975]

12. Ahmed, W. et al. First confirmed detection of SARS-CoV-2 in untreated wastewater in Australia: a proof of concept for the wastewater surveillance of COVID-19 in the community. Sci. Total Environ. 728, 138764 (2020).

13. Hart, O. E. & Halden, R. U. Computational analysis of SARS-CoV-2/COVID-19 surveillance by wastewater-based epidemiology locally and globally: feasibility, economy, opportunities and challenges. Sci. Total Environ. 730, 138875 (2020).

14. Crank, K., Chen, W., Bivins, A., Lowry, S. & Bibby, K. Contribution of SARS-CoV-2 RNA shedding routes to RNA loads in wastewater. Sci. Total Environ. 806, 150376 (2022).

15. Wölfel, R et al. Virological assessment of hospitalized patients with COVID-2019. Nature; 2020; 581, pp. 465-469. [DOI: https://dx.doi.org/10.1038/s41586-020-2196-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32235945]

16. Jakariya, M. et al. Wastewater-based epidemiological surveillance to monitor the prevalence of SARS-CoV-2 in developing countries with onsite sanitation facilities. Environ. Pollut. 311, 119679 (2022).

17. Stockdale, S. R. et al. RNA-Seq of untreated wastewater to assess COVID-19 and emerging and endemic viruses for public health surveillance. Lancet Reg. Health Southeast Asia 14, 100205 (2023).

18. Betancourt, W. Q. et al. COVID-19 containment on a college campus via wastewater-based epidemiology, targeted clinical testing and an intervention. Sci. Total Environ. 779, 146408 (2021).

19. Vo, V. et al. Use of wastewater surveillance for early detection of Alpha and Epsilon SARS-CoV-2 variants of concern and estimation of overall COVID-19 infection burden. Sci. Total Environ. 835, 155410 (2022).

20. Harrington, A. et al. Environmental surveillance of flood control infrastructure impacted by unsheltered individuals leads to the detection of SARS-CoV-2 and novel mutations in the spike gene. Environ. Sci. Technol. Lett. https://doi.org/10.1021/ACS.ESTLETT.3C00938 (2024).

21. Kayikcioglu, T. et al. Performance of methods for SARS-CoV-2 variant detection and abundance estimation within mixed population samples. PeerJ 11, e14596 (2023).

22. Amman, F et al. Viral variant-resolved wastewater surveillance of SARS-CoV-2 at national scale. Nat. Biotechnol.; 2022; 40, pp. 1814-1822. [DOI: https://dx.doi.org/10.1038/s41587-022-01387-y] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35851376]

23. Brunner, FS et al. City-wide wastewater genomic surveillance through the successive emergence of SARS-CoV-2 alpha and delta variants. Water Res.; 2022; 226, 119306. [DOI: https://dx.doi.org/10.1016/j.watres.2022.119306] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36369689][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9614697]

24. Vo, V et al. Detection of the Omicron BA.1 variant of SARS-CoV-2 in wastewater from a Las Vegas tourist area. JAMA Netw. Open; 2023; 6, E230550. [DOI: https://dx.doi.org/10.1001/jamanetworkopen.2023.0550] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36821109][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9951036]

25. Harrington, A. et al. Urban monitoring of antimicrobial resistance during a COVID-19 surge through wastewater surveillance. Sci. Total Environ. 853, 158577 (2022).

26. Vo, V. et al. Identification of a rare SARS-CoV-2 XL hybrid variant in wastewater and the subsequent discovery of two infected individuals in Nevada. Sci. Total Environ. 858 (2023).

27. Crits-Christoph, A et al. Genome sequencing of sewage detects regionally prevalent SARS-CoV-2 variants. mBio; 2021; 12, pp. 1-9. [DOI: https://dx.doi.org/10.1128/mBio.02703-20]

28. Wurtz, N. et al. Monitoring the circulation of SARS-CoV-2 variants by genomic analysis of wastewater in Marseille, South-East France. Pathogens 10, 1042 (2021).

29. Izquierdo-Lara, R et al. Monitoring SARS-CoV-2 circulation and diversity through community wastewater sequencing, the Netherlands and Belgium. Emerg. Infect. Dis.; 2021; 27, pp. 1405-1415. [DOI: https://dx.doi.org/10.3201/eid2705.204410] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33900177][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8084483]

30. Bar-Or, I. et al. Detection of SARS-CoV-2 variants by genomic analysis of wastewater samples in Israel. Sci. Total Environ. 789, 148002 (2021).

31. Rainey, A. L. et al. Wastewater surveillance for SARS-CoV-2 in a small coastal community: effects of tourism on viral presence and variant identification among low prevalence populations. Environ. Res. 208, 112496 (2022).

32. Pérez-Cataluña, A. et al. Spatial and temporal distribution of SARS-CoV-2 diversity circulating in wastewater. Water Res. 211, 118007 (2022).

33. Gregory, D. A., Wieberg, C. G., Wenzel, J., Lin, C. H. & Johnson, M. C. Monitoring SARS-CoV-2 populations in wastewater by amplicon sequencing and using the novel program Sam Refiner. Viruses 13, 1647 (2021).

34. Jahn, K et al. Early detection and surveillance of SARS-CoV-2 genomic variants in wastewater using COJAC. Nat. Microbiol.; 2022; 7, pp. 1151-1160. [DOI: https://dx.doi.org/10.1038/s41564-022-01185-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35851854][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9352586]

35. Karthikeyan, S et al. Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission. Nature; 2022; 609, pp. 101-108. [DOI: https://dx.doi.org/10.1038/s41586-022-05049-6] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35798029][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9433318]

36. Sapoval, N et al. Enabling accurate and early detection of recently emerged SARS-CoV-2 variants of concern in wastewater. Nat. Commun.; 2023; 14, pp. 1-7. [DOI: https://dx.doi.org/10.1038/s41467-023-38184-3]

37. Valieris, R et al. A mixture model for determining SARS-Cov-2 variant composition in pooled samples. Bioinformatics; 2022; 38, pp. 1809-1815. [DOI: https://dx.doi.org/10.1093/bioinformatics/btac047] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35104309]

38. Posada-Cespedes, S et al. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics; 2021; 37, pp. 1673-1680. [DOI: https://dx.doi.org/10.1093/bioinformatics/btab015] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33471068][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289377]

39. Wilm, A et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res.; 2012; 40, pp. 11189-11201. [DOI: https://dx.doi.org/10.1093/nar/gks918] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23066108][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3526318]

40. Grubaugh, ND et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol.; 2019; 20, pp. 1-19. [DOI: https://dx.doi.org/10.1186/s13059-018-1618-7]

41. Singer, AC et al. A world of wastewater-based epidemiology. Nat. Water; 2023; 1, pp. 408-415. [DOI: https://dx.doi.org/10.1038/s44221-023-00083-8]

42. Yamasoba, D et al. Virological characteristics of the SARS-CoV-2 omicron XBB.1.16 variant. Lancet Infect. Dis.; 2023; 23, pp. 655-656. [DOI: https://dx.doi.org/10.1016/S1473-3099(23)00278-5] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37148902][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10156138]

43. Araf, Y et al. Omicron variant of SARS-CoV-2: genomics, transmissibility, and responses to current COVID−19 vaccines. J. Med. Virol.; 2022; 94, pp. 1825-1832. [DOI: https://dx.doi.org/10.1002/jmv.27588] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35023191][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9015557]

44. Mlcochova, P et al. SARS-CoV-2 B.1.617.2 delta variant replication and immune evasion. Nature; 2021; 599, pp. 114-119. [DOI: https://dx.doi.org/10.1038/s41586-021-03944-y] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34488225][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8566220]

45. Hyvärinen, A; Karhunen, J; Oja, E. Independent component analysis. Appl Comput Harmon Anal.; 2001; 21, pp. 135-144.

46. Beckmann, C; Mackay, C; Filippini, N; Smith, S. Group comparison of resting-state FMRI data using multi-subject ICA and dual regression. Neuroimage; 2009; 47, S148. [DOI: https://dx.doi.org/10.1016/S1053-8119(09)71511-3]

47. Groves, AR; Beckmann, CF; Smith, SM; Woolrich, MW. Linked independent component analysis for multimodal data fusion. Neuroimage; 2011; 54, pp. 2198-2217. [DOI: https://dx.doi.org/10.1016/j.neuroimage.2010.09.073] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20932919]

48. Beckmann, CF; DeLuca, M; Devlin, JT; Smith, SM. Investigations into resting-state connectivity using independent component analysis. Philos. Trans. R. Soc. B: Biol. Sci.; 2005; 360, pp. 1001-1013. [DOI: https://dx.doi.org/10.1098/rstb.2005.1634]

49. Calhoun, V. D., Adali, T., Pearlson, G. D. & Pekar, J. J. A method for making group inferences from functional MRI data using independent component analysis. https://doi.org/10.1002/hbm (2001).

50. Hasing, M. E. et al. Wastewater surveillance monitoring of SARS-CoV-2 variants of concern and dynamics of transmission and community burden of COVID−19. Emerg. Microbes. Infect. 12, 2233638 (2023).

51. Reynolds, LJ et al. SARS-CoV-2 variant trends in Ireland: wastewater-based epidemiology and clinical surveillance. Sci. Total Environ.; 2022; 838, 155828. [DOI: https://dx.doi.org/10.1016/j.scitotenv.2022.155828] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35588817][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9110007]

52. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J.; 2011; 17, pp. 10-12. [DOI: https://dx.doi.org/10.14806/ej.17.1.200]

53. Li, H; Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics; 2009; 25, pp. 1754-1760. [DOI: https://dx.doi.org/10.1093/bioinformatics/btp324] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19451168] [PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234]

54. Wu, F et al. A new coronavirus associated with human respiratory disease in China. Nature; 2020; 579, pp. 265-269. [DOI: https://dx.doi.org/10.1038/s41586-020-2008-3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32015508] [PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7094943]

55. Li, H et al. The sequence alignment/Map format and SAMtools. Bioinformatics; 2009; 25, pp. 2078-2079. [DOI: https://dx.doi.org/10.1093/bioinformatics/btp352] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19505943] [PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002]

56. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer New York, 2009).

57. Himberg, J; Hyvärinen, A. ICASSO: software for investigating the reliability of ICA estimates by clustering and visualization. Neural Netw. Signal Process. - Proc. IEEE Workshop; 2003; 2003-Janua, pp. 259-268.

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Genome sequencing from wastewater enables accurate and cost-effective identification of SARS-CoV-2 variants. However, existing computational pipelines have limitations in detecting emerging variants not yet characterized in humans. Here, we present an unsupervised learning approach that clusters co-varying and time-evolving mutation patterns to identify SARS-CoV-2 variants. To build our model, we sequence 3659 wastewater samples collected over two years from urban and rural locations in Southern Nevada. We then develop a multivariate independent component analysis (ICA)-based pipeline to transform mutation frequencies into independent sources. These data-driven time-evolving and co-varying sources are compared to 8810 SARS-CoV-2 clinical genomes from Nevadans. Our method accurately detects the Delta variant in late 2021, Omicron variants in 2022, and emerging recombinant XBB variants in 2023. Our approach also reveals the spatial and temporal dynamics of variants in both urban and rural regions; achieves earlier detection of most variants compared to other computational tools; and uncovers unique co-varying mutation patterns not associated with any known variant. The multivariate nature of our pipeline boosts statistical power and supports accurate early detection of SARS-CoV-2 variants. This feature offers a unique opportunity to detect emerging variants and pathogens, even in the absence of clinical testing.

Wastewater surveillance can support pandemic and outbreak response. Here, the authors report an unsupervised learning approach to detect emerging SARS-CoV-2 variants from rural and urban wastewater, showing that it achieves earlier detection than existing methods and detects new variants without clinical testing data.

Details

Title

Early detection of emerging SARS-CoV-2 Variants from wastewater through genome sequencing and machine learning

Author

Zhuang, Xiaowei¹; Vo, Van²; Moshi, Michael A.³; Dhede, Ketan³; Ghani, Nabih²; Akbar, Shahraiz²; Chang, Ching-Lan³; Young, Angelia K.⁴; Buttery, Erin⁴; Bendik, William⁴; Zhang, Hong⁴; Afzal, Salman⁴; Moser, Duane⁵; Cordes, Dietmar⁶; Lockett, Cassius⁴; Gerrity, Daniel⁷; Kan, Horng-Yuan⁴; Oh, Edwin C.⁸

¹ University of Nevada Las Vegas, Laboratory of Neurogenetics and Precision Medicine, College of Sciences, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); University of Nevada Las Vegas, Neuroscience Interdisciplinary Ph.D. program, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); Cleveland Clinic Lou Ruvo Center for Brain Health, Las Vegas, USA (GRID:grid.239578.2) (ISNI:0000 0001 0675 4725)
² University of Nevada Las Vegas, Laboratory of Neurogenetics and Precision Medicine, College of Sciences, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926)
³ University of Nevada Las Vegas, Laboratory of Neurogenetics and Precision Medicine, College of Sciences, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); University of Nevada Las Vegas, Neuroscience Interdisciplinary Ph.D. program, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926)
⁴ Southern Nevada Health District, Las Vegas, USA (GRID:grid.422451.4) (ISNI:0000 0004 0383 2216)
⁵ Desert Research Institute, Division of Hydrologic Sciences, Las Vegas, USA (GRID:grid.474431.1) (ISNI:0000 0004 0525 4843)
⁶ Cleveland Clinic Lou Ruvo Center for Brain Health, Las Vegas, USA (GRID:grid.239578.2) (ISNI:0000 0001 0675 4725)
⁷ P.O. Box 99954, Southern Nevada Water Authority, Las Vegas, USA (GRID:grid.509521.a) (ISNI:0000 0000 9767 1388)
⁸ University of Nevada Las Vegas, Laboratory of Neurogenetics and Precision Medicine, College of Sciences, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); University of Nevada Las Vegas, Neuroscience Interdisciplinary Ph.D. program, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); University of Nevada Las Vegas, Department of Brain Health, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926); University of Nevada Las Vegas, Department of Internal Medicine, Kirk Kerkorian School of Medicine at UNLV, Las Vegas, USA (GRID:grid.272362.0) (ISNI:0000 0001 0806 6926)

Pages

6272

Publication year

2025

Publication date

2025

Publisher

Nature Publishing Group

e-ISSN

20411723

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41467-025-61280-5

ProQuest document ID

3227750527
