INTRODUCTION
Despite covering less than 1% of the Earth's surface, freshwater ecosystems play a disproportionate role in biogeochemical and ecological processes (Dudgeon et al., 2006). They are incredibly diverse comprising more than 400 large-scale ecoregions (Abell et al., 2008), with rivers, lakes, and wetlands combined being home to more than 10% of all described species on Earth (fungi, plants, invertebrates, and vertebrates; Reid et al., 2019; Albert et al., 2021). Freshwater ecosystems also provide critical services to humans including drinking water, food, transportation, sanitation, and hydroelectric energy. The heavy reliance of human societies on freshwater has been a key driver of freshwater ecosystem decline over the past century. Habitat destruction, pollution, invasive species, overharvesting, all overlain by climate change, have contributed to rapid declines of freshwater biodiversity (IPBES, 2019). For example, 70% of the world's wetlands have been lost since 1900 (Davidson, 2014) and one-third of the freshwater species assessed by the International Union for Conservation of Nature (IUCN) Red List are threatened with extinction (IUCN, 2017). There is no one-fits-all solution to protect freshwater ecosystems: they are complex and diverse, with high levels of endemism, and face heterogeneous threats that vary in frequency, magnitude, and duration (Ahmed et al., 2022; Dudgeon et al., 2006). These perturbations can markedly disrupt ecological structure and function (Angeler et al., 2014; Dudgeon et al., 2006), so questions pertaining to biodiversity conservation, stewardship, and restoration of aquatic ecosystems necessarily involve quantification of interactions among species (e.g., effect of invasive species on native communities, altered trophic complexity, and trophic cascades; Higgins & Vander Zanden, 2010; Karatayev & Burlakova, 2022; Piczak et al., 2023). It is crucial that we develop, validate, and standardize monitoring tools that allow rapid and comprehensive assessment of freshwater communities to better characterize community composition and change in responses to perturbations to inform conservation (Ficetola & Taberlet, 2022).
Environmental DNA (eDNA) approaches are increasingly used for freshwater monitoring, complementing traditional methods that rely on sampling organisms, like seining (Andres et al., 2023; Berger et al., 2020; Gibson et al., 2023), benthic invertebrate surveys (Duarte et al., 2021; Ji et al., 2022), or electrofishing (Goutte et al., 2020). While current eDNA-based approaches are limited to presence/absence or estimates of relative abundance (Beng & Corlett, 2020; Zaiko et al., 2018), they typically capture a greater range of species, including those that are rare and elusive, or detect some species not captured by traditional means (Fediajevaite et al., 2021; Hallam et al., 2021; Keck et al., 2021; Lyet et al., 2021; Maracle et al., 2024; Sales et al., 2020). The growing importance of eDNA methods to support freshwater conservation is highlighted by the fact that national and international conservation organizations and companies have begun to develop eDNA-based programs or services. For example, the International Union for the Conservation of Nature (IUCN) has launched eBioAtlas, a program that aims to fill global freshwater biodiversity gaps using eDNA technology on >30,000 freshwater samples from around the world. However, in almost 20 years of eDNA studies, evaluations of entire biotic communities remain rare, likely due to the substantial technical challenges and methodological constraints (Ficetola & Taberlet, 2022). The two major methodological limitations that influence detection success (species presence/absence) and the taxonomic resolution of such eDNA surveys are: (i) incomplete reference databases, and (ii) limits to taxonomic specificity and complementarity of the primer pair(s). Reference databases are usually geographically and taxonomically biased (Marques et al., 2021; Weigand et al., 2019) and this can only be solved by more collaborations between taxonomists and molecular biologists, such as the Barcode of Life initiative that aims to create a public collection of reference sequences from vouchered specimens of all species (Costa & Carvalho, 2007; Ratnasingham & Hebert, 2007). Databases must also be carefully curated and quality-checked as user-entered data can contain many errors associated with the production and deposition of the sequences (Keck et al., 2023). There is no consensus on the best strategy to overcome primer bias such as PCR affinity (Freeland, 2017). Current approaches combine group-specific primer pairs, universal primer pairs, or both universal and group-specific primer pairs for one or several genes (e.g., COI and 12S) (see review in Ficetola & Taberlet, 2022). While increasing the number of primer pairs augments the number of detected taxa, it also increases the cost, labor, and time of analyses, meaning all-inclusive, multi-marker community studies are rarely applied in monitoring programs. Ultimately, there is a trade-off between the number of primers to combine and the number of taxa detected. To our knowledge, despite the crucial importance of the primer pair choice in metabarcoding studies, there is currently only one set of tools for the selection of the optimal combination of eDNA metabarcoding primers (MultiBarcodeTools; Zhu & Iwasaki, 2023). It includes (i) a web platform (“MultiBarcodeSelector” and “MultiBarcodeMap”) which provides candidate primers for fish studies based on a list of targeted species and requirements (e.g., how many loci, amplicon length, geographic region), and (ii) a phylogeny-based pipeline (“MultiBarcodePipeline”) that evaluates the degree to which candidate primer pairs can distinguish among species. However, these in silico-based tools are limited by the presence of mitogenomes in the reference databases and the availability of information by geographic area. There is a need for a user-friendly tool, free of taxonomic, and reference sequence limitations that uses actual DNA metabarcoding data and that allows for highly customizable datasets and designation of focal taxonomic groups.
In this study, we had three objectives: (1) using water samples from natural systems, create an eDNA metabarcoding dataset with 14 primer pairs that were evaluated in silico, in vitro (five mock communities), and in vivo (water samples from aquaria); (2) create and evaluate a free, user-friendly online tool (SNIPe) to optimize primer pair selection for future surveys using either published or new eDNA metabarcoding data; and (3) using SNIPe, explore our 14 primer pair dataset (eDNA samples from natural water bodies) to derive subsets of informative, cost-effective primer pairs that maximize our ability to quantify species richness, spanning plankton, macrophytes, invertebrates, and vertebrates. Our case study focuses on the upper St. Lawrence River region in Ontario, Canada. The St. Lawrence River drains the Laurentian Great Lakes, which contain 21% of the world's surface freshwater; a system that has been affected by diverse anthropogenic stressors and has experienced widespread impacts on its environmental health including biodiversity (Smith et al., 2019). Loss of riparian habitats and coastal wetlands, as well as habitat degradation, are pronounced in the upper St. Lawrence River with a loss of approximately 70% of historic wetlands in South Ontario (Marbek, 2023).
MATERIALS AND METHODS
In silico evaluation
Selection of the genes and primer pairs: Completeness of reference databases
We identified the genes with the most complete reference database(s) for Eastern Canadian vertebrates (reptiles, amphibians, mammals, birds, and fish), invertebrates (zooplankton, annelids, and mollusks), aquatic plants and phytoplankton (e.g., cyanobacteria) by comparing the number of sequences available in NCBI (GenBank; Benson et al., 2008) and BOLD (Ratnasingham & Hebert, 2007) first in 2021 and then updated in 2023 before our final analyses. Six genes were evaluated: mitochondrial cytochrome oxidase 1—COI (GenBank, BOLD), mitochondrial cytochrome b—cytB (GenBank), mitochondrial 12S rRNA (GenBank), mitochondrial 16S rRNA (GenBank), nuclear 18S rRNA (GenBank) and chloroplast ribulose bisphosphate carboxylase rbcl (GenBank). The COI gene has already been identified as the “gold standard” for insect DNA barcoding (Elbrecht et al., 2019; Hebert et al., 2003; Meusnier et al., 2008; Porter & Hajibabaei, 2018; Ratnasingham & Hebert, 2007; Yu et al., 2012).
In silico evaluation of the sequences
We downloaded all GenBank and BOLD sequences of zooplankton, annelid, platyhelminth, mollusk, reptile, amphibian, fish, mammal, and bird families occurring in Ontario and Quebec using PrimerMiner (Elbrecht & Leese, 2016). The sequences of 13 orders of freshwater invertebrates, previously used in Elbrecht and Leese (2016, 2017), were also downloaded: Plecoptera, Ephemeroptera, Trichoptera, Coleoptera, Odonata, Megaloptera, Amphipoda, Isopoda, Diptera, Turbellaria, Polychaeta, Oligochaeta, and Hirudinea. As the program could not accommodate all sequence data simultaneously, a threshold of 2000 downloaded sequences was used as recommended by Elbrecht and Leese (2016). When available, the mitochondrial reads were aligned separately (i.e., for each group of taxa) using Multiple Alignment and Fast Fourier Transform (MAFFT) as implemented in Geneious Prime v7.1.9 (Kearse et al., 2012); the OTUs were mapped to the mitochondrial consensus sequence, and visualization of the alignments was made using PrimerMiner. The numbers of OTUs per group of taxa are detailed in Table S1.
We selected group-specific and universal primer pairs from the literature (Table 1): (i) three universal mitochondrial COI primer pairs (Folmer et al., 1994; Leray et al., 2013; Tournayre et al., 2020), (ii) one 16S rRNA (Klymus, Richter, et al., 2017) and two COI (Elbrecht & Leese, 2017; Leese et al., 2021) primer pairs for macroinvertebrates, (iii) one 12S rRNA (Miya et al., 2015) and one COI (Roy et al., 2018) fish primer pair, (iv) one rbcl plant primer pair (Coghlan et al., 2021), and (v) one 18S rRNA (Zhan et al., 2013) and one bacterial 16S rRNA (Armingohar et al., 2014) primer pair for plankton (cyanobacteria, diatoms, dinoflagellates, ciliates, rotifers, and crustaceans). Based on the sequence alignments, we modified four of these primer pairs to increase their in silico amplification success (e.g., degenerate bases to prevent mismatches): “16SMOL2,” “COIFishdegen,” “Uniorbcl,” and “UniPh” (Table 1).
TABLE 1 Details of primer pairs including primer names, target regions, sequences, focal taxa, amplicon size, and references.
Primer set | Gene | Sens | Primer name | Sequence (5′-3′) | Target region length (bp) | Reference | Target |
Leray | COI | F | mlCOIintF | GGWACWGGWTGAACWGTWTAYCCYCC | 313 | Leray et al., 2013 | Universal (metazoan) |
R | jgHCO2198 | TAIACYTCIGGRTGICCRAARAAYCA | |||||
Folmer | COI | F | LCO1490 | GGTCAACAAATCATAAAGATATTGG | 658 | Folmer et al., 1994 | Universal (metazoan) |
R | HCO2198 | TAAACTTCAGGGTGACCAAAAAATCA | |||||
MG2 | COI | F | MG2-LCO1490 | TCHACHAAYCAYAARGAYATYGG | 133 | Tournayre et al., 2020 | Designed for arthropods and vertebrates |
R | MG2-univR | ACYATRAARAARATYATDAYRAADGCRTG | |||||
B2 | COI | F | BF2 | GCHCCHGAYATRGCHTTYCC | 421 | Elbrecht & Leese, 2017 | Designed for freshwater invertebrates but also amplify vertebrates |
R | BR2 | TCDGGRTGNCCRAARAAYCA | |||||
fwhEPTD | COI | F | fwhF2 | GGDACWGGWTGAACWGTWTAYCCHCC | 142 | Leese et al., 2021 | Arthropod |
R | EPTDr2n | CAAACAAATARDGGTATTCGDTY | |||||
COIFish (R4) | COI | F | COIFishReg4F | TTAACATAAAACCTCCNGCHATYTC | 200 | Roy et al., 2018 | Fish |
R | FishR1 | TAGACTTCTGGGTGGCCAAAGAATCA | Ward et al., 2005 | ||||
COIFishdegen | COI | F | COIFishdegen | THAAYATRAARCCNCCNGCNRTHTC | 200 | This study | Fish |
R | FishR1degen | TANACYTCNGGRTGNCCRAARAATCA | |||||
MiFish | 12S | F | MiFish-U-F | GTCGGTAAAACTCGTGCCAGC | 172 | Miya et al., 2015 | Fish |
R | MiFish-U-R | CATAGTGGGGTATCTAATCCCAGTTTG | |||||
16SMOL | 16S | F | 16SMOL_F | RRWRGACRAGAAGACCCT | 183–310 | Klymus, Marshall, et al., 2017 | Mollusk |
R | 16SMOL_R | ARTCCAACATCGAGGT | |||||
16SMOL2 | 16S | F | 16SMOL_F2 | DRWRGACRANAAGACCCY | 183–310 | This study | Mollusk |
R | 16SMOL_R2 | ARKCCAACATCGAGGT | |||||
orbcl2 | rbcl | F | rbcl2-F | YGATGGACTTACNAGTCTTGATCGTTACAAAGG | 221 | Coghlan et al., 2021 | Plant |
R | rbcl2-R | GNCCATAYTTRTTCAATTTATCTCTTTCAACTTGGATNCC | |||||
Uniorbcl | rbcl | F | rbcl_UniF | GGRCTDACNAGYCTNGAYCGWTAYAARGG | 233 | This study | Plant |
R | rbcl_UniR | GDCCRTAYTTRTTHARYTTATCYCTYTC | |||||
Uni18S | 18S -V4 region | F | Uni18S | AGGGCAAKYCTGGTGCCAGC | 419 | Zhan et al., 2013 | Universal |
R | Uni18SR | GRCGGTATCTRATCGYCTT | |||||
UniPh | 18S -V4 region | F | UniPhF | AGGGCAAGYCTGGYGCCAGC | 310–620 | This study | Universal |
R | UniPhR | GABGGTATMTRAWCVTCYT | |||||
Uni16S | 16S-V3-V4 regions | F | 515F | ACTCCTACGGGAGGCAGCAG | 468 | Armingohar et al., 2014 | Plankton—prokaryote |
R | 806R | GGACTACHVGGGTTCTAAT |
Forward and reverse primer evaluations were done individually for each taxon using PrimerMiner v0.18. Unlike other software tools, PrimerMiner considers the position of mismatches (with higher penalties for 3′ mismatches), the type of mismatch (e.g., G-A and A-G have more severe impacts on PCR amplification; Stadhouders et al., 2010) and whether two mismatches are adjacent (adjacent mismatches lead to higher instability). The higher the penalty score calculated using PrimerMiner, the worse the primer is expected to perform. The mean penalty scores (MPS) of both primers were summed per taxon (forward + reverse).
Mock community preparation and sequencing
Sample collection and
Tissue and DNA samples were provided by the Royal Ontario Museum (ethanol-preserved tissues), the River Institute (frozen tissues from the Fish Identification Nearshore Survey program—FINS), Thousand Island National Park (frozen tissue from road kills), the Canadian Phycological Culture Centre (specimen from cultures), and Queen's University in Ontario (frozen and ethanol-preserved tissues, and DNA stored at −80°C). Detailed information on samples (type of sample, year, storage, and DNA extraction method) is available in Table S2.
Composition of the mock communities (
DNA concentration of each sample was assessed using the DeNovix dsDNA High Sensitivity 2 Points Assay kit, following manufacturer protocol. We created five mock communities (MC) using 145 taxa (see Table S3 for MC composition): (i) MC1 included all taxa (N = 91) with a final concentration ≥5 ng/μL, (ii) MC2 included only vertebrates (N = 73) with a final concentration ≥5 ng/μL, (iii) MC3 included only invertebrates (N = 8) with a final concentration ≥5 ng/μL, (iv) MC4 included only plants and plankton (N = 9) with a final concentration ≥5 ng/μL, and (v) MC5 included all tissue-derived DNA samples (N = 61) with a final concentration ≤5 ng/μL. One mammal species (gray four-eyed opossum, Philander opossum) was included in MC1 and MC2, as an alien control. As it is native to Central and South America and not present in Canada, its presence can be used to estimate the false assignment rate (see ‘Section 2.5’).
We collected 1 L water samples from seven tanks containing different aquatic species at the Brockville Aquatarium (Ontario, Canada) on October 22, 2021 using sterile 1 L Nalgene bottles. Disposable nitrile gloves were worn during sampling and changed between tanks. All equipment was decontaminated before and after use in a 10% bleach bath overnight and rinsed thoroughly with distilled water. The first tank contained 10 fish species (“BA1,” water volume = 75,708 L, Table S4). The water system of the second tank (“BA2,” one species only: longnose gar, Lepisosteus osseus) was shared with three other tanks (“Shoreline,” “Lake,” “Sturgeon”) for a total of 20 fish species and a water volume of 151,416 L (Table S4). The third tank (“BA4”) contained sea lamprey (Petromyzon marinus). The fourth tank (“BA5,” eastern musk turtle, Sternotherus odoratus) shared a water system with the gray treefrog (Dryophytes versicolor) tank. The fifth tank (“BA6,” eastern garter snake, Thamnophis sirtalis sirtalis) shared its water system with the American eel (Anguilla rostrata) tank. The sixth tank had American bullfrog (“BA8,” Lithobates catesbeianus). The seventh tank (“BA9,” spotted turtle Clemmys guttata) shared its water system with three other tanks (leopard frog, Lithobates pipiens, green frog, Lithobates clamitans, and mink frog, Lithobates septentrionalis). Water samples were stored in a cooler with ice packs for transport to the laboratory (under 4 h) for filtration.
Finally, we collected three 1 L surface water samples from each of six water bodies in southeastern Ontario (Canada) varying in type, size, and landscape context to maximize the detection of a broad range of species: one upland marsh “Barb” (2021/06/02, 44.52526/−76.37179), one woodland pond “Bonwill” (2021/06/03, 44.542148/−76.399359), one small lake “Elbow” (2021/06/06, 44.473328/−76.430148) plus three sites on the St Lawrence River: “FINS1” (Pointes-des-cascades, 2020/08/06, 45.33436462/−73.95526161), “FINS2” (Hoople NW, 2020/09/04, 44.98663186/−74.94213128) and “TINP” (Thousand Island National Park, 2021/10/27, 44.45343717/−75.85865145). All equipment (including waders) was bleached (10%) and rinsed thoroughly with distilled water before and after sampling between sites. Disposable nitrile gloves were worn for all sampling. The samples were stored in a cooler with ice packs for transport to the lab for the filtration (< 6 h).
We filtered water using polycarbonate filter membranes (PCTE; pore size: 1.0 μm, Sterilitech, Auburn, WA, USA) housed in a 47 mm in-line filter holder (Pall, Port Washington, NY, USA) with Masterflex 24 tubing (Masterflex, Gelsenkirchen, Germany) and a peristaltic pump (Wattera, Mississauga, ON, Canada). Prior to sample filtration and between filtration events, rinses with 1 L of 10% bleach followed by 1 L of distilled water were used to decontaminate the filter holder and tubing (i.e., to avoid cross-contamination between samples). The filter membranes were folded with sterile tweezers (dipped in 95% ethanol and flamed twice to disinfect) and individually stored in 2 mL microcentrifuge tubes containing 500 μL 2% (w/v) cetrimonium bromide extraction buffer (CTAB). Filters were stored at −20°C until eDNA extraction. The eDNA extractions of the filters were conducted using a chloroform-based method (Turner et al., 2014) with the following modifications: the DNA pellet was eluted in 100 μL pre-warmed AE buffer (55°C) to increase elution yields; and the 100 μL eDNA extracts were combined by site for a final volume of 300 μL. eDNA and negative controls of filtration and extraction were stored at −20°C until library preparation.
Library preparation
We used a two-step PCR protocol to amplify target regions and index the samples (Galan et al., 2018). The same PCR1 and PCR2 programs were used for all primer pairs. We included a PCR-negative control for each primer pair and batch of 95 samples. PCR1 was performed in 20 μL reaction volumes using 10 μL of 2X Taq FroggaMix (FroggaBio, Concord, ON, Canada), 5 μL of nuclease-free water (FroggaBio), 1 μL of forward primer (10 μM), 1 μL of reverse primer (10 μM), and 3 μL of eDNA extract. PCR1 cycling conditions consisted of an initial denaturation step at 95°C for 15 min, followed by 40 cycles of denaturation at 94°C for 30 s, annealing at 45°C for 45 s, and extension at 72°C for 2 min, followed by a final extension step at 72°C for 10 min (Tournayre et al., 2020). PCR1 products were checked on an agarose gel (1.5%) stained with RedSafe™. PCR2 was carried out in 24 μL reaction volumes using 12 μL of 2X Taq FroggaMix (FroggaBio, Concord, ON, Canada), 8 μL of nuclease-free water, 1 μL of each indexed primers i5 (10 μM) and i7 (10 μM) (8 bases each; Kozich et al., 2013) and 2 μL of PCR1 product. PCR2 started with an initial denaturation step of 95°C for 15 min, followed by 10 cycles of denaturation at 95°C for 40 s, annealing at 55°C for 45 s and extension at 72°C for 2 min followed by a final extension step at 72°C for 10 min. PCR2 products were again checked on an agarose gel (1.5%). For each primer pair, we pooled the replicates and samples for each sample type (aquarium samples, waterbody samples, and mock community samples), resulting in 37 libraries. We loaded 60 μL of each of these 37 libraries on an agarose gel (1.0%) stained with RedSafe™ and then excised bands with single-use scalpel blades on a UV transilluminator. Gel slices were placed in 2.0 mL sterile tubes and purified using a Wizard® SV Gel and PCR Clean-Up System (Promega, Madison, WI, USA) following the manufacturer's protocol. The 37 purified libraries were quantified using the NEBNext® Library Quant Kit for Illumina® (NEB E7630, New England Biolabs, Ipswich, MA, USA) kit on a Bio-Rad CFX96 qPCR instrument (Bio-Rad Laboratories, Hercules, CA, USA) and pooled in equimolar proportion to obtain a final library at 4 nM. All samples were processed on a single run: 10 pM of the library and 10% of PhiX control were sequenced on a MiSeq Illumina flow cell with a 600-cycle Reagent v3 kit (2×300 bp pair-end sequencing, Illumina, San Diego, CA, USA).
Bioinformatics and taxonomic assignments
Raw sequencing reads were demultiplexed using the MiSeq Reporter software based on the index combinations. We used the v1.7.6 Barque pipeline (Mathon et al., 2021) to process the reads (settings detailed in Appendix S1). Briefly, raw reads were trimmed using Trimmomatic v0.36 (Bolger et al., 2014), paired-end reads were merged using Flash v1.2.11 (Magoč & Salzberg, 2011), and merged-reads were split based on primer pair. Sequences of unexpected length were discarded, as were chimeric sequences using Vsearch v2.15.2 (Rognes et al., 2016). Unique reads were merged and assigned to species using Vsearch (≥98% identity, ≥60% coverage). We discarded all species highly unlikely to be present in our study region, including via lab contamination (e.g., polar bear, Ursus maritimus), marine species, species not occurring in Canada or within our study area, micro-organisms in the mock communities (e.g., Wolbachia, diatoms, cyanobacteria, crustaceans such as the freshwater ostracod, Cypridopsis vidua) that were likely unintentionally sampled along with target taxa. Due to uncertainty on species taxonomy, we considered bacterial taxa only at the genus level, excluding cyanobacteria. We then used (1) detections from the negative controls to filter for cross-contamination (TCC; Galan et al., 2016), (2) the alien control (gray four-eyed opossum) to filter positive detections due to the false assignment rate of reads (TFA; Galan et al., 2016), (3) the technical replicates to control for PCR and sequencing stochasticity (we considered a positive detection for a taxon to comprise at least two positive PCR replicates out of three), and (4) we only retained taxa with at least three reads in the eDNA water samples as an additional step to decrease false positive detections while minimizing false negatives. Last, we discarded potential redundancy: when sequences were identified at both the species level and the order or family or genus level, we kept only the identification at the species level (note that in some cases this strategy discarded a high number of reads; for example, 4528 reads for a midge, Dicrotendipes sp. for the primer pair fwhEPTD). We kept detection at the genus level if the potential species was not already identified at the species level (e.g., 16SMOL detected Ameiurus natalis and Ameiurus melas/nebulosus so we kept both) or if the genus level identification was detected in a different sample than we retained the species level identification (e.g., Tanytarsus sp. for primer pair fwhEPTD).
Statistical analyses
All the statistical analyses were done in R v4.3.1 (R Core Team, 2022). We ran a quasi-binomial generalized linear regression model (GLM) to assess the effect of the number of reads, target fragment size (bp), and the percentage of degenerate bases in the primer sequences (forward + reverse) (excluding null proportion). As we did not expect any particular number of detected taxa for the water eDNA samples collected in southeastern Ontario, we used instead a quasi-Poisson GLM and the number of detected taxa as the response variable (excluding null detections). A quasi-binomial GLM model and subsequent Analysis of Deviance Table (test = “F”) were performed to assess the effect of the gene, target fragment size (bp) and the percentage of degenerate bases in the primer sequences (forward + reverse) on the percentage of taxa identified at the species level. Note that we considered only the expected species that passed all the filtering criteria in the mock community and aquarium analyses. We assessed the significance of differences between pairs of group means by conducting a post-hoc analysis with the emmeans v1.10.0 package (Tukey method for p-value adjustment) (Lenth, 2023).
Selecting novel informative primer-sets for
We developed a new online tool with a graphical user interface for eDNA primer pair selection based on empirical species detection data from multiple primer pairs that may vary in taxonomic focus and resolution: SNIPe (Selecting Novel Informative Primer-sets for e-DNA) ().This new tool involves two steps: (1) “Choose or upload a dataset” allowing use of our default dataset using 14 primer pairs or by uploading new datasets or data collated from the literature; (2) “Discover primer pair sets” allowing users to choose the suite of taxa of interest and a maximum number of primer pairs desired, and then to explore the number of taxa detected for combinations of primer pairs from one to the maximum specified (Figure 5). Results include the percentage of recovered taxa per primer pair combination, an indication of which taxa would be added using “n” primer pairs versus combinations with “n−1” primer pairs, Euler diagrams to visualize overlap among primer pairs, data on taxonomic resolution for each primer pair combination, stacked histograms by primer pair and taxonomic group, and a stacked histogram showing taxon accumulation for different numbers of primer pairs. It is also possible to distinguish on- and off-target taxa detected by each combination of primer pairs. The eDNA dataset (water samples from the six sites in Eastern Ontario) from our study is included as the default dataset (available in Table S5) with the option for users to upload their own datasets for exploration.
RESULTS
In silico evaluation: Reference database and performance
Genes with the most exhaustive references databases
We assessed a total of 1031 taxa spanning 144 orders and 257 families: 27 reptiles, 26 amphibians, 180 fish, 16 semi-aquatic mammals, 72 aquatic birds, 101 mollusks, 33 annelids and platyhelminthes, five rotifers, 59 crustaceans, 116 plants, and 396 phytoplankton taxa (Table S6). The COI gene had the most exhaustive reference database for vertebrates and invertebrates (Figure S1A). Two species of branchiopod, one malacostracan, four annelids, one mollusk, and seven fish had no sequence available for any genes (Table S6). All plant species had rbcl sequences available (Figure S1B). Bacterial 16S had the most exhaustive reference database for cyanobacteria and 18S for eukaryotic phytoplankton taxa (Figure S1C). However, more than one-third of the phytoplankton species (37.8%) did not have sequences available for any of the tested genes (Table S6).
Evaluation in silico of the primer pairs
Overall MG2 (COI) and B2 (COI) had the lowest mean penalty scores for vertebrates (mammals, birds, reptiles, amphibians, and fishes) and insects (MPSMG2 and MPSB2 < 100) and are therefore expected to perform better than the other primers pairs for these groups of taxa (Figure 1). Unexpectedly, the insect-specific fwhEPTD primer pair showed higher penalty scores (73.0 < MPS < 395.1) than primer pairs not specific to insects, especially for Trichoptera (MPS = 395.1), Ephemeroptera (MPS = 187.2), and Odonata (MPS = 151.9) (Figure 1). MG2 (COI), Leray (COI), and B2 (COI) had the lowest mean penalty scores for crustaceans and rotifers (Figure 1). 16SMOL and 16SMOL2 had the lowest penalty scores for the mollusk and platyhelminth orders (Figure 1). Uniorbcl always had lower MPS for plants compared to its original version, orbcl2, except for Caryophyllales (MPSUniorbcl = 134.5; MPSorbcl2 = 18.7) (Figure 1). Uni16S (16S) was expected to perform very well for cyanobacteria (MPS < 20) and both 18S primer pairs were expected to perform well on most eukaryotic phytoplankton orders, although overall UniPh (18S; 0 < MPS <91) was anticipated to perform best than its original version Uni18S (0 < MPS < 274.6) (Figure 1).
[IMAGE OMITTED. SEE PDF]
In vitro and in vivo evaluation: Mock communities and
Sequencing and filtering results
Our Illumina sequencing produced 17,945,331 reads total of which 16,715,207 passed Illumina filters with 12,014,198 assigned to the samples using the index combination (Table S7). We retained 2,096,942 reads after processing and annotating the sequences in barque (see ‘2.5’ and Appendix S1). After filtering out sequences based on taxonomic information, we observed between 0 and 381 reads in the negative controls (a total of 2644 reads), and obtained a total of 1,760,673 reads after the TCC filtering. Among the 8132 reads of Philander opossum (alien positive control), eight reads were mis-assigned with a maximum of two reads in a given sample (RFA = 0.02%). The false-assignment rate TFA ranged between 0 and 13 reads, and after filtering the results table included 1,760,212 reads. We obtained 1,750,845 reads after filtering out technical replicate inconsistencies, and after removing taxonomic redundancy and detections with less than three reads in the water samples, we obtained a final dataset of 1,538,507 reads for the mock communities (expected species: 1,100,086 reads; details by primer in Table S7), 58,391 reads for the aquarium tanks (expected species: 13,302 reads; details by primer in Table S7), and 136,618 for the eDNA samples from natural systems (details by primer in Table S7). Note that the Folmer primer pair was not processed because the gap between R1 and R2 was too large.
Mock communities
Twenty-two taxa (two phytoplankton, nine plants, one snake, one frog, five zooplankton, and four mollusks) were not recovered by any of the primer pairs in any of the mock communities (Table S3; Figures 2 and S2). Six taxa (three plants and three phytoplankton) were detected but discarded because of replicate- and negative control-based filtering (Table S3; Figures 2 and S2). Some taxa with challenging morphological identification were successfully identified at the species level based on their barcode (e.g., Moxostoma anisurum, Ceratophyllum echinatum, Gerris insperatus, and Sphaerium striatinum). Conversely, some taxa identified at the species level based on morphology were recovered only at the genus/family level based on their DNA barcode (e.g., Elodea sp., Potamogeton sp., and Typha sp.). Taxonomic identifications were mostly consistent within and across genes except for Pseudacris maculata in MC5 (identified as P. triseriata by all COI primer pairs but MG2), Elliptio complanata in MC1 and MC3 (identified as E. crassidens by MG2) and Branta canadensis hutchinsii in MC1 and MC2 (identified as B. hutchinsii by Leray) (Table S3, Figure 2). The overall percentage of expected taxa identified at the species level was influenced by the gene (Fdf = 5 = 23.671, p 0.001) but not the length of the target fragment (p = 0.340) nor the percentage of degenerate bases (p = 0.991). The percentage of taxonomic identification at the species level was higher for COI than for ribosomal 16S (p = 0.031), 18S (p = 0.018), and rbcl (p < 0.0001), and higher for mitochondrial 16S than for ribosomal 16S (p = 0.019), 18S (p = 0.011), and rbcl (p = 0.003).
[IMAGE OMITTED. SEE PDF]
Because of filtering, the COI Leray, B2, and fwhEPTD primer pairs had 10, 14, and 18 detections discarded, respectively (Table S3). Overall, MG2 recovered the highest number of taxa (N = 100) with 161 detections at the species level, four at the genus level and two at the family level, followed by Leray (90 taxa, 151 detections at the species level, and one at the genus level) and B2 (88 taxa, 148 detections at the species level, and two at the genus level) (Figure 2 and Table S3). For MC1, Leray recovered the highest number of invertebrates (7/8), MG2 the highest number of vertebrates (64/73), Uni16S the highest number of phytoplankton (1/1) and orbcl2 the highest number of plants (3/8). In MC2, MG2 recovered the highest number of taxa (64/73) while Leray and MG2 were equivalent in MC3 (7/8). In MC4, the best primer pairs were 16SMOL, 16SMOL2, and Uni16S for phytoplankton (1/1), and Uniorbcl and orbcl2 for plants (3/8). Finally, MG2 was the best in MC5 for invertebrates and vertebrates (9/18 and 18/19, respectively) while UniPh, orbcl2, and Uni16S were the best for phytoplankton (1/6) and Uniorbcl and orbcl2 for plants (7/18). We observed a significant effect of the number of reads (Estimate = 7.707e-05, p = 3.52e-09) and primer degeneracy (Estimate = 3.302, p = 0.034) on the proportion of detected taxa. We run the same analyses without the primer pairs that comprised less than 5% of the total reads (<55,004 reads) to account for potential outliers driving the signal without losing too much information (McMurdie & Holmes, 2014): only the effect of the number of reads (Estimate = 6.392e-05, p = 3.15e-06) remained significant.
Taxa included in more than one mock community showed similar detection success among them (i.e., same primer pairs detecting the taxa at the species level in the mock communities). For example, the fish Ambloplites rupestris was detected by the same eight primer pairs in both MC1 and MC2. Overall, 20 taxa (one mollusk, two arthropods, one phytoplankton, and 16 vertebrates) showed differences in detections among mock communities (i.e., differences in the number of primer pairs detecting the taxa, Figure S3) because of inconsistent technical replicates that were discarded in the filtering process (Table S3).
Aquarium tanks
The eDNA extraction for the first aquarium tank (BA1) failed so none of the species were recovered in this sample. However, the redbelly dace (Chrosomus eos) from BA1 was identified in the second tank system (BA2) where it is currently not present. We recovered some taxa that were part of the fish diet (e.g., Mysis, shrimp, rainbow smelt, poultry, and other protein meals), turtle/frog diet (e.g., cricket, worms) and otter diet (e.g., mackerel, hake) (Table S4), as well as microscopic taxa such as oomycota or acari mites that are common of the natural materials in terraria.
In the six other tanks, five species were detected but did not pass our conservative filtering steps (two positive replicates and/or threshold of three reads minimum): the rockbass (Ambloplites rupestris), the channel catfish (Ictalurus punctatus), pumpkin seed (Lepomis gibbosus), bluegill (Lepomis macrochirus), and spotted turtle (Clemmys guttata) (Table S4; Figure S2). Two fish species from BA2 were misidentified by our markers: the striped bass (Morone saxatilis) was identified as the white bass (M. chrysops) and the fallfish (Semotilus corporalis) was identified as the creek chub, S. atromaculatus (Table S4). Two species present in the largest tank system (BA2) were not detected by any primer pair despite being identified in our eDNA samples taken from natural systems: the spotfin shiner (Cyprinella spiloptera), and black crappie (Pomoxis nigromaculatus). Among the remaining detections, we identified 50% of the 30 species present, all to species level except longnose gar, Lepisosteus osseus, identified at the genus level with MiFish, and white sucker Catostomus commersonii, at the genus level with Leray (Table S4; Figure S2). We did not observe any effects of gene, target amplicon length, or degenerate bases on the percentage of taxa identified at the species level (p > 0.3). Neither the number of reads (p = 0.058), nor the other explanatory variables (p > 0.26) had significant effect on the proportion of detected taxa.
We obtained a total of 461 taxa from 34 phyla, 135 orders, 193 families, and 294 genera from the six eDNA samples collectively (Figure 3), including 22 taxa that were endangered, invasive, introduced, parasitic, pests, toxic, or water quality bioindicators (Table S8). Overall, the rbcl and 18S primer pairs identified a higher number of taxa but with poorer resolution: an average of 64.6% and 72.8% of taxa identified at the species level, respectively, versus 85%, 100%, and 92% at the species level for the COI, 12S, and mitochondrial 16S primer pairs, respectively (Figure 4). Note that we manually adjusted the bacteria identification at the genus level, so the Uni16S resolution percentage was underestimated, and it was therefore not included in the resolution analyses. The percentage of taxa identified at the species level was influenced by the gene (Fdf = 4 = 7.797, p 0.001), with a higher percentage for COI than 18S (p = 0.007) and rbcl (p < 0.002). We observed a marginally non-significant effect of the length of the target fragment (p > 0.053) and no effect of the percentage of degenerate bases (p > 0.707).
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
The highest number of vertebrates was recovered using COIFishdegen (N = 30), invertebrates using fwhEPTD (N = 67), phytoplankton and protists using UniPh (N = 71), plants using Uniorbcl (N = 31), and bacteria using Uni16S (N = 36) (Figure 4). The number of taxa was positively impacted by the number of reads (Fdf = 1 = 63.594, p 0.001), the target fragment length (Fdf = 1 = 8.150, p = 0.005), and the percentage of degenerate bases (Fdf = 1 = 8.114, p = 0.005). We found the same results when running the same analyses without the primer pairs accounting for less than 1.5% of the total reads (<2000 reads; 16SMOL2 = 440 reads, MiFish = 1714 reads, B2 = 1976 reads).
Decision tool for selecting primer pairs
We created a free, open-source online tool—SNIPe—using JavaScript that allows users to explore eDNA data for multiple primer pairs and select combinations that best serve the goals of their study (). We illustrate the user interface and workflow for SNIPe in Figure 5. Applied to our eDNA dataset, analyses using SNIPe indicated that 13 primer pairs would be required to recover 100% of the taxa. Using our best single primer pair (UniPh), we recovered 25.6% of all taxa. Thereafter, combining our two best primer pairs (UniPh + Uniorbcl), we recovered 46.9% of all taxa; for three (UniPh + Uniorbcl + fwhEPTD), 61.8%; for four, 74.8% (UniPh + Uniorbcl + fwhEPTD + Uni16S); and for five (UniPh + Uniorbcl + fwhEPTD + Uni16S + 16SMOL), 82%. SNIPe also distinguished on- and off-target taxa. For example, while increasing the number of primer pairs slightly increased the number of fish (from 23 taxa with COIFishdegen to 33 taxa with five pairs), it also dramatically increased the number of off-target species: nine off-target species (invertebrates) with one primer pair, 39 (invertebrates) with three primer pairs, and 140 (invertebrates, microscopic animals, phytoplankton and protist, fungi) with five primer pairs (Figure S4).
[IMAGE OMITTED. SEE PDF]
DISCUSSION
Describing freshwater communities
Our study highlights the potential for eDNA-based methods to quantify aquatic biodiversity overall, monitor rare and cryptic species, invasive species, and assess trophic interactions (Beng & Corlett, 2020; Bohmann et al., 2014). We were able to detect hundreds of species across the tree of life, including several species relevant to conservation, human health and agriculture, model species used in toxicity testing (e.g., Diptera, Annelida), species-at-risk (e.g., the paper pondshell, Utterbackia imbecillis, a species-at-risk mollusk in Canada), pathogens (oomycete, toxin-producing algae), parasites (e.g., parasitic fish, chytrid), pests (e.g., lepidopteran spongy moth, dipteran spotted-wing Drosophila), and invasive species (e.g., fish: round and western tubenose gobies, sea lamprey; mollusks: zebra and quagga mussels; plant: purple loosestrife). The use of eDNA for cost-effective and non-intrusive monitoring and management of invasive species and species-at-risk has seen increasing attention and attempts at standardization in recent years (Abbott et al., 2021; Fediajevaite et al., 2021). The concomitant detection of invasive/native species can help to delineate the invasion front of an invasive species (Keller et al., 2022) and capture shifting dynamics of river ecosystems (Greenhalgh et al., 2022), for example, the increasing detection in time and space of dreissenid mussels to the detriment of local native species (Ricciardi et al., 1996).
Our study also showcases the power of aquatic eDNA-based metabarcoding to reconstruct freshwater communities in connected systems. For example, we were able to detect species from different aquaria by collecting water samples from the shared water system and detecting taxa that comprised the diet of some captive species. Some unexpected detections could be attributed to incorrect species identification by our primers (e.g., stripe bass, fall fish) not able to properly distinguish some closely related taxa, but the detection of the redbelly dace in the BA2 tank system was surprising because the species was present in this system for less than one year in 2017. As DNA detectability in the water column is limited to a few weeks (Dejean et al., 2011) and the water at this facility is treated, the source of this unexpected detection is likely a contamination from the dive tank (BA1) during sampling.
Although we cannot rule out the possibility of contamination from the mock communities or DNA in the lab despite our meticulous and conservative filtering, our results from natural freshwater water bodies would be consistent with eDNA transport by water flow from upstream lakes into the marshes and ponds (Deiner & Altermatt, 2014; Laporte et al., 2020; Wacker et al., 2019). For example, we detected the yellow perch (Perca flavescens) within a shallow palustrine marsh that cannot support this species, but the marsh does connect to a small freshwater lake upstream where the species is known to occur. While such aquatic transport of eDNA may be considered an issue for deducing the precise location of a detected species or local species richness, it can also be used to make inferences on species distributions at a larger spatial scale. For example, Deiner et al. (2016) showed that eDNA data from sampling stream networks can provide insights on biodiversity distribution across hierarchical spatial scales, including both terrestrial and aquatic species. Many species (e.g., largemouth bass Micropterus salmoides, non-biting midge, Ablabesmyia americana, eastern elliptio Elliptio complanata, water stargrass Heteranthera dubia) detected with eDNA metabarcoding were known to occur at the sampling sites based on land-based sampling (BIObus Canadian National Parks Malaise Program) or traditional aquatic surveys (Queen's University Biological Station, Thousand Islands National Park, and the River Institute). Interestingly, some expected and rather abundant species, such as the American leech (Macrobedella decora) or broadleaf arrowhead (Sagittaria latifolia) were not recovered in some eDNA samples. The false negative detections could be due to several factors: insufficient sampling effort (Guillera-Arroita et al., 2017), a competition between DNA templates of other taxa during PCR (Kelly et al., 2019; Marinchel et al., 2023), and/or our conservative filtering approach that discarded true detections (Alberdi et al., 2018; Mathon et al., 2021). Thus, we recommend having taxonomists and ecologists check the lists of species obtained from eDNA data, as this can reveal both potential false positives and false negatives, and guide interpretation of the results. A synergistic approach combining traditional surveying techniques and Indigenous knowledge with eDNA can validate otherwise weakly substantiated eDNA datasets (Lopez et al., 2023).
Validation and selection of the best primer pair combination(s)
Our study highlighted disparities in the completeness of public databases of reference sequences across the tree of life and among genes. Despite duplication between databases (Meiklejohn et al., 2019), our analysis also showed differences in species coverage between databases for the same gene: the BOLD database encompassed more species than GenBank for COI, probably due to the dramatic expansion in geographic and taxonomic coverage in BOLD by the international Barcode of Life consortium (BIOSCAN program, Hobern, 2021). While we recovered most animal and plant species, we observed a lack of reference sequences for phytoplankton in GenBank (~40% of the species included in our study did not have sequences for any of the tested genes) probably because of challenging morphological identification of phytoplankton to the species level (McManus & Katz, 2009; Tzafesta et al., 2022). Similar to other studies, we observed differences between genes (e.g., COI covered more mammal species than 18S in GenBank) (Marques et al., 2021) and metazoan species that lacked COI reference sequences also lacked sequences available for any other marker. This supports the preference for COI primers in animal metabarcoding studies, yet COI is not the “gold standard” for all animal taxonomic groups. For example, most fish metabarcoding studies use primer pairs for a portion of the 12S gene (e.g., MiFish, Miya et al., 2020; teleo, Valentini et al., 2016), partly because their primer binding regions are less variable than the COI ones and thus co-amplify fewer non-fish species (Macher et al., 2023; Stat et al., 2017). There is then a key need for completing 12S general and fish-specific databases such as MitoFish (Collins et al., 2019; Iwasaki et al., 2013; Macher et al., 2023; Sato et al., 2018).
Reference sequences are critical for evaluating primer pairs on a large scale as we have attempted, especially as they are used in silico to predict amplification success by calculating a score based on mismatches between primer and template sequences (Elbrecht & Leese, 2016; Ficetola et al., 2010). Such analysis can include consideration of the number of mismatches, their positions on the sequence and the adjacency of mismatches. However, in silico results do not always match in vitro results (e.g., Tournayre et al., 2020). First, the sequences contained within reference databases are produced using different approaches (e.g., Sanger sequencing, whole genome shotgun sequencing), and different primers pairs may target different parts of a focal gene. This can result in heterogeneity in the availability and quality of sequences, and in species recovery among primer pairs for the same gene. Second, PCR conditions can greatly influence the performance of primers (e.g., annealing temperature, number of cycles) (Krehenwinkel et al., 2017), and in silico analyses cannot predict PCR bias (e.g., DNA competition, amplification stochasticity). Intraspecific variation, particularly regional variation, can also limit in silico primer testing (Estensmo et al., 2021). In our study, in silico and mock community results were very similar for metazoans: overall the COI mini-barcode MG2 (Tournayre et al., 2020) was the best for invertebrates and vertebrates, closely followed by Leray (Leray et al., 2013) and B2 (Elbrecht & Leese, 2017). Contrary to expectations from the in silico analyses, the 16S mollusk primers (16SMOL, Klymus, Marshall, & Stepien, 2017; and 16SMOL2, this study) did not recover more mollusks than the other primer pairs, potentially because we did not use blocking primers that would have favored mollusk amplification by impeding the amplification of co-occurring non-mollusk taxa such as fish (Klymus, Marshall, & Stepien, 2017). For example, Rojahn et al. (2021) both increased the detection success of rare taxa and decreased sampling effort by using blocking primers for highly abundant species. Such issues are especially likely for 16SMOL2 because these primers have a high proportion of degenerate bases within primers (~26%) that probably increase the amplification of off-target species (Linhart & Shamir, 2002). Surprisingly, Uniorbcl, our degenerate version of the orbcl2 plant primer (Coghlan et al., 2021), and UniPh, our degenerate version of the Uni18S primer pair (Zhan et al., 2013) did not perform better than their original version in mock communities. However, they did appear to perform best in the eDNA samples from our sampled water bodies, so the limited taxonomic range of the mock communities may have limited the potential of these primer pairs. Such disparities in interpretation emphasize the usefulness of combining controlled experiments (in silico, mock communities, aquarium) and field surveys to assess the benefits and the limitations of primer pairs before conducting full-scale metabarcoding studies. Moreover, the proportion of off-target taxa can now be easily evaluated using SNIPe to inform primer selection. The amplification of off-target taxa is an important consideration as it can monopolize a significant proportion of the sequencing depth and therefore increase the possibility of missing the rare target species (Leese et al., 2021). We showed that combining primer pairs yielded a better recovery of target species but also significantly increased the number of off-target taxa (Figure S4). Off-target analysis in SNIPe can also help to select among primer pairs with equivalent on-target detection success. For example, UniPh, fwEPTD, and Leray all recovered three mollusk taxa, but they recovered contrasting numbers of off-target taxa: 115, 68, and 42, respectively. Similar to the species-specific assay validation scale proposed by Thalinger et al. (2021), metabarcoding primer evaluation would benefit from a validation schema. We provide a four-level metabarcoding validation system in Figure 6, with incomplete (in silico), partial (in silico and controlled environment), essential (in silico, controlled environment, natural system), and substantial (in silico, controlled environment, natural system, modeling) validation levels.
[IMAGE OMITTED. SEE PDF]
A challenge when designing an eDNA study is selecting combinations of primer pairs which best serve the objectives of the study. Our online tool, SNIPe, will help researchers focus on a particular set of taxa relevant to their study (e.g., predator and prey, competing native and invasive species, or species at different trophic levels), and to optimize which, and how many, primer pairs they should or can use to obtain occurrence data to best address their research questions. We provide a default dataset comprising 14 primer pairs and the suites of species detected for each. SNIPe can also accommodate user-supplied datasets from work in their own labs, or data compiled from literature, allowing for powerful cross-study primer selection. All dataset processing and computation is done on the users’ own computer inside the web browser, meaning no user-uploaded data are transmitted to nor retained on the SNIPe web server, and that researchers can use this for evaluating data with data sovereignty issues (e.g., from Indigenous lands) or public sensitivities (e.g., pathogen or invasive species distributions).
Maximizing prospects for successful metabarcoding
Filtering steps and their stringency can greatly influence detection success. For example, we showed that our conservative filtering approach (“true detection” = two positive PCR replicates, three or more reads) decreased the risk of false positives at the cost of increasing false negatives. PCR replicates are important to account for PCR and sequencing stochasticity and primer efficiency, especially in eDNA samples which have low and heterogeneous DNA concentrations (Alberdi et al., 2018; Ficetola et al., 2015). However, because rare taxa are recovered stochastically, they are often unique to one PCR replicate and would be considered as a false positive (Shirazi et al., 2021). Increasing sequencing depth is important to recover more taxa, but it also concomitantly increases the dissimilarity among PCR replicates by increasing the number of sequences originating from contamination and chimeras (Alberdi et al., 2018; Deagle et al., 2014). We could have pooled replicates before indexing, but this is not necessarily more efficient for assessing diversity and community structure (Smith & Peay, 2014); for example, while pooling reduces the risk of missing taxa, it also artificially inflates the possibility of false positives due to accumulation of artifactual sequences (Alberdi et al., 2018). Thus, we emphasize the need for clarity and transparency in the methods and results of metabarcoding studies (i.e., detailed sequence processing and filtering steps and outputs) for relevant interpretations of the data.
We used a single sequencing run for all sample types and primer pairs. The choice of the sequencing platform, reagent kit, and number of sequencing runs is informed by the anticipated number of detected taxa. However, it can be challenging to estimate such numbers in advance, especially when targeting a broad range of taxa and combining many primer pairs. For example, we had anticipated a maximum of 500 taxa and 10 million reads (meaning over 1000 reads per taxon and primer pair). We obtained a heterogeneous number of reads among primer pairs, with relatively shallow sequencing depth limiting the number of detected taxa in the mock communities and natural samples (Alberdi et al., 2018; Shirazi et al., 2021). For example, the three primer pairs with the lowest number of reads in the natural samples detected the lowest number of taxa: 16SMOL2 (440 reads: 14 taxa), MiFish (1714: 17 taxa), and B2 (1976: 21 taxa). However, MiFish had twice as many reads as COIFish and COIFishdegen in the aquaria samples and still recovered fewer species, implying that lack of sequencing depth alone does not explain differences in performance among primer pairs.
Primer amplification success was influenced only by primer degeneracy in the mock communities (where DNA was not degraded), and by both primer degeneracy and the length of the target fragment in the eDNA samples. Primer degeneracy is useful to minimize mismatches between primers and template sequences (Elbrecht & Leese, 2017). The length of the target fragment is also crucial because long fragments are rarer than short ones in water, and eDNA will likely break into smaller fragments over time (Bylemans et al., 2018). In sum, our results support using degenerate primers and short amplicons if the goal is to amplify as many species as possible while minimizing the number of eDNA metabarcoding primer pairs. Regardless of target length and primer degeneracy, the choice of gene is pivotal. As in other studies (Deagle et al., 2014; Stat et al., 2017), our results show that nuclear markers provided a broad taxonomic coverage (e.g., 18S rRNA) but less resolution than mitochondrial ones (e.g., COI, 12S). We also reported heterogeneity in reference sequence databases among genes, which can lead to incorrect identification with seemingly high confidence. For example, 16S rRNA primers identified the wrong Ranatra species with high confidence (>98% identity) because Ranatra fusca is absent from the database. The lack of reference sequences will likely have a greater impact on taxonomic resolution for short compared with long amplicons as the chance of getting a species identification with high confidence (high %identity and %coverage) for the wrong taxa would be higher for shorter amplicons (Tournayre et al., 2020). Some molecular-morphological discrepancies may have originated from erroneous identifications based on morphology. Indeed, some plants in our mock communities were not checked by botanical experts and could have been misidentified in the field. For example, the species we had morphologically identified as Juncus effusus (MC4 and MC5) was molecularly detected with a strong signal as Eleocharis sp. instead (Uni18S: 8839 reads, UniPh: 10228 reads, orbcl2: 8313 reads, and Uniorbcl: 6508 reads); these taxa are phenotypically similar and could be confused.
While some issues may pertain to human error in reference databases (Cheng et al., 2023; Galan et al., 2018), others may be caused by molecular misidentification resulting from complex phylogenetic relationships, changing systematics, or taxonomic synonyms, inability to distinguish among species that originated recently (incomplete lineage sorting; e.g., Lopez-Vaamonde et al., 2021), or cyto-nuclear discordance caused by introgression or mitochondrial capture (e.g., Thielsch et al., 2017). For example, Nabholz (2024) found that the inability to distinguish among four Chorthippus grasshoppers using molecular markers was the result of recent divergence and incomplete lineage sorting. As an example of mito-nuclear discordance, the taxonomy of the western chorus frog (Pseudacris triseriata) in Ontario, Quebec, and adjacent New York is unresolved because of apparent mito-nuclear discordance and the presence of two deeply-divergent mitotypes (Pseudacris triseriata/maculata) with a single nuclear background (Lougheed et al., 2020). While these factors dictate that we exhibit care in interpreting metabarcoding data, this does not diminish its potential for quantifying significant swaths of aquatic biodiversity.
CONCLUSION
Our workflow combining in silico, mock communities, aquarium, and eDNA samples from natural systems, showed that multi-marker DNA metabarcoding is promising for evaluating freshwater biotic communities. Our results revealed that even a low number of combined primer pairs can identify a suite of taxa from primary producers through herbivores to predators, native and invasive species, and rare and common species. Our online tool SNIPe will help researchers make critical decisions on the choice of primers (target taxa, gene, length of the amplicon, degeneracy) that reflects the goals of the study and the trade-off species coverage-cost. Ultimately, our study represents a step forward in overcoming the technical challenges associated with the holistic monitoring of freshwater communities.
AUTHOR CONTRIBUTIONS
Design of the study: OT, SCL. Field work: OT, HT, MJSW. Lab work: OT, HT, ZS. Bioinformatics: OT, DRL. Analysis: OT. Interpretation of the data: OT, SCL, MJSW, SL, JC. Drafting of the manuscript: OT, SCL. Editing of the manuscript: all authors. Funding acquisition: SCL, BFC, SEA, YW, JR.
ACKNOWLEDGMENTS
We thank Burton Lim, Maureen Zubowski, Amy Lathrop, and Santiago Claramunt (Royal Ontario Museum), and Barb Vanderbeld (Queen's University) for providing samples for the mock communities. We are grateful to Jenna Shannon and Lena Campos for helping with DNA extraction. This work was funded by a SSHRC New Frontiers in Research Fund—Exploration grant (NFRFE-2019-00892), an Alliance Society grant (ALLRP 580309-22), a South Frontenac grant, and the Baillie Family Chair in Conservation Biology.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
The MiSeq raw data (R1 and R2.fastq.gz files), indexing sheet and datasets post-barque pipeline (one per primer pair) were archived in the Figshare public repository: .
Abbott, C., Coulson, M., Gagné, N., Lacoursière, A., Bajno, R., Dietrich, C., & May‐McNally, S. (2021). Guidance on the use of targeted environmental DNA (eDNA) analysis for the Management of Aquatic Invasive Species and Species at risk. DFO Can. Sci. Advis. Sec. Res. Doc. 2021/019. Iv +42p.
Abell, R., Thieme, M. L., Revenga, C., Bryer, M., Kottelat, M., Bogutskaya, N., Coad, B., Mandrak, N., Balderas, S. C., Bussing, W., Stiassny, M. L. J., Skelton, P., Allen, G. R., Unmack, P., Naseka, A., Ng, R., Sindorf, N., Robertson, J., Armijo, E., … Petry, P. (2008). Freshwater ecoregions of the world: A new map of biogeographic units for freshwater biodiversity conservation. Bioscience, 58(5), 403–414. [DOI: https://dx.doi.org/10.1641/B580507]
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2024. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Freshwater ecosystems are complex, diverse, and face multiple imminent threats that have led to changes in both structure and function. It is urgent that we develop and standardize monitoring tools that allow for rapid and comprehensive assessment of freshwater communities to understand their changing dynamics and inform conservation. Environmental DNA surveys offer a means to inventory and monitor aquatic diversity, yet most studies focus on one or a few taxonomic groups because of technical challenges. In this study, we (1) create an eDNA metabarcoding dataset (natural water bodies) with 14 validated primer pairs; (2) create a free online, user‐friendly tool for primer selection that can be used for any metabarcoding data (SNIPe); and (3) using SNIPe, explore our dataset to derive subsets of informative, cost‐effective primer pairs that maximize detection of freshwater diversity. We first evaluated the completeness of public reference sequence databases and the efficiency of 14 primer pairs in silico, in vitro on five mock communities (mix of DNA from tissues of select taxa), in vivo on water samples from aquarium samples with known taxonomic composition, and finally in vivo on water samples from freshwater systems in Eastern Canada. Results from analyses using SNIPe revealed that 13 or 14 primer pairs are necessary to recover 100% of the species in water samples (natural systems), but that four primer pairs are sufficient to recover almost 75% of taxa with little overlap. Our work highlights the power of eDNA metabarcoding for reconstructing freshwater communities, including prey, parasite, pathogen, invasive, and declining species. It also emphasizes the importance of marker choice on species resolution, and primer characteristics and filtering parameters on detection success and accuracy of biodiversity estimates. Together, these results highlight the usefulness of eDNA for freshwater monitoring and should prompt more studies of tools to survey all communities.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details



1 Department of Biology, Queen's University, Kingston, Ontario, Canada, Department of Biology, York University, Toronto, Ontario, Canada
2 Department of Biology, Queen's University, Kingston, Ontario, Canada
3 Canadian Centre for Computational Genomics, McGill University, Montreal, Quebec, Canada
4 River Institute, Cornwall, Ontario, Canada
5 Parks Canada, Mallorytown, Ontario, Canada
6 Brockville Aquatarium, Brockville, Ontario, Canada