Ecological surveying and biodiversity assessment have been a primary goal for ecologists and conservationists aiming to describe and interpret the state and change of biodiversity in an ecosystem. Traditional freshwater monitoring methods developed for target taxa (e.g., fish, macroinvertebrates, macrophytes, and algae) are costly, both in terms of time and financially. However, in the last 10 years, there have been significant developments in biodiversity assessment with molecular tools. This revolution in molecular biodiversity assessment has been termed a “game changer” (Lawson Handley, 2015), where High-Throughput Sequencing (HTS) platforms are used to generate millions of DNA sequences from bulk tissue or environmental samples in a process known as metabarcoding.
Environmental DNA (eDNA) metabarcoding is now commonly used for biodiversity assessment in aquatic environments (e.g., Blackman et al., 2022; Brown et al., 2016; Deiner et al., 2016; Hänfling et al., 2016; Mächler et al., 2019; Pawlowski et al., 2018). DNA extracted from a water sample (Keck et al., 2022; Pawlowski et al., 2020; Taberlet et al., 2012) is amplified using PCR with relatively universal primers. This allows whole taxonomic groups to be targeted and sequenced simultaneously, without the need for physical collection of specimens. Large datasets of biodiversity information for a targeted taxonomic group from water samples are generated which can uncover biodiversity structure of whole lake or riverine systems (e.g., Altermatt et al., 2020; Hänfling et al., 2016). These data, in the form of sequence reads, are processed, quality checked, filtered, merged, error corrected, chimera cleaned and the resulting zero-radius Operational Taxonomic Units (zOTUs) or clustered Operational Taxonomic Units (OTUs) are then assigned to taxonomic names based on reference sequences (Antich et al., 2021; Brandt et al., 2021; Deiner et al., 2017; Mathon et al., 2021). Although there are several instances where taxonomy-free approaches may suffice (see Mächler et al., 2021), it is particularly important to assign a species, genus or family name for conservation purposes, such as the identification of invasive or endangered species. Ultimately, this linkage of sequence to taxonomic name allows researchers and practitioners to gain unprecedented information into biodiversity richness (Leray et al., 2022) and the ability to monitor ecosystems with a new perspective and minimal impact. However, as with any new method or technological advancement, the approaches that individual studies have used hitherto vary significantly. This is true for bioinformatic processing steps, with no clear consensus as to the best approach (e.g., Antich et al., 2021; Brandt et al., 2021; Mathon et al., 2021), although recent comparisons suggest some programs offer advantages over others in terms of processing time (see Mathon et al., 2021). Recently, several pipelines have been developed specifically to handle eDNA data and tackle the issue of processing raw reads to taxonomic assignment in a consistent way, which go some way in making data processing standardized (e.g., Anacapa (Curd et al., 2019); SLIM (Dufresne et al., 2019); PEMA (Zafeiropoulos et al., 2020); or eDNAFlow (Mousavi-Derazmahalleh et al., 2021)).
Although standardized data processing may be desirable, it also has its drawbacks. Processing must be adapted to the data structure and data losses during data processing must be explainable (see also Keck & Altermatt, 2022). Especially, in eDNA projects, removal or loss of data can also mean a loss of information. Furthermore, classic approaches to processing sequence reads, such as a 97% similarity cluster, may underestimate diversity, while the zOTU approach often overestimates diversity (Brandt et al., 2021). Whichever approach is used, they all ultimately rely on the accuracy and completeness of a reference database for subsequent taxonomic assignment (Dugal et al., 2022; Jackman et al., 2021). Reference databases therefore should be a major consideration of any metabarcoding study as they underpin the successful and accurate taxonomic assignments (de Santana et al., 2021; Rodríguez-Ezpeleta et al., 2021). Ideally, these databases consist of verified barcodes, including taxa known to occur in the study area (Taberlet et al., 2007). Unfortunately, the perfect reference database does not exist (yet), and databases are scarce for large organismal groups. Often, studies are forced to use incomplete reference databases, which sometimes have inaccurate or mis-labelled sequences (Somervuo et al., 2017) that can ultimately prohibit detection or assignment of species level identification to a cluster or zOTUs (Blackman et al., 2021; Li et al., 2022; Weigand et al., 2019).
In most studies, authors build or select databases they see as particularly suited to their eDNA data, and the choice of database often has a strong pragmatic component as well, especially with respect to the use of openly accessible databases. Commonly, databases are selected based on their size (i.e., number of taxa or sequences included), implying that a larger number of sequences means a higher chance of more or better taxonomic assignment. Consequently, metabarcoding studies source their reference sequences from publicly available sequence repositories, such as GenBank or BOLD. However, these publicly available reference databases (primary or secondary) are not always the best solution (Pentinsaari et al., 2020). Sequences submitted to public (open access) sequence databases, such as GenBank, do not require a validation step after submission (Kozlov et al., 2016), such that taxonomically incorrect or incomplete entries occur. Also, they are biased toward well-studied taxa (Meiklejohn et al., 2019; Porter & Hajibabaei, 2018; Schroeter et al., 2020), which means their suitability may not always be appropriate to detect rare or understudied taxa. It is also not advisable to merge sequences from several databases because duplications and deviating nomenclatures could counteract the increase in sequence diversity. Perhaps, the most widely used database of open access sequences is GenBank (Benson et al., 2013). Until recently, it was widely thought to have a high proportion of errors, yet analysis suggests that—although discrepancies are present—when examining taxonomic assignment above species level these errors are less than previously thought (Leray et al., 2019; Locatelli et al., 2020), yet geographic biases still exist (Li et al., 2022). These findings are encouraging; but similarly other studies have conversely highlighted issues when examining species level annotation, such as Conte-Grande and colleagues (2017) who found 16.3% of all snakehead (Channidae) sequences (COI) on GenBank were in fact incorrect. Stringent curation steps are therefore vital when using sequences from large databases to ensure against incorrectly identified or poor-quality sequences which will lead to incorrect assignments, particularly if the aim is species level identification (Locatelli et al., 2020).
A solution to the above shortcoming is using a target or group specific reference databases where sequences have been verified. These reference databases, however, may also have varying degrees of coverage in terms of species' representation. Weigand et al. (2019) carried out a gap-analysis of the sequences available for European bioindicator groups (fish, macroinvertebrates, diatoms and plants). Of all groups, fish are generally best represented in BOLD/Genbank, and MitoFish, respectively, yet with both substantial geographic and taxonomic gaps remaining (Froese & Pauly, 2021; Iwasaki et al., 2013; Marques et al., 2021; Polanco et al., 2021). It is therefore not surprising that fish eDNA studies often include additional specimens collected within their study area to increase their reference database and fully exploit the benefits of metabarcoding (Cilleros et al., 2019; Schenekar et al., 2020). This was well shown by Schenekar et al. (2020), who's reanalysis of their metabarcoding data with a larger study specific reference database not only revealed further biodiversity but also changed some taxonomic assignment from one species to another (Schenekar et al., 2020). The change in taxonomic assignment is of particular importance as it also demonstrates that using limited sequence diversity and/or incomplete species-level taxonomic investigation, below a certain dissimilarity threshold, false positives can be generated due to “forcing” assignments by sequence similarities during taxonomic assignment processing. Similarly, relatively closely related species (i.e., in the first one to two million years after speciation) commonly found in diverse systems can usually not be distinguished by single stretches of sequence variation (e.g., Jackman et al., 2021), such as short barcodes. If not fully represented in the reference databases, groups of closely related species in the data will erroneously be assigned to a single species, potentially leading to wrong estimates of species diversity.
To explore the effect of reference databases on taxonomic assignment we used (i) publicly available (open) databases and (ii) a curated study specific (closed) reference database to compare the taxonomic assignment of freshwater fish from eDNA samples taken across Switzerland. The former are generally publicly accessible, and thus widely used, while the latter is a “close to perfect” version acting as a gold-standard, also to evaluate the formers' potential or limits. We use two publicly accessible reference databases as examples of easily accessible open references, namely MIDORI (Leray et al., 2018) and MitoFish (Iwasaki et al., 2013). We then supplemented the later with important consensus sequences, which we will refer to as MitoFish+. The closed or custom-built reference database used in this study is made up of sequences from tissue samples collected from two very broad quantitative fish biodiversity surveys (Alexander & Seehausen, 2021) across Switzerland. These surveys contain extensive collections and sequences of multiple specimens from nearly all fish species in Switzerland and included multiple geographical populations of each. As such, it represents an example of what would be considered by many as the ideal reference database for taxonomic assignment.
METHODS Sample collectionEnvironmental DNA samples were collected from 92 river sites in 2019 as part of a nationwide fish monitoring campaign in Switzerland (see Brantschen et al., 2021, Figure S1 and Table S1 for further details). For each of the sites, a total of 2 L of water was filtered using 4 Sterivex filters with a 0.22 μm pore size (Merck Millipore, Merck KgaA, Darmstadt, Germany). Filters were sealed with luer fitting and placed in a cool box for transporting to the laboratory where samples were stored at −20°C until further processing. Field negative controls, consisting of 2 L of ddH2O, were filtered and stored in the same manner as the samples.
DNA extractions from filters were performed in a clean room environment at Eawag, Switzerland (Deiner et al., 2015). The DNA was extracted using the QiAgen PowerWater Sterivex Extraction Kit (Qiagen, Hilden, Germany). Filters from different sites were extracted in random batches including field and filter control that were treated equally to the samples. Extractions were performed as described by the manufacturer protocol. DNA was eluted into 100 μl of elution buffer and stored until further processing at –20°C. A two-step library preparation method was used targeting the hypervariable region of 12S rRNA gene which amplicon ranges from 163 to 185 bp using MiFish-U-F/R primer pair (forward primer sequence: 5′-GTCGGTAAAACTCGTGCCAGC-3′ and reverse primer sequence: 5′-CATAGTGGGGTATCTAATCCCAGTTTG-3′) (Miya et al., 2015) hereafter known as the MiFish primer pair. These primers were modified to include the Nextera transposase sequences. Negative controls (field and PCRs), positive controls (PCP, containing a tissue DNA extract from Atlantic Cod, Gadus morhua, see Table S3), and samples were randomized over each 96-well PCR plates.
The first PCR reaction contained 0.5 μM each primer, 0.4 mg/ml BSA, 12.5 μl Q5 High Fidelity 2× Master Mix (New England Biolabs), and 2 μl template DNA. The PCR profiles were as follows: initial denaturation of 98°C for 5 min, followed by 35 cycles of 98°C for 10 s, 65°C for 20 s, and 72°C for 30 s, and a final extension step of 72°C for 7 min. The first PCR was carried out in triplicate and samples were pooled and cleaned using SPRI beads (Applied Biological Materials Inc.) prior to the second PCR. Second PCRs were carried out using 15 μl of cleaned PCR product and the Nextera XT Index Kit v2 (Illumina), following the subsequent PCR profile: initial denaturation 95°C for 3 min followed by 10 cycles of 95°C for 30 s, 58°C for 30 s, and 72°C for 30 s, and a final extension step of 72°C for 5 min. Negative controls and the addition of positive controls were processed in parallel with all samples. All PCR products were visualised using QiAxcel Advanced System by using a High-Resolution Cartridge (Qiagen, Hilden, Germany) and cleaned once more using SPRI beads. Samples were then quantified using the Spark 10M Multimode Microplate Reader (Tecan Group Ltd.) using the Qubit dsDNA BR assay (Thermo Fisher) and pooled equimolar. The libraries were loaded at 17.6 pM concentration, with 10% PhiX control. A paired-end 600 cycle (2 × 300 nt) sequencing was performed on an Illumina MiSeq (MiSeq Reagent Kit v3) following the manufacturer's run protocols (Illumina).
BioinformaticsTo process the raw data, we used a standardized but parameter-optimized workflow (Figure 1). In short, the raw reads are first filtered to remove PhiX related (an internal standard), and low complexity reads. In a next step, the low-quality 3′-end are trimmed to improve read merging. We used an in-silico PCR approach to remove the primer site from the merged reads (amplicons). Subsequently, the amplicons were subjected to a quality and a size-range filter. The cleaned amplicons were de-replicated prior to clustering with Usearch::UNOISE (Edgar, 2016a). UNOISE3 includes error correction, zero-radius clustering and chimera removal (Figure S2). We used an abundance threshold of 10 to remove artificially created and therefore untrustworthy singletons and rare zOTUs.
FIGURE 1. Data preparation workflow. The flow diagram shows the bioinformatic steps from the classic approach (raw read output from the MiSeq) and OSU approach (reference sequences) to the taxonomic assignment.
We used two different methods to generate count tables with taxonomic associations. The first method is the classic approach of using a sequence reference to annotate the zOTUs. For this method, we used Usearch::SINTAX (Edgar, 2016b). SINTAX is a fast and reliable tool in combination with the right reference database and appropriate confidence cut-off. We used two publicly available 12S ribosomal RNA databases (i.e., MitoFish and MIDORI). MitoFish is a fish-specific mitochondrial genome reference database, while MIDORI has a broader species range including Eukaryota. Both references databases are well maintained sequence collections and are extremely helpful for taxonomic annotation. Nevertheless, both also have limits in terms of species diversity, accuracy and completeness, depending on the research question. We wanted to increase the annotation range of the MitoFish reference database by adding more nonfish-related sequences. The primary purpose of the reference extension was to better understand the high numbers of zOTUs with missing taxonomic assignments. For this reason, we blasted (blastn) all the badly annotated zOTUs against the NCBI nucleotide database (Altschul et al., 1990). The Blast hits were bit-score (>200) and identity (>80%) filtered and only fully annotated sequences were used. By expanding the diversity of the existing MitoFish reference, we could characterize nontarget (i.e., co-amplified) species.
Reference databases summaryMitoFish (v362) (
MitoFish+ is an extended version of the MitoFish database, containing additional 372 nonfish-related sequences from NCBI, including bacteria.
MIDORI 12S (GB240) (
As an alternative to the classic approach from above, we explored the use of a project-specific custom build 12S fish reference database. To do this, we barcoded tissue samples of fish species collected within the same geographical range of our surveys (See Closed reference database below for more details). The full sequenced fragment was 736 bp long and then trimmed to the MiFish fragment of 192 bp long. We de-replicate the reference sequences (n = 912) to get unique sequences (n = 107, 11.9%), which we call Operational Sequence Units (OSUs) (Figure S3). Sequences generated for the closed reference database have been deposited in BOLD and GenBank (See Data availability statement). We supplemented these sequences with 12S reference for the following sequences: a consensus sequence for Gadus chalcogrammus and Gadus morhua from the MitoFish reference, which is used in as our positive control and not found within the geographic range of our surveys. Then, we included, based on the NCBI Blast results of zOTUs, nonfish sequences to act as outgroups, namely: Homo sapiens, Sus scrofa, Sturnus vulgaris, Rupicapra rupicapra, Ichthyosaura alpestris, Bos taurus and Turdus merula. We therefore have a total of 116 OSUs to map the reads to (Table 1). For each OSU label, there are three elements to demonstrate the support for that taxonomic assignment derived from the reference database (see Table 1). First, the taxonomic assignment is derived from the consensus of the sequences which have clustered to form that OSU. If all sequences are from the same species, the species name is used for the taxonomic assignment, if the sequences differ, in that there are several different species from the same genus, the genus name is used for taxonomic assignment. Second, the remaining elements of the name are derived from the number of repeating clusters with that name within the dataset, for example: 1 of 2 and 2 of 2 would indicate that there are two OSUs with the same taxonomic assignment in the dataset. Thirdly, the number of sequences used to form the cluster for example: n8 would include 8 sequences. Once the OSUs were formed, reads were mapped to the OSUs using Usearch:: cluster_fast with two different parameters: sequence identity (ID) and query coverage (QC). To test mapping performance, different parameters were tested to demonstrate the variation in reads mapped caused by these parameters (Table 2).
TABLE 1 Closed reference database fish OSUs
OSU name | Genus or species level | Number of haplotypes within OSU | No. of supporting sequences | Present in dataset | Description |
Abramis_brama_1of1_n11 | Abramis brama | 1 | 11 | Yes | Fish |
Alburnoides_bipunctatus_1of2_n9 | Alburnoides bipunctatus | 2 | 9 | Yes | Fish |
Alburnoides_bipunctatus_2of2_n1 | Alburnoides bipunctatus | 2 | 1 | Yes | Fish |
Alburnus_alburnus_1of1_n11 | Alburnus alburnus | 1 | 11 | Yes | Fish |
Alburnus_arborella_1of2_n9 | Alburnus arborella | 2 | 9 | Yes | Fish |
Alburnus_arborella_2of2_n1 | Alburnus arborella | 2 | 1 | No | Fish |
Alosa_sp_1of1_n8 | Alosa sp. | 1 | 8 | No | Fish |
Ameiurus_melas_1of1_n5 | Ameiurus melas | 1 | 5 | No | Fish |
Anguilla_anguilla_1of1_n21 | Anguilla anguilla | 1 | 21 | Yes | Fish |
Barbatula_sp_lineage_I_1of5_n18 | Barbatula sp. lineage I | 5 | 18 | Yes | Fish |
Barbatula_ sp _lineage_I_2of5_n1 | Barbatula sp. lineage I | 5 | 1 | Yes | Fish |
Barbatula_ sp _lineage_I_3of5_n2 | Barbatula sp. lineage I | 5 | 2 | Yes | Fish |
Barbatula_ sp _lineage_I_4of5_n1 | Barbatula sp. lineage I | 5 | 1 | Yes | Fish |
Barbatula_ sp _lineage_I_5of5_n2 | Barbatula sp. lineage I | 5 | 2 | Yes | Fish |
Barbatula_ sp _lineage_II_1of1_n18 | Barbatula sp. lineage II | 1 | 18 | Yes | Fish |
Barbatula_quignardi_1of1_n7 | Barbatula quignardi | 1 | 7 | Yes | Fish |
Barbus_barbus_1of3_n10 | Barbus barbus | 3 | 10 | Yes | Fish |
Barbus_barbus_2of3_n1 | Barbus barbus | 3 | 1 | Yes | Fish |
Barbus_barbus_3of3_n1 | Barbus barbus | 3 | 1 | Yes | Fish |
Barbus_caninus_1of1_n1 | Barbus caninus | 1 | 1 | Yes | Fish |
Barbus_plebejus_1of2_n6 | Barbus plebejus | 2 | 6 | Yes | Fish |
Barbus_plebejus_2of2_n2 | Barbus plebejus | 2 | 2 | Yes | Fish |
Blicca_bjoerkna_1of1_n8 | Blicca bjoerkna | 1 | 8 | Yes | Fish |
Bos_taurus | Bos taurus | NA | NA | Yes | Cow |
Bos_taurus | Bos taurus | NA | NA | No | Cow |
Carassius_gibelio_1of1_n7 | Carassius gibelio | 1 | 7 | Yes | Fish |
Chondrostoma_nasus_1of2_n2 | Chondrostoma nasus | 2 | 2 | Yes | Fish |
Chondrostoma_nasus_2of2_n4 | Chondrostoma nasus | 2 | 4 | Yes | Fish |
Chondrostoma_soetta_1of1_n2 | Chondrostoma soetta | 1 | 2 | Yes | Fish |
Cobitis_bilineata_1of1_n18 | Cobitis bilineata | 1 | 18 | No | Fish |
Coregonus_heglingus_1of1_n1 | Coregonus heglingus | 1 | 1 | Yes | Fish |
Coregonus_sp_1of3_n70 | Coregonus sp. | 3 | 70 | Yes | Fish |
Coregonus_sp_2of3_n1 | Coregonus sp. | 3 | 1 | Yes | Fish |
Coregonus_sp_3of3_n1 | Coregonus sp. | 3 | 1 | Yes | Fish |
Cottus_1of3_n37 | Cottus sp. | 3 | 37 | Yes | Fish |
Cottus_2of3_n22 | Cottus sp. | 3 | 22 | Yes | Fish |
Cottus_3of3_n4 | Cottus sp. | 3 | 4 | Yes | Fish |
Cyprinus_carpio_1of1_n11 | Cyprinus carpio | 1 | 11 | Yes | Fish |
Esox_cisalpinus_1of2_n7 | Esox cisalpinus | 2 | 7 | Yes | Fish |
Esox_cisalpinus_2of2_n1 | Esox cisalpinus | 2 | 1 | Yes | Fish |
Esox_lucius_1of2_n19 | Esox lucius | 2 | 19 | Yes | Fish |
Esox_lucius_2of2_n1 | Esox lucius | 2 | 1 | Yes | Fish |
Gadus_chalcogrammus_Consensus | Gadus chalcogrammus | NA | NA | Yes | Fish - positive |
Gadus_chalcogrammus_Consensus | Gadus chalcogrammus | NA | NA | No | Fish - positive |
Gadus_morhua_Consensus | Gadus morhua | NA | NA | Yes | Fish - positive |
Gadus_morhua_Consensus | Gadus morhua | NA | NA | No | Fish - positive |
Gasterosteus_sp_1of1_n12 | Gasterosteus sp. | 1 | 12 | Yes | Fish |
Gobio_gobio_1of1_n14 | Gobio gobio | 1 | 14 | Yes | Fish |
Gobio_obtusirostris_1of2_n5 | Gobio obtusirostris | 2 | 5 | Yes | Fish |
Gobio_obtusirostris_2of2_n1 | Gobio obtusirostris | 2 | 1 | Yes | Fish |
Gymnocephalus_cernua_1of2_n11 | Gymnocephalus cernua | 2 | 11 | Yes | Fish |
Gymnocephalus_cernua_2of2_n10 | Gymnocephalus cernua | 2 | 10 | Yes | Fish |
Homo_sapiens_Consensus | Homo sapiens | NA | NA | Yes | Human |
Homo_sapiens_Consensus | Homo sapiens | NA | NA | No | Human |
Ichthyosaura_alpestris | Ichthyosaura alpestris | NA | NA | Yes | Newt |
Ichthyosaura_alpestris | Ichthyosaura alpestris | NA | NA | No | Newt |
Lampetra_planeri_1of1_n4 | Lampetra planeri | 1 | 4 | No | Fish |
Lepomis_gibbosus_1of1_n10 | Lepomis gibbosus | 1 | 10 | Yes | Fish |
Leucaspius_delineatus_1of1_n2 | Leucaspius delineatus | 1 | 2 | Yes | Fish |
Leuciscus_leuciscus_1of3_n16 | Leuciscus leuciscus | 3 | 16 | Yes | Fish |
Leuciscus_leuciscus_2of3_n3 | Leuciscus leuciscus | 3 | 3 | Yes | Fish |
Leuciscus_leuciscus_3of3_n1 | Leuciscus leuciscus | 3 | 1 | Yes | Fish |
Lota_lota_1of1_n16 | Lota lota | 1 | 16 | Yes | Fish |
Micropterus_salmoides_1of1_n5 | Micropterus salmoides | 1 | 5 | No | Fish |
Neogobius_kessleri_1of1_n2 | Neogobius kessleri | 1 | 2 | No | Fish |
Neogobius_melanostomus_1of1_n3 | Neogobius melanostomus | 1 | 3 | Yes | Fish |
Oncorhynchus_mykiss_1of2_n1 | Oncorhynchus mykiss | 2 | 1 | Yes | Fish |
Oncorhynchus_mykiss_2of2_n7 | Oncorhynchus mykiss | 2 | 7 | Yes | Fish |
Padogobius_bonelli_1of1_n9 | Padogobius bonelli | 1 | 9 | No | Fish |
Perca_fluviatilis_1of1_n30 | Perca fluviatilis | 1 | 30 | Yes | Fish |
Phoxinus_csikii_1of4_n2 | Phoxinus csikii | 4 | 2 | Yes | Fish |
Phoxinus_csikii_2of4_n1 | Phoxinus csikii | 4 | 1 | Yes | Fish |
Phoxinus_csikii_3of4_n26 | Phoxinus csikii | 4 | 26 | Yes | Fish |
Phoxinus_csikii_4of4_n1 | Phoxinus csikii | 4 | 1 | Yes | Fish |
Phoxinus_lumaireul_1of2_n6 | Phoxinus lumaireul | 2 | 6 | Yes | Fish |
Phoxinus_lumaireul_2of2_n2 | Phoxinus lumaireul | 2 | 2 | Yes | Fish |
Phoxinus_septimaniae_1of1_n18 | Phoxinus septimaniae | 1 | 18 | Yes | Fish |
Phoxinus_sp_1of1_n2 | Phoxinus sp. | 1 | 2 | Yes | Fish |
Pseudorasbora_parva_1of1_n7 | Pseudorasbora parva | 1 | 7 | Yes | Fish |
Rhodeus_amarus_1of3_n3 | Rhodeus amarus | 3 | 3 | No | Fish |
Rhodeus_amarus_2of3_n11 | Rhodeus amarus | 3 | 11 | No | Fish |
Rhodeus_amarus_3of3_n1 | Rhodeus amarus | 3 | 1 | No | Fish |
Rupicapra_rupicapra | Rupicapra rupicapra | NA | NA | Yes | Chamois |
Rupicapra_rupicapra | Rupicapra rupicapra | NA | NA | No | Chamois |
Rutilus_aula_1of1_n11 | Rutilus aula | 1 | 11 | Yes | Fish |
Rutilus_pigus_1of2_n1 | Rutilus pigus | 2 | 1 | No | Fish |
Rutilus_pigus_2of2_n9 | Rutilus pigus | 2 | 9 | No | Fish |
Rutilus_rutilus_1of2_n20 | Rutilus rutilus | 2 | 20 | Yes | Fish |
Rutilus_rutilus_2of2_n3 | Rutilus rutilus | 2 | 3 | Yes | Fish |
Sabanejewia_larvata_1of1_n1 | Sabanejewia larvata | 1 | 1 | No | Fish |
Salaria_fluviatilis_1of3_n6 | Salaria fluviatilis | 3 | 6 | Yes | Fish |
Salaria_fluviatilis_2of3_n1 | Salaria fluviatilis | 3 | 1 | Yes | Fish |
Salaria_fluviatilis_3of3_n14 | Salaria fluviatilis | 3 | 14 | Yes | Fish |
Salmo_carpio_1of1_n1 | Salmo carpio | 1 | 1 | Yes | Fish |
Salmo_salar_1of1_n3 | Salmo salar | 1 | 3 | Yes | Fish |
Salmo_sp_1of3_n47 | Salmo sp. | 3 | 47 | Yes | Fish |
Salmo_sp_2of3_n11 | Salmo sp. | 3 | 11 | Yes | Fish |
Salmo_sp_3of3_n2 | Salmo sp. | 3 | 2 | Yes | Fish |
Salmo_trutta_1of2_n1 | Salmo trutta | 2 | 1 | Yes | Fish |
Salmo_trutta_2of2_n1 | Salmo trutta | 2 | 1 | Yes | Fish |
Salvelinus_fontinalis_1of1_n6 | Salvelinus fontinalis | 1 | 6 | Yes | Fish |
Salvelinus_namaycush_1of1_n4 | Salvelinus namaycush | 1 | 4 | Yes | Fish |
Salvelinus_sp_1of1_n21 | Salvelinus sp. | 1 | 21 | Yes | Fish |
Sander_lucioperca_1of1_n9 | Sander lucioperca | 1 | 9 | Yes | Fish |
Scardinius_erythrophthalmus_1of3_n11 | Scardinius erythrophthalmus | 3 | 11 | Yes | Fish |
Scardinius_erythrophthalmus_2of3_n1 | Scardinius erythrophthalmus | 3 | 1 | Yes | Fish |
Scardinius_erythrophthalmus_3of3_n1 | Scardinius erythrophthalmus | 3 | 1 | Yes | Fish |
Scardinius_hesperidicus_1of2_n5 | Scardinius hesperidicus | 2 | 5 | Yes | Fish |
Scardinius_hesperidicus_2of2_n6 | Scardinius hesperidicus | 2 | 6 | Yes | Fish |
Silurus_glanis_1of1_n12 | Silurus glanis | 1 | 12 | No | Fish |
Squalius_cephalus_1of1_n13 | Squalius cephalus | 1 | 13 | Yes | Fish |
Squalius_squalus_1of2_n9 | Squalius squalus | 2 | 9 | Yes | Fish |
Squalius_squalus_2of2_n1 | Squalius squalus | 2 | 1 | Yes | Fish |
Sturnus_vulgaris | Sturnus vulgaris | NA | NA | Yes | Starling |
Sturnus_vulgaris | Sturnus vulgaris | NA | NA | No | Starling |
Sus_scrofa | Sus scrofa | NA | NA | Yes | Pig |
Sus_scrofa | Sus scrofa | NA | NA | No | Pig |
Telestes_muticellus_1of1_n12 | Telestes muticellus | 1 | 12 | Yes | Fish |
Telestes_souffia_1of1_n14 | Telestes souffia | 1 | 14 | Yes | Fish |
Thymallus_thymallus_1of3_n1 | Thymallus thymallus | 3 | 1 | Yes | Fish |
Thymallus_thymallus_2of3_n9 | Thymallus thymallus | 3 | 9 | Yes | Fish |
Thymallus_thymallus_3of3_n2 | Thymallus thymallus | 3 | 2 | Yes | Fish |
Tinca_tinca_1of1_n29 | Tinca tinca | 1 | 29 | Yes | Fish |
Turdus_philomelos | Turdus philomelos | NA | NA | Yes | Thrush |
Turdus_philomelos | Turdus philomelos | NA | NA | No | Thrush |
Note: The 116 Operational Sequence Units derived from the 912 Sanger sequences collected as part of this study. To provide certainty, the label of each OSU is broken into the elements used to make that taxonomic assignment: the consensus taxonomic label derived from the sequences in the cluster (green) + No. of haplotypes (blue) + No. of supporting sequences (orange). For example, Abramis_brama_1of1_n11 is an OSU to species level with only 1 OSU in the reference database and 9 sequences supporting that OSU level taxonomic name. Salmo_sp_1of3_n47 is an OSU for genus level because the sequences which support this OSU are from different Salmo species, this OSU is 1 of 3 Salmo sp. OSU and has 47 sequences supporting this OSU level taxonomic name.
TABLE 2 Test for mapping parameter and efficiency
ID | QC | Reads mapped | % Mapped | No. of OSUs |
1 | 1 | 4,683,758 | 52.7 | 95 |
1 | 0.99 | 4,684,009 | 52.7 | 95 |
1 | 0.98 | 4,684,012 | 52.7 | 95 |
1 | 0.97 | 4,684,012 | 52.7 | 95 |
0.99 | 1 | 6,991,312 | 78.6 | 99 |
0.99 | 0.99 | 6,991,768 | 78.6 | 99 |
0.99 | 0.98 | 6,991,773 | 78.6 | 99 |
0.99 | 0.97 | 6,991,775 | 78.6 | 99 |
0.98 | 1 | 7,753,236 | 87.2 | 101 |
0.98 | 0.99 | 7,753,792 | 87.2 | 101 |
0.98 | 0.98 | 7,753,802 | 87.2 | 101 |
0.98 | 0.97 | 7,753,806 | 87.2 | 101 |
0.97 | 1 | 7,877,395 | 88.6 | 101 |
0.97 | 0.99 | 7,878,060 | 88.6 | 101 |
0.97 | 0.98 | 7,878,082 | 88.6 | 101 |
0.97 | 0.97 | 7,878,086 | 88.6 | 101 |
Note: Number and percentage of sequences mapped to the total of 116 OSUs using Usearch::cluster_fast with different sequence identity (ID) and query coverage (QC) parameters. The parameters highlighted in bold were used to generate the data in this study.
Closed reference databaseRecently, two large-scale projects were carried out to quantitatively survey the fish biodiversity in the lakes of Switzerland and adjacent perialpine regions (Projet Lac, 2010–2018; Alexander & Seehausen, 2021), and in the rivers of Switzerland (Progetto Fiumi, 2013–2018). More than 150,000 fish individuals were sampled and identified. Population genetic studies were performed for many complexes of closely related species to assess species boundaries (e.g., for the genera Coregonus, Salvelinus, Cottus and Phoxinus). From both projects, specimen and tissue reference collections were established for over 18,000 specimens at the Natural History Museum of Bern (NMBE). The closed reference database used in this study is a collection of 12S rRNA sequences of fish species from 912 of these specimens, representing all fish species, and multiple populations of most, collected in the Projet Lac and Progetto Fiumi. The collection represents 110 of the 124 species known to occur in Switzerland (Table S4). Total genomic DNA was extracted from muscle tissue or fin clips preserved in 100% ethanol and stored at –80°C using the DNeasy Blood and Tissue Kit on a QIAcube robotic workstation following manufacturer's instructions (Qiagen, Hilden, Germany). Partial 12S rRNA fragments were amplified using the primers MiFish_U_F (Miya et al., 2015) and Valentini_tele01_H1913 (Valentini et al., 2016). PCR protocol and PCR conditions follow Conte-Grand et al. (2017). PCR products were cleaned, and Sanger sequenced using both PCR primers by LGC Genomics, Berlin. Raw reads were edited and assembled into contigs using Geneious Prime v2022.0.2 (
The MiSeq run from the eDNA samples generated 14.27 million reads passing the default quality filter with Q30 of 79.13%. We used the MiSeq Control Software (MCS v3.1) to de-multiplex the data, which resulted in 10,966,567 paired-end raw reads. The median number of read per sample was 20,450 (range 8 to 93,441). After data processing, we had 8,889,951 (81%) high-quality amplicons for the downstream analysis (Figure 1). The percentage of high-quality amplicons that could be mapped back to the zOTUs was high, 98.5% (Figure S2). Overall, the abundance of reads assigned to each zOTUs and OSUs was relatively similar, with only minor variation (Figure S3). An overview of the quality control and information on data lost through filtering in the bioinformatic steps can be found in the DRYAD repository (See data availability statement).
Comparison between the three reference databasesThe number of zOTUs with annotation to species, genus or family level was similar for the three reference databases (Figure 2). The most common zOTUs were the same with each reference database, both in terms of identity and level of taxonomic resolution (i.e., Cottus, Salmo, Squalis cephalus, other zOTUs assigned to Cyprinidae). All zOTUs were annotated, though some of these annotations are not beyond Class level: 3.56%, 1.22% and 1.28% of the total abundance MitoFish, MitoFish+ and MIDORI, respectively (Figure 2). However, these taxonomic annotations can be misleading. As missing clades or outgroups are the Achilles' heel of classifiers, the actual percentage of missing annotation may be much higher. All records in the MitoFish reference have the same kingdom (Eukaryota), phylum (Chordata) and class (Actinopteri), therefore all zOTUs are annotated at least to class level “Actinopteri.” The MIDORI reference is more diverse, but all records are Eukaryotes and so all unidentified noneukaryotic zOTUs will be mis-labelled as “Eukaryotes” as the highest taxonomic grouping. Considering the diversity limits of the reference database, the annotation failed for 98.34% of the zOTUs using MitoFish and 96.96% using MIDORI as annotation reference beyond family level. This result shows that not all zOTUs generated in this study are fish related. To improve the annotation, we added nonfish related sequences to the MitoFish reference (MitoFish+) based on Blast hits using zOTUs with missing or bad (low taxonomic rank or unclear labels like “environmental samples”) annotations as queries. This expanded MitoFish+ reference database included outgroups like bacteria. Surprisingly, 71% of the zOTUs are assigned to bacteria and are thus not of eukaryotic origin. This seems a lot, but these bacteria zOTUs represented only 6.9% of the total reads (abundance), therefore bacteria zOTUs are abundant (in terms of species number) but rare in our dataset in terms of abundances (Table S2). It also shows that the annotations should be evaluated with caution and references might be adjusted to get more meaningful results.
FIGURE 2. A comparison (e.g., number and abundance) of the taxonomic annotation depth of the 50 most common zOTUs and OSUs. Either MitoFish, MitoFishPlus or MIDORI was used to annotate the zOTUs. (a) Number; (b) Abundance. The colors represent the taxonomic assignment level (species—yellow, genus—green and family—blue), with orange indicating the proportion of zOTUs without taxonomic assignment in the corresponding reference database.
Overall, we find that the predicted annotations for the most abundant zOTUs (those zOTUs with the highest read number) are similar to Mitofish, Mitofish+ or MIDORI as the annotation reference, with a few interesting exceptions. Although the MiFish primers are fish-specific, nontarget amplification is difficult to avoid. For this reason, the reference databases should cover a broader diversity (e.g., MIDORI) or be extended (MitoFish+). For example, zOTU13 and zOTU24 had no correct assignment with MitoFish. Using the MIDORI as a reference, zOTU13 was associated with Bovidae (Cow), possibly because of the addition of BSA in the PCR reaction and zOTU24 was identified as possible human contamination. There are also minor, but perhaps intriguing, differences in annotation between the open reference databases. One prime example to demonstrate this is the annotation of zOTU27, which is associated with Phoxinus phoxinus using MitoFish, while using MIDORI the annotation concurs but is only to family level (Leuciscidae). As the species P. phoxinus is not found in the study area (Alexander & Seehausen, 2021) it showcases the differing outcomes and reliability of using closed reference databases.
Closed reference database andThe number of unique OSUs per species varied from 1 (e.g., Tinca tinca) to maximal 5 (e.g., Barbatula sp. lineage I). The number of sequences support unique haplotypes varies from 1 (e.g., Salaria fluviatilis) to 47 (e.g., Salmo sp.) with a mean of 7.75. The sequence resolution for the 12S amplicon is good and most unique haplotype groups contain only one species exception occurred (e.g., Phoxinus or Salmo, see Table 1 for further OSU details). The ML tree of the unique haplotypes is shown in Figure S4. By using a closed reference database with the target taxa and supplemented with possible outgroups, here we assigned amplicons reads to a single OSU with the appropriate species or genus level taxonomic information (Figures 2 and 3). For the OSU read mapping, we use two parameters: Sequence identity (ID) and query coverage (QC). Stringent mapping criteria reduces mapping efficiency, while relaxed ones will cause false assignment. In Table 2, the percentage of reads mapped ranges from 52.7% to 88.6%, with the ID threshold having a bigger influence on the mapping rate than QC and therefore a higher number of OSUs being found when the ID threshold was lowered. The best results were obtained with the following parameters and used for subsequent mapping: ID: 97 and QC: 100. This resulted in 7,877,395 (88.6%) reads mapped to 101 OSUs (87%, shown in Figure S5). Compared to the open reference database taxonomic assignments, was entirely to genus/species level (8 taxonomic assignments were to genus and could not be resolved to species because of insufficient amplicon resolution, see Figure 3). Most of the reads which could not be mapped to any of the OSUs were bacteria.
FIGURE 3. The top 50 OSU. Taxonomic assignment using this approach was to either genus or species level. The taxonomic assignment includes the species/genus name, the number of OSUs with the same annotation in the dataset (i.e., 1 of 2, 2 of 2) and number of sequences supporting this taxonomic assignment (genus—green, species—yellow). See also Table 1. Text color indicates taxa type: blue—fish species or genus, black—mammals and green positive control.
We used four positive control samples (PCP) in this library, containing a tissue DNA extract from Atlantic Cod, Gadus morhua (See Table S3 for sequence). Using positive control samples in this way enables us to identify possible contamination levels between samples (cross-contamination or tag-switching) and is a more effective method compared to using negative controls. The classic zOTU (MIDORI) and the OSU approach show a similar relative composition (Figure 4). In the classic approach, 72.9%–85.3% of reads are assigned to Family level Gadidae and in the OSU approach 71.7%–86.3% of the reads were assigned to Gadus morhua. The difference between the approaches lies in the taxonomic depth of the annotation.
FIGURE 4. Relative abundance heat map for the four positive control PCR samples (PCP). a. using the zOTU approach and the MIDORI reference database, b. using OSU approach. As a positive control we used Gadus morhua (Atlantic cod) tissue extract.
In both approaches, minor contamination (in the form of other taxa) was found, but as shown in Figure 4, the relative read abundance of these other taxa was low compared to the Gadidae or Gadus morhua assigned reads. Interestingly, when comparing the different taxa found in the positive samples with each of the two approaches, there is only a minor deviation between the two approaches in terms of taxonomic assignments. Using a minimum abundance threshold (200 reads), a total of 11 and 10 other taxa were found using the zOTU-MIDORI and OSU approach, respectively. Most of the annotations were identical (i.e., Salmo, Cottus, Squalis cephalus, Bos tarus, Esox lucius, Oncorhynchus mykiss, Sus scrofa, Homo sapiens) with a few exceptions: Phoxinus phoxinus (zOTU-MIDORI) vs. P. csikii (OSU) (See Figure 4) and Barbonymus schwanefeldii with zOTU-MIDORI, but not found with the OSU approach. As the detection of B. schwanefeldii was only found in the positive controls (206 reads on average, ranging from 0 to 523), we can assume this represents a true contamination associated with our positive control spike, and therefore would not be assigned using the OSU approach, as it was not in the closed reference database. In terms of the abundances of taxa found in all positive control samples, there is a striking similarity between the open and closed reference databases for those taxa with high abundance as is also the case for those taxa found in the actual samples (Figures S6 and S7). The variation in read % per taxon between the two approaches for those common taxa can be explained by the difference in species contained in each of the reference database approaches. Similarly, we find the detection of some taxa which could be considered rare (in low abundance) is more variable and often only found in one of the two approaches, as the species is likely absent from one of the reference databases.
Negative controlsAs with the positive controls, minor contamination was also found in the negative controls. This is likely to have occurred during library preparation and is typical of such eDNA studies where PCR amplification is highly efficient in those samples with minimal amplification competition, such as that from minor contamination. Only two of the eight negative control samples contained a total of >350 reads. The distribution of reads for these samples are shown in Figure S8 and are strikingly similar across each of the taxonomic assignment methods. The majority of the reads found belong to some of the most common taxa found in the samples, including the most and second most occurring species found in the dataset (Squalis cephalus and Salmo sp.). The six other samples contain on average 74 (range 12–374), 74 (range 12–374), and 14 (range 10–19) using the MIDORI, MitoFish+ and OSU reference databases, respectively.
PhylogenyTo compare the sequences derived from each approach, classic and OSU, we carried out a phylogenetic comparison. Figure 5 shows the top 20 most abundant zOTU (MIDORI) alongside the 20 most abundant OSU hits. The tree highlights both the disparity (e.g., Leucisidae) and similarity (e.g., Squalius cephalus) between taxonomic annotation levels using the classic zOTU (MIDORI) and OSU approaches. However, with Salmo sp. and Cottus sp. both approaches are the same to genus level only. This is because of the 12S amplicon length (in this case 192 bp). Species detection is the aim of most eDNA studies, however sequencing technologies often only target a small fragment of DNA, which is insufficient for species determination for some taxonomy.
FIGURE 5. A phylogenetic comparison of the top 20 zOTUs and OSUs. zOTUs are presented in blue and OSUs are in red, relative abundance is added to the label for both approaches.
Species detection derived from eDNA samples relies on reference sequences and involves linking genetic sequences (barcode regions) with taxonomic names. Here, we examined the influence of different types of reference databases (open/publicly available vs. closed/curated) and annotation approaches. Our study takes advantage of a curated fish reference database generated from multiple specimens of all fish species occurring across the study area (i.e., a nationwide collection in Switzerland) to explore fish communities derived from eDNA water samples collected in rivers and acts as a proof of principle study for the wider use of curated and study specific reference databases. Using a curated reference database, combined with a proposed OSU approach, we gained a higher proportion of species and genus level taxonomic assignment compared to the open reference databases used in this study (MitoFish, MitoFish+ and MIDORI). Although the annotation may be to different levels, encouragingly, the top 5 taxa generated from the classic zOTU and OSU approach are identical and equate to 53.03% and 63.56% of the taxonomically assigned reads generated from the two approaches, respectively. The quality and depth of the annotation in all reference databases depends on several factors, including species coverage, sequence representation, and reliability of the sequence. Consequently, we have synthesized the main points that should be considered when curating a reference database to ensure appropriate taxonomic assignment with full, verified taxonomic annotation (See Text Box 1). Here, we discuss how this will lead to more accurate and informative biodiversity data.
The perfect reference database is difficult to create, but here are some important factors to consider when constructing your own study specific reference database:
General considerations
- Species should be represented by multiple sequences.
- Species must be correctly identified.
- Sequences should be derived from reliable tissue samples.
- All species found in the geographic region of the study (from the target groups) should have representative sequences in the reference database.
- Sequences should be phylogenetically placed to ensure against incorrect labelling.
- All reference sequences used should cover the full length of the amplicon.
Specific considerations
- Reference databases must include outgroups to avoid problematic assignments due to co-amplification of nontarget groups.
- Consider an OSU approach for a more specific survey if sequences for target species are available.
- After assigning taxonomic names to your data, ensure its biological/ecological plausibility.
Basic considerations when constructing a reference database include species coverage, sequence representation, and reliability of the sequence. Firstly, a list of the target species within the study area should be made and the correct sequences available (in terms of length and marker) found for those species should be assembled (Blackman et al., 2021). This allows any gaps to be identified and therefore an immediate understanding of the limitation of your reference database and data interpretation (Li et al., 2022; Weigand et al., 2019). Secondly, multiple sequences should be used for each species to ensure haplotypic diversity (Leite et al., 2020). Notably, of the top 5 most abundant taxa identified in our study, Barbus barbus is identified at species level using the OSU (closed reference), but only at family level using the zOTU approach (open reference). The lack of species determination for B. barbus could be due to the lack of local species sequence in the open reference databases and the high intraspecific variation not represented in these databases (Leite et al., 2020; Weigand et al., 2019). The representation of the genetic diversity within a species is an important aspect to consider when curating a reference library. Thus, not only should multiple sequences for each species be included in the database, but also the origins of those sequences must be considered, in order to cover geographic and taxonomic breadth. By including geographic information when submitting sequences, studies can ensure species are represented by appropriate sequences (see also Li et al., 2022), which cover local genetic diversity and are relevant to the study area and the resulting data (Bergsten et al., 2012).
Once the sequences have been sourced for a reference database, essential taxonomic annotation checks should be applied to ensure their identity is correct and the annotations is complete. Leray et al. (2019) examined the potential reasons for these discrepancies in GenBank. Their study documented several reasons why the taxonomic annotation could be incorrect; firstly, the amplified sequence may not be from the target organism, either due to contamination or intimate association (i.e., bacteria). Secondly, pseudogene amplification and thirdly, incorrect identification of the target taxa initially. To address these issues when curating a reference database, sequences should be phylogenetically placed to ensure their annotation is correct. Mislabelled sequences can then be identified and removed from the database (for example, by using SATIVA; Kozlov et al., 2016). Thirdly, by systematically screening sequences based on their phylogeny, any sequences with inadequate annotation may be placed phylogenetically within the contexts of the other sequences within the database.
Reference database construction: Specific considerationsAs with several metabarcoding primers, the MiFish primers are known to amplify other nontarget organismal groups, including mammal and bird species (Ritter et al., 2022). In our study, we further examined those zOTUs that received a poor taxonomic annotation. The Blast hit results from these sequences showed that a proportion of our data was in fact assigned to Bacteria. We therefore recommend that reference databases routinely encompass outgroups, that is, a wider diversity than the study target group, as such sequences would otherwise be incorrectly (and phylogenetically too narrowly) assigned. Here, our closed reference database MitoFish+ and MIDORI included further diversity, as opposed to MitoFish, which is fish only. This approach will allow reads from nontarget groups to be assigned correctly rather than “forcing” taxonomic assignment to the represented taxa incorrectly. With this approach we revealed taxa from different classes and orders (e.g., see nontarget assignments in Figures S6 and S7). Other studies of fish communities derived from eDNA have previously demonstrated a shift in taxonomic assignment when reference databases are broadened and further diversity is included (Schenekar et al., 2020); therefore, the inclusion of such outgroups are not only advised, but essential in preventing false positives results. Confidence thresholds for the annotation are also crucial in this respect. We recommend using higher cut-offs (>85%) and a closer examination of the unannotated zOTU instead of lowering the confidence threshold.
By employing an OSU approach and assigning (mapping) reads to individual sequence representatives, we ensured the full taxonomic annotation and certainty of each assignment. In this study, we had access to a large custom-built reference database, not only encompassing the geographic range of our study sites in which eDNA samples were taken, but also with complete taxonomic annotation carried out by expert ichthyologists. The OSU approach worked well in mapping sequences to the OSUs in the database, but back-mapping efficiency varied (52.7%–88.6%) depending on the parameters used. The results show that identity (ID) had a bigger influence on efficiency than query coverage (QC). However, although more reads map to the OSUs with a lower ID, the fish-species composition does not change. There are also limitations to the OSU approach to consider. There are different ways of handling sequences which cannot be mapped clearly to one OSU, but to several. For example, Usearch::Sintax assigns these sequences to the first OSU, so the order of similar OSUs could play a role. In addition, different ways of calculating similarity might be considered when mapping the sequence to an OSUs. The position of a mismatch could be decisive in assigning a sequence to the correct OSU. Furthermore, the codon structure of amplicons of coding gene regions (e.g., COI) could improve the mapping. That said, the reference database we used here represents a near perfect set of sequences across our study area and target group. This form of reference database is close to the ideal, but required extensive investment, both in terms of time and financially. Attempts to replicate this form of reference database should not be disregarded and should (long term) be an aim of those researchers and end-users who can collect specimens and contribute to these databases.
In terms of reference database curation, a last but vital consideration when reviewing data should be its biological or ecological setting. In all metabarcoding and biodiversity studies, particularly those from eDNA samples, scientists must always contextualize their data to determine the success or failure of taxonomic assignment, particularly if a species is found outside its known habitat or range. As advice for data analysis, derived either from an open or closed reference database, the possible reasons other than natural occurrences must be considered, such as zoo or restaurant outflows into the water being sampled.
Further considerations when usingAlthough the OSUs represented 116 species/genera of fish that are known to occur in the sites we sampled, 15 autochthonous species known from the study area were not detected in this sampling campaign. This level of discrepancy is comparable to many other eDNA studies (e.g., see Keck et al., 2022). These absences could be either true negatives of the species at the sampling site or false negatives. The latter can derive from several biological and methodological constraints of collecting eDNA in flowing water, such as insufficient amounts of DNA shed by the target organism into the water body or similarly the heterogenous nature of eDNA within the water column meant the sample did not contain any of that species' DNA (as discussed in Bruce et al., 2021). Environmental DNA sampling, although not a perfect method, has vastly improved in the last 10 years, in terms of both usability for biomonitoring and our interpretation and understanding of the results we glean from collecting these samples. Nevertheless, one noticeable and further constraint of eDNA is the reliance on the amplification of small fragments of DNA. Typically, fragments <400 bp are targeted due to sequencing platform requirements (the 12S fragment here ~170 bp). By using such small fragments, confidence in assignment and taxonomic resolution is limited (Deiner et al., 2016; Port et al., 2016), as demonstrated in the genus level-only determination of certain taxa using both the classic and OSU approach (e.g., Salmo, Cottus, Coregonus and Phoxinus). With the 12S amplicon we can distinguish between genera, but not necessarily within. For example, relatively closely related species (i.e., in the first one to two million years after speciation) can usually not be distinguished by single stretches of sequence variation, such as short barcodes. Therefore, we cannot distinguish certain species, such as several Salmo species which are known from Switzerland (i.e., Salmo labrax and Salmo marmoratus), or virtually all the phylogenetically very young Coregonus species (which radiated within the last few ten thousand years, Jardim de Queiroz et al., 2022). With all reference approaches, the reads of the former all received the Salmo sp. assignments. This un-resolved assignment was therefore due to the methodological constraints of using a short DNA fragment unable to distinguish young species within this genus. Focusing on larger fragments or different regions may become more common practice in the future, but currently, when using such small fragments of DNA for fish biomonitoring, it is an important aspect to consider, particularly if the aim is species determination.
CONCLUSIONIncomplete or inaccurate reference databases have long been highlighted as a limiting factor to correctly assign metabarcoding data. Here, we demonstrated how locally derived curated reference databases improve species level assignments, which is a key goal of eDNA metabarcoding studies. However, we have also identified potential ways to improve and utilize current open reference databases with simple curation steps (i.e., phylogenetic placement, the addition of geographically relevant sequences, including outgroups, and removing incomplete or problematic annotated references), which will likely improve family or genus level assignment and help to explain data losses. We encourage researchers to consider the implications of their reference databases and curate them following our simple suggestions (See Text Box 1). We further support the improvement of reference sequence generation, including efforts to collect, morphologically identify, and sequence more specimens to fill gaps in current reference databases. These steps will lead to better reference databases and taxonomic assignment for biodiversity.
AUTHOR CONTRIBUTIONSRosetta C. Blackman, Jean-Claude Walser, and Florian Altermatt conceived the study. Jakob Brodersen performed the genomic lab analyses and Jean-Claude Walser performed the bioinformatics. Lukas Rüber, Ole Seehausen, Soraya Villalba, and Jakob Brodersen collected the fish, sequenced them, and constructed the closed reference database. Jean-Claude Walser performed the analysis. Rosetta C. Blackman and Jean-Claude Walser produced all figures and wrote the first draft. All authors contributed to the interpretation data and commented on the paper.
ACKNOWLEDGMENTSWe thank the Swiss Federal Office for the Environment (BAFU/FOEN) for access to the eDNA samples. We thank two anonymous reviewers for their comments on the manuscript. Funding is from the Swiss National Science Foundation (grant nr. 31003A_173074) and the University of Zurich Research Priority Programme in Global Change and Biodiversity (URPP GCB) to FA. The Federal Office for the Environment/Bundesamt für Umwelt (FOEN/BAFU) financed the establishment of the 12S DNA reference library of Swiss fishes under the contract 00.5058.PZ/6B1725F08 to LR and OS. Genetic analyses were done in collaboration with the Genetic Diversity Centre (GDC) at ETH Zurich.
CONFLICT OF INTERESTThe authors of this study have no conflict of interest to declare. The manuscript is original work. All contributing authors have seen and agreed with the content before submission. Furthermore, we guarantee that this study has not been submitted to any other journal.
DATA AVAILABILITY STATEMENTSequencing data generated during this study, the data analysis scripts and an overview of the quality control and information on data lost through filtering in the bioinformatic steps can be found on the DRYAD repository
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2023. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Metabarcoding of environmental DNA (eDNA) is a powerful tool for describing biodiversity, such as finding keystone species or detecting invasive species in environmental samples. Continuous improvements in the method and the advances in sequencing platforms over the last decade have meant this approach is now widely used in biodiversity sciences and biomonitoring. For its general use, the method hinges on a correct identification of taxa. However, past studies have shown how this crucially depends on important decisions during sampling, sample processing, and subsequent handling of sequencing data. With no clear consensus as to the best practice, particularly the latter has led to varied bioinformatic approaches and recommendations for data preparation and taxonomic identification. In this study, using a large freshwater fish eDNA sequence dataset, we compared the frequently used zero-radius Operational Taxonomic Unit (zOTU) approach of our raw reads and assigned it taxonomically (i) in combination with publicly available reference sequences (open databases) or (ii) with an OSU (Operational Sequence Units) database approach, using a curated database of reference sequences generated from specimen barcoding (closed database). We show both approaches gave comparable results for common species. However, the commonalities between the approaches decreased with read abundance and were thus less reliable and not comparable for rare species. The success of the zOTU approach depended on the suitability, rather than the size, of a reference database. Contrastingly, the OSU approach used reliable DNA sequences and thus often enabled species-level identifications, yet this resolution decreased with the recent phylogenetic age of the species. We show the need to include target group coverage, outgroups and full taxonomic annotation in reference databases to avoid misleading annotations that can occur when using short amplicon sizes as commonly used in eDNA metabarcoding studies. Finally, we make general suggestions to improve the construction and use of reference databases for metabarcoding studies in the future.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details







1 Department of Aquatic Ecology, Eawag, Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland; Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland
2 Genetic Diversity Centre (GDC), Department of Environmental Systems Science (D-USYS), Federal Institute of Technology (ETH), Zürich, Switzerland
3 Naturhistorisches Museum Bern, Bern, Switzerland; Division of Aquatic Ecology & Evolution, Institute of Ecology & Evolution, University of Bern, Bern, Switzerland
4 Naturhistorisches Museum Bern, Bern, Switzerland; Department of Fish Ecology & Evolution, Eawag, Swiss Federal Institute of Aquatic Science and Technology, Kastanienbaum, Switzerland
5 Division of Aquatic Ecology & Evolution, Institute of Ecology & Evolution, University of Bern, Bern, Switzerland; Department of Fish Ecology & Evolution, Eawag, Swiss Federal Institute of Aquatic Science and Technology, Kastanienbaum, Switzerland