Identifying archived insect bulk samples using

INTRODUCTION

Recent studies have highlighted that insect populations are declining in many parts of the world (Dirzo et al., 2014; van Klink et al., 2020). In the largest global assessment to date, Outhwaite et al. (2022) examined insect species abundance, presence and richness for a wide range of taxa and showed an overall reduction in the abundance (50%) and diversity (27%) of species. Whilst it is acknowledged that the conservation status of insects is nuanced, the lack of data on global diversity and population trends is clear (Cardoso et al., 2011). This dearth of data largely stems from the difficulties associated with monitoring insects coupled with declining taxonomic expertise (Drew, 2011). Recently, DNA-based methods have shown promise as a way to scale up biodiversity monitoring (Bush et al., 2017; Ji et al., 2013), giving rise to the term “next-generation biomonitoring” (Taberlet et al., 2012). Given the rapid effects of human influence on the environment, there is a clear need for faster, more efficient and comprehensive techniques for biodiversity monitoring (Makiola et al., 2020). High-throughput sequencing (HTS)-based biomonitoring approaches like metabarcoding, mitogenomics, and metagenomics offer the advance needed for both insect monitoring and the difficulties associated with surveillance in the context of insect declines (Piper et al., 2019). Despite the recent success of DNA-based approaches, they have not yet been adopted at scale for terrestrial systems because validation is still needed to assess how accurate these methods are for routine monitoring.

HTS approaches have been extensively used for insect identifications (Sigut et al., 2017; Yu et al., 2012; Zhou et al., 2013). HTS-based DNA barcoding is an approach where DNA barcoding is coupled with HTS: a polymerase chain reaction (PCR) step is used to amplify a region of interest for a single target taxon and thousands of insect specimens can be simultaneously loaded on an HTS platform (Srivathsan et al., 2021). This approach is well established and has dramatically increased the number of available barcodes in sequence databases (Shokralla et al., 2014). DNA metabarcoding refers to a similar approach in which DNA barcodes are used to identify whole communities from mixed samples in parallel, including bulk insect samples (such as those from Malaise, pan or suction traps). DNA metabarcoding is still a developing tool, yet it is routinely used for bulk species identification in large-scale studies as it facilitates higher throughput when compared with HTS barcoding or metagenomics (Gueuning et al., 2019). The most commonly used gene for insect identification is cytochrome c oxidase subunit I (COI) for both HTS barcoding and metabarcoding, although other markers are also gaining popularity (Marquina et al., 2019). Metabarcoding facilitates more comprehensive species identification than traditional morphological approaches and, importantly, it scales much more efficiently in terms of both cost and time (Ji et al., 2013). Metabarcoding costs are continuing to decline and manual sample sorting is typically not required, although it can have advantages (e.g., greater detectability of diversity; Majaneva et al., 2018). The scalability of the approach makes it an ideal tool not only for rapid biodiversity assessment but also for rapid diagnostics of pests or non-native species of economic importance (Kitson et al., 2019; Piper et al., 2019).

DNA metabarcoding of insect samples is not without its problems though. It is susceptible to contamination and biases introduced by PCR that can cause: (a) mis-identifications, (b) amplification of non-target taxa, and (c) primer-template mismatches that minimize the potential for metabarcoding results to be quantitative (i.e., infer abundance information from reads; Krehenwinkel et al., 2017). Alternative PCR-free approaches such as metagenomics circumvent the PCR amplification step and its associated biases by instead sequencing whole genomes or large sections of them. Whilst metagenomic approaches are state-of-the-art, they require high-quality DNA (Ji et al., 2020), which is likely to be difficult to obtain from highly degraded archive samples, and they are often prohibitively expensive (Gueuning et al., 2019), making them inaccessible to many monitoring schemes.

Currently, one of the greatest impediments to using metabarcoding for biomonitoring is the destruction of samples in the DNA extraction process, which is often necessary to yield high quality DNA. Advances in non-destructive sample processing, however, show that such approaches can be comparable to tissue homogenization (Carew et al., 2018). Methods can vary considerably from quick tissue digestions to extraction of DNA from the preservative within the sample, but most of the approaches depend on tissue digestion for a minimal amount of time, whereas standard lysis of tissue takes place over minutes or hours (Batovska et al., 2021; Zizka et al., 2018). Non-destructive extractions can introduce additional biases including the increased impact of morphological characteristics such as sclerotization on the detection of taxa within samples (Martoni et al., 2022). Despite these biases, the benefits of such approaches can overcome the drawbacks where preservation of specimens is important (i.e., when the aim is establishment of a species inventory like in many long-term monitoring schemes).

Here, we develop and evaluate the application of DNA metabarcoding to a historic insect monitoring scheme in the United Kingdom: the Rothamsted Insect Survey (hereafter RIS; Harrington, 2013). RIS has been monitoring aphids and moths since the 1960s using national networks of suction and light traps. RIS aims to inform farmers of the timing and magnitude of aphid migrations to prevent heavy prophylactic use of insecticides. We focus on the aphid fraction of suction trap samples over a 16-year time-series (2003–2018). Upon collection of RIS suction trap samples, aphids are morphologically identified to species level and subsequently archived. Our aims are: (a) to establish a non-destructive metabarcoding approach to process historical stored samples; (b) assess the potential of DNA metabarcoding to identify historical aphid samples; (c) compare data between morphological identification and metabarcoding; (d) identify whether the age of samples and sequencing depth limit data quality when using metabarcoding; and (e) assess the added value that molecular approaches offer to insect monitoring schemes by unlocking previously untapped resources of insect specimens.

MATERIALS AND METHODS

The Rothamsted Insect Survey suction-trap samples

The suction trap network currently comprises 16 traps, each 12.2 m tall (12 in England, 4 in Scotland, see Figure 1), that continuously collect flying aphids to estimate their aerial density and provide daily records during the main aphid flying season (April–November), and weekly records year-round (https://insectsurvey.com/). The network has been continuously operational since 1964. Just over 400 of the 600 recorded British aphid species have been recorded from these samples to date. Samples contain both aphids and “bycatch” (i.e., non-aphid taxa), and are archived and made available for further research. Only the aphid fraction of these samples has been consistently identified to species level. Aphids from 1968 to 2002 were cleared for identification purposes in a formalin solution that removed internal tissues, rendering these samples unviable for DNA analysis. From 2003 onwards, however, samples have been preserved in a 95:5 ethanol:glycerol solution, which effectively preserves DNA (Kagzi et al., 2022). All “bycatch” samples have been stored in the ethanol:glycerol solution across all of the years (1968–present). For this reason, we focused on a subsample of the aphid samples for a 16-year time-series (2003–2018) from a single suction trap at Cockle Park, Morpeth (Northumberland, England, UK), hereafter referred to as the Cockle Park trap, where aphid samples have been stored at room temperature.

FIGURE 1. The RIS suction trap network with 16 suction traps across the United Kingdom (trap locations denoted by base of traps). The one used in this study “Cockle Park trap” is highlighted in red with an “N” at its center.

Sample collection

We selected two samples randomly from each month of material archived between May and October for the years 2003–2018, imposing use-case criteria for the number of aphids within those samples. Specifically, sampled dates needed a total number of aphids within one standard deviation of the overall aphid mean count of the corresponding month. This was stipulated for logistical reasons (i.e., loading all samples in a single sequencing run with sufficient coverage) but also to avoid samples with extreme numbers of aphids within tubes (>300). Note that most samples within the Cockle Park trap had fewer than 100 individuals, with few exceptions (see Table A1 in Appendix S1). This facilitated the use of similar reagent volumes across the whole experiment. The resulting time series includes 67% of the genera (68 genera, 98 species) found in the complete daily time series of the Cockle Park trap between 2003 and 2018, which included over 2500 samples (101 genera). This was deemed representative in terms of species coverage across the series. A total of 183 samples were used, equating to approximately 12 samples per year (split into two datasets to compare lysis protocols, described below).
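The selection rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name, the dates, the counts, and the use of the sample standard deviation are all assumptions made for the example.

```python
import random
from statistics import mean, stdev

def eligible_dates(counts_by_date):
    """Return dates whose aphid count lies within one SD of the monthly mean."""
    counts = list(counts_by_date.values())
    mu = mean(counts)
    sd = stdev(counts)  # sample SD; whether the study used sample or population SD is an assumption
    return [d for d, n in counts_by_date.items() if abs(n - mu) <= sd]

# Hypothetical daily aphid counts for one month of one year (made-up numbers)
june = {"06-01": 12, "06-05": 40, "06-11": 35, "06-17": 310, "06-23": 28, "06-29": 45}

candidates = eligible_dates(june)            # the extreme date "06-17" is excluded
rng = random.Random(1)                       # fixed seed only to make the sketch repeatable
picked = rng.sample(candidates, 2)           # two random samples per month
```

Filtering on the monthly mean before drawing at random keeps extreme catches out of the pool, which is what allows similar reagent volumes to be used across samples.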

Non-destructive extraction

As RIS wishes to retain insect samples for future research, we aimed to extract the DNA non-destructively by reducing lysis digestion durations. Damage to specimen tissue is dependent on the time that the tissue is digested for and the lysis buffer used (Carew et al., 2018). We used a magnetic bead-based protocol (Oberacker et al., 2019) with modifications to the lysis volume to adjust it for different sample volumes (in terms of numbers of aphids; Table A1 in Appendix S1). To establish the minimum time required for the lysis digestion before damage to the tissue became visible, five additional samples were selected to test three different lysis digestion times: 1, 2, and 6 h (see Table A2 in Appendix S1), with amplification success confirmed via gel electrophoresis. Morphological damage was assessed via light microscopy in consultation with taxonomists at RIS. Of those samples, those digested for 1 h did not amplify, whereas samples digested for 2 h showed almost no morphological damage and high amplification success. Finally, samples digested for 6 h showed slight tissue digestion damage and high amplification rates (Figure A1 in Appendix S1). For the remaining 183 samples of the time series, we split them randomly into two datasets using the “sample” function in base R (v.4.0.1, R Core Team, 2021): 92 samples were extracted with a 2 h digestion and 91 with a 6 h digestion. This was done to assess any influence of digestion time on overall results. Samples in each treatment therefore comprised approximately six dates per year, together representing each month of the sampled period (12 per year in total for both datasets, except for missing dates due to occasional trap inactivity). The initial digestions were carried out in 1.5-mL tubes, after which 62 μL of lysate was transferred to 96-well plates (irrespective of initial lysis volume) to standardize volumes for the remainder of the protocol, which followed Oberacker et al. (2019) from that point. Finally, for each plate we included a DNA extraction positive and a DNA extraction negative. The DNA extraction positive was tissue from two ichneumonid parasitoid wasps belonging to the genera Acrolyta and Gelis. The DNA extraction negative included all reagents used for the DNA extraction and molecular grade water in place of lysate. Following completion, the yield of all DNA extractions was quantified on a Qubit 4 (Thermo Fisher Scientific) with the 1× High Sensitivity dsDNA assay.
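The random allocation of samples to the two digestion treatments was done with the “sample” function in R. An equivalent step can be sketched in Python as below; the sample identifiers, the function name, and the seed are hypothetical and serve only to illustrate a 92/91 split of 183 samples.

```python
import random

def split_samples(sample_ids, n_first, seed=None):
    """Shuffle the sample IDs and split them into two non-overlapping groups."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_first], shuffled[n_first:]

# Hypothetical identifiers for the 183 time-series samples
ids = [f"S{i:03d}" for i in range(1, 184)]

# 92 samples for the 2 h digestion, the remaining 91 for the 6 h digestion
two_h, six_h = split_samples(ids, 92, seed=42)
```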

PCR amplification and library preparation

For DNA amplification, we followed the nested-tagging method of Kitson et al. (2019), which uses a combinatorial indexing approach to multiplex samples in a single sequencing run. We amplified a 313-bp fragment of the COI gene with the primers mLCOintF and jgHCO2198 (Leray et al., 2013). Primer sequences included molecular identification tags (8 bp), heterogeneity spacers (see Kitson et al., 2019 for details) and bridge sequences for indexing PCR primers. PCRs were carried out for 40 cycles (95°C for 45 s, 51°C for 15 s, and 72°C for 45 s) in 20 μL reactions using a high-fidelity Taq mastermix (MyFi Mix, Bioline), 2 μL of template DNA, and each primer (final concentration at 0.5 μM). To prevent cross contamination, wells were sealed using a drop of mineral oil (~20 μL) before all other reagents and template DNA were added. Two PCR controls (in addition to the above DNA extraction controls) were used per plate: a PCR-positive control (DNA extracted from a moth belonging to the genus Operophtera) and a PCR-negative control which included all PCR reagents but substituted template DNA with molecular biology grade water.
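The idea behind combinatorial tagging can be illustrated with a short Python sketch: each sample receives a unique pair of forward and reverse tags, so a small number of tags covers many samples, and reads carrying an unused pair can be discarded. The 8-bp tag sequences, sample names, and function names below are hypothetical; this is a generic illustration and does not reproduce the tag design of Kitson et al. (2019).

```python
from itertools import product

# Hypothetical 8-bp tags (not the tags used in the study)
forward_tags = ["ACGTACGT", "TGCATGCA", "GATCGATC", "CTAGCTAG"]
reverse_tags = ["AACCGGTT", "TTGGCCAA", "CCAATTGG"]

def assign_tag_pairs(samples, fwd, rev):
    """Give each sample a unique (forward, reverse) tag combination."""
    pairs = list(product(fwd, rev))
    if len(samples) > len(pairs):
        raise ValueError("not enough tag combinations for the number of samples")
    return dict(zip(samples, pairs))

def demultiplex(read_tag_pair, layout):
    """Return the sample for a read's tag pair, or None if the pair was never used."""
    lookup = {pair: sample for sample, pair in layout.items()}
    return lookup.get(read_tag_pair)

samples = [f"sample_{i}" for i in range(1, 11)]   # 10 samples, 12 possible pairs
layout = assign_tag_pairs(samples, forward_tags, reverse_tags)

used = demultiplex(layout["sample_3"], layout)             # maps back to "sample_3"
unused = demultiplex(("CTAGCTAG", "CCAATTGG"), layout)     # pair not assigned to any sample
```

With 4 forward and 3 reverse tags, 12 samples can be distinguished using only 7 tagged primers, which is the practical appeal of a combinatorial layout.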

PCR success was confirmed via gel electrophoresis using 5 μL of PCR product in 1.5% agarose gels. No bands were visible for PCR negatives or DNA extraction negatives. We then conducted a magnetic bead-based normalization using a 0.6:1 ratio of 0.1× Solid Phase Reversible Immobilization (SPRI) beads (9 μL:15 μL beads:PCR product). After purification and prior to library preparation, samples were pooled in groups of 16, for which 4 μL was taken from each sample to form each pre-library. This process generated 12 libraries (6 for each plate). This further increased sequencing diversity during the initial cycles of the sequencing run, as suggested by the sequencing facility (Genomics Core Facility, Newcastle University). To prepare each of these libraries, we carried out a second PCR (PCR2) with 12 cycles (95°C for 45 s, 51°C for 15 s, and 72°C for 30 s) and a final extension step of 5 min at 72°C in 20-μL reactions using 5 μL of each pooled library, the same Taq (MyFi Mix, Bioline) and each of the respective Illumina N5 and N7 adapters (at a concentration of 1 μM). For each library, a PCR2 negative was also included which comprised the same reagents, but with DNA substituted with molecular biology grade water. All libraries and PCR2 negatives were checked via gel electrophoresis and no bands were visible for any of the negatives. We then purified the PCR2 products to remove fragments smaller than the target amplicon using a 0.6:1 ratio of SPRI beads to template (9 μL:15 μL). After cleaning the libraries, successful amplification and purity were checked on an Agilent TapeStation 4200, and libraries were pooled equimolarly at approximately 7.6 ng/μL. The pooled final library was then sequenced on an Illumina MiSeq using a V3 (2 × 300) kit with 500 cycles (2 × 250), with PhiX (a control library derived from a well-characterized bacteriophage genome, used to assess sequencing quality) at 10%. The sequencing took place at the Genomics Core Facility at Newcastle University.

Bioinformatic analysis

Sample demultiplexing within individual libraries was carried out using the software MetaBEAT (https://github.com/HullUni-bioinformatics/metaBEAT). Only reads with used tag combinations were retained. Other analyses were conducted in R (v.4.0.1, R Core Team, 2021) unless stated otherwise. The demultiplexed data were processed using the package DADA2 (Callahan et al., 2016), after removing primers using cutadapt v1.18 (Martin, 2011). DADA2 filtered and trimmed sequences based on read quality, removing any reads with ambiguous “N” bases with the “filterAndTrim” function. We then merged paired-end reads and removed chimeras with the “removeBimeraDenovo” function. Finally, we inferred amplicon sequence variants (ASVs) within DADA2. ASVs are single DNA sequences and can also be considered as haplotypes. They provide a finer resolution for species distinction and include intraspecific variance, unlike other approaches. ASVs offer certain advantages over operational taxonomic units (OTUs), particularly when it comes to reproducibility between studies (Callahan et al., 2017). All functions within DADA2 were used with default arguments. Taxonomy was assigned using the blastn program via the command line (Camacho et al., 2009) with a curated database for all Metazoa downloaded from the MIDORI database (MIDORI2_UNIQ_NUC_GB259_CO1_BLAST) (Leray et al., 2022). To validate the completeness of the reference database for the taxa in question, we checked whether all the species and genera found in the morphological dataset were present in the MIDORI2 database. To assign taxonomy, we kept only the top hit for each ASV, and only where that top hit corresponded to a single taxon (ASVs with more than one taxon assigned were discarded). More specifically, we kept only hits that had at least 95% query cover and at least 97% identity. For ASVs which could not be assigned at the species level, only genus-level information was kept. Helper functions from the packages Taxreturn (Piper, 2023) and Biostrings (Pagès et al., 2023) were used for processing and parsing the output from DADA2 and blastn (see the associated GitHub page for details: https://github.com/DimiPetsop/Archival_aphid_metabar).
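The taxonomy-assignment rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: hits are represented as plain dictionaries with hypothetical field names, "top hit" is taken to mean the highest bit score, and the thresholds are applied after the top hit is chosen. The genus-level fallback is not implemented, and the hit values are made up.

```python
def assign_taxonomy(hits, min_qcov=95.0, min_pident=97.0):
    """Keep one taxon per ASV: unique top hit that passes both thresholds."""
    by_asv = {}
    for hit in hits:
        by_asv.setdefault(hit["asv"], []).append(hit)

    assigned = {}
    for asv, asv_hits in by_asv.items():
        best = max(h["bitscore"] for h in asv_hits)
        top = [h for h in asv_hits if h["bitscore"] == best]
        # Discard ASVs whose best-scoring hits point to more than one taxon
        if len({h["taxon"] for h in top}) != 1:
            continue
        hit = top[0]
        if hit["qcov"] >= min_qcov and hit["pident"] >= min_pident:
            assigned[asv] = hit["taxon"]
    return assigned

# Made-up blastn-style hits for three ASVs
hits = [
    {"asv": "ASV1", "taxon": "Rhopalosiphum padi", "pident": 99.7, "qcov": 100.0, "bitscore": 570},
    {"asv": "ASV1", "taxon": "Rhopalosiphum maidis", "pident": 94.0, "qcov": 100.0, "bitscore": 480},
    {"asv": "ASV2", "taxon": "Taxon A", "pident": 99.0, "qcov": 100.0, "bitscore": 560},
    {"asv": "ASV2", "taxon": "Taxon B", "pident": 99.0, "qcov": 100.0, "bitscore": 560},
    {"asv": "ASV3", "taxon": "Taxon C", "pident": 93.0, "qcov": 98.0, "bitscore": 450},
]

taxonomy = assign_taxonomy(hits)
```

In this toy input, ASV2 is dropped because its best hits tie across two taxa, and ASV3 is dropped because its identity falls below 97%.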

Statistical analyses

Statistical analysis was conducted in the package vegan (Oksanen et al., 2022). To assess whether the non-destructive DNA-extraction method yielded similar data to the destructive method, we performed two ANOVAs with the “aov” function in R. The number of sequencing reads for each sample was compared between the two treatments via ANOVA, with the reads log-transformed using the “log” function in R. Congruence (percentage of common species detected and Jaccard's similarity; see below) was compared between the two treatments in separate ANOVAs.

To assess the sensitivity and accuracy of DNA metabarcoding, we assessed congruence between the morphological (hereafter MOTA) and the metabarcoding (hereafter META) datasets. Congruence was calculated using two methods: congruence as detectability (subjective: % taxa identified morphologically that were detected by metabarcoding) and congruence as similarity (objective: % taxa shared by both methods irrespective of morphological identification). We used base set functions in R (“intersect,” “setdiff”) to identify the frequency of shared taxa (and what percentage of total detections these comprised) for two taxonomic levels: genus and species. Since these functions use character strings, we first standardized taxonomic annotations across the two datasets using the tidyverse package (Wickham et al., 2019). To visualize the differences, we plotted Venn diagrams with the eulerr package (Larsson, 2022) in R; for distribution plots, we used the ggdist package (Kay, 2023). To assess whether the reference database completeness had any influence on congruence, we also calculated it after removing taxa not found in the database. To compare congruence as similarity of the two methods (i.e., based on shared detections including taxa not morphologically identified), we calculated Jaccard similarity in R and visualized the extent of difference via non-metric multi-dimensional scaling (NMDS) using the “metaMDS” command in the package vegan.
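The two congruence measures can be written compactly with set operations. The Python sketch below mirrors the “intersect”-based calculation described above; the function names and the example genus lists are illustrative assumptions, not data from the study.

```python
def detectability(mota, meta):
    """% of morphologically identified taxa that metabarcoding also detected."""
    mota, meta = set(mota), set(meta)
    if not mota:
        return None  # undefined when nothing was identified morphologically
    return 100 * len(mota & meta) / len(mota)

def jaccard_similarity(mota, meta):
    """% of taxa shared by both methods out of all taxa detected by either."""
    mota, meta = set(mota), set(meta)
    union = mota | meta
    if not union:
        return None
    return 100 * len(mota & meta) / len(union)

# Illustrative genus lists for a single sample (made up for this example)
mota = {"Rhopalosiphum", "Drepanosiphum", "Sitobion", "Myzus"}
meta = {"Rhopalosiphum", "Drepanosiphum", "Sitobion", "Aphis", "Cavariella"}

det = detectability(mota, meta)        # 3 of 4 morphological genera recovered
jac = jaccard_similarity(mota, meta)   # 3 shared genera out of 6 in total
```

Detectability ignores taxa found only by metabarcoding, whereas the Jaccard measure penalizes them, which is why the similarity values are lower than the detectability values for the same samples.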

We applied minimum sequence copy thresholds to the read counts of each sample to remove potential false positives (e.g., cross contamination, tag jumping, and sequencing errors) that might have resulted from RIS sample handling, laboratory contamination, and/or sequencing/bioinformatic errors (Drake et al., 2022). For each sample, read counts that fell below a given percentage of that sample's total reads were discarded, with thresholds of 0.2%, 0.5%, and 1% used and compared. The same thresholds were used for both the species- and genus-level data. Congruence as similarity based on the Jaccard index was calculated for both pre-filtered and filtered datasets. To identify the number of false positives and negatives within the pre-filtered and filtered datasets, we counted species mismatches between MOTA and META. Aphid taxa within each sample that were identified by META but not MOTA were treated as false positives, whereas false negatives were aphid taxa identified by MOTA but not found by META. “Bycatch” taxa (i.e., non-aphid taxa) were excluded. To assess whether sequencing reads were correlated with the abundance captured in the traps according to the morphological dataset, we performed a linear regression between these two variables in R after log-transforming both read and aphid counts. Finally, to understand which factors influenced congruence, we used a binomial generalized linear model (GLM) with congruence as the response variable (modelled separately for congruence as detectability and as similarity) and year and sequencing depth (log-transformed reads) as predictor variables, using the “glm” function in R and accounting for a possible interaction between year and sequencing depth (model formula: Congruence ~ year + sequencing depth + year:sequencing depth, family = “binomial”).
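The thresholding and mismatch counting described above can be sketched in Python as below. The read counts and genus names are invented for illustration, the function names are hypothetical, and treating a count exactly at the cutoff as retained is an assumption.

```python
def apply_threshold(read_counts, threshold):
    """Drop taxa whose reads fall below `threshold` (a fraction) of the sample total."""
    cutoff = sum(read_counts.values()) * threshold
    return {taxon: n for taxon, n in read_counts.items() if n >= cutoff}

def mismatches(mota, meta):
    """Return (false positives, false negatives) of META relative to MOTA."""
    mota, meta = set(mota), set(meta)
    return meta - mota, mota - meta

# Made-up genus-level read counts for one sample (10,000 reads in total)
reads = {"Drepanosiphum": 9005, "Rhopalosiphum": 900, "Sitobion": 80, "Aphis": 15}
mota = {"Drepanosiphum", "Rhopalosiphum", "Sitobion"}

fp_raw, fn_raw = mismatches(mota, reads)                       # before filtering
fp_02, fn_02 = mismatches(mota, apply_threshold(reads, 0.002))  # 0.2% threshold
fp_1, fn_1 = mismatches(mota, apply_threshold(reads, 0.01))     # 1% threshold
```

In this toy sample the 0.2% threshold removes the low-read false positive without losing any morphologically confirmed genus, while the 1% threshold also removes a genuine low-abundance detection. This is the trade-off between false positives and false negatives that the thresholds are meant to explore.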

RESULTS

PCR success and sequencing results

Overall, 173/183 samples produced a visible band on a gel (i.e., approximately 95% PCR success). The run produced 23,616,958 reads (including unassigned reads from PhiX), which we reduced to 16,701,112 reads after demultiplexing, and 9,069,035 reads after filtering, denoising, merging, and chimera removal. Variability between samples was high, with 484–256,126 reads per sample, a median of 32,979 (1st quartile: 16,466, 3rd quartile: 71,739) and a mean of 48,498 (± 44,958). Unsurprisingly, the samples with the lowest number of reads were the ones for which amplification was not evident on an agarose gel (10 samples with fewer than 4000 reads). Additionally, no reads from the DNA extraction or PCR negatives passed any initial filters, and taxa used as PCR positives were not found in any of the other samples, suggesting minimal cross contamination. There was no significant difference in the number of reads between digestion treatments (ANOVA: F(1,181) = 0.47, p = 0.49).

Morphology versus metabarcoding

The MIDORI2 database used included sequences for all genera but two (out of 68): Tubaphis and Mimeuria, whilst, for species, 16 (out of 98) did not have sequences in the database. To ensure that the lack of species sequences did not affect the number of detections considered to be false positives and negatives, we also calculated these after removing those species from the morphological dataset (see Tables A7 and A8 in Appendix S1). In total, 8,577,970 (94%) reads were assigned to genus or species level. Reads that were discarded (~6%) belonged to ASVs that were either not assigned at all (~0.3%), discarded during filtering (~4.7%) or had multiple taxonomic matches for the top hit (118,657 reads or ~1%). Of all assigned reads, 8,450,770 (98.5%) were assigned to Hemiptera (families: Adelgidae, Anoeciidae, Pemphigidae, and Thelaxidae) with BLAST: 8,450,753 were assigned to genus level, of which 2,900,789 (34%) were assigned to species level. Only 127,200 (1%) reads were assigned to non-target taxa. META detected 56 unique genera and 94 unique species. In comparison, MOTA included 68 unique genera and 98 unique species (76% congruence for genera and 54% for species, see Figure 2). Of the 16 genera not identified by META, only two had more than three individuals across the time series: Rhopalosiphoninus and Mindarus with six and four individuals, respectively. The remaining genera unidentified by META had fewer than three individuals across the time series, with half having fewer than two individuals (Table A3 in Appendix S1). Most of these genera (11 out of 16) were present in the blastn output file, but they were either not selected as the top hit (Illinoia and Nasonovia) or were filtered out because their values (percent identity and query cover) were lower than the thresholds set (see Table A3 in Appendix S1).
The two dominant genera in the metabarcoding dataset were Drepanosiphum and Rhopalosiphum comprising 57% and 13% of total reads, respectively, which resembled the morphological dataset. However, Drepanosiphum (13% of the total aphid counts) were most abundant in the metabarcoding dataset whilst Rhopalosiphum (41% of the total aphid counts) were most abundant in the morphological dataset (Table A4 in Appendix S1). There was no significant difference in congruence for either of the metrics used at the 0.05 significance level between the destructive and non-destructive extraction protocols (see Table A5 in Appendix S1).

FIGURE 2. Top: Venn diagrams for RIS data generated by morphology and metabarcoding at the genus level (left) and species level (right). Values represent the number of taxa detected by each method or by both. Bottom: Congruence, given as a percentage, based on detectability of taxa identified morphologically across years.

Congruence between metabarcoding and morphology: Detectability and similarity

The percentage of congruence based on detectability between both datasets showed high variability with a mean of 76.7% (± 19.9) for genera. Between years the average percent of congruence varied from 66% in 2004 to 85% in 2011 (Figure 2; Table A6 in Appendix S1). For species-level analyses the mean was only 47% across all years (Figure A2 in Appendix S1). Based on congruence as similarity, mean congruence was 46% at the genus level and 28% at the species level. After applying minimum sequence copy thresholds at the genus level, congruence as detectability changed drastically from a mean of 76% to 52% for the most stringent threshold of 1% (Figure 3). At the species level, a similar pattern was observed with congruence as detectability falling from 47% to around 30% for the 1% threshold. Congruence as similarity remained largely unchanged after applying the thresholds (Figure 3), with a slight increase at genus level at the 0.2% threshold, from 46% to 50% (Figure 3). After removing the species not found in the reference database, patterns remained largely unchanged with congruence as detectability at 49% when no filtering was applied and 31% for the 1% threshold, whilst for similarity the mean remained at 28% when no filtering was applied. NMDS showed that morphological and metabarcoding detections were more similar in some years than others, but no clear interannual pattern emerged (Figure 4). The mean number of taxa considered to be false positive in META was 3.6 whilst there was a mean of 1.8 false negatives. Using minimum sequence copy thresholds, the mean number of false positives fell to 0.6 with a 1% threshold and 1.4 with a 0.2% threshold. The number of false negatives increased to 2.7–3.7 depending on the threshold (Figure A3 in Appendix S1). At the species level, there were 4.3 false positives and 3.7 false negatives on average, and minimum sequence copy thresholds elicited similar overall patterns as for the genus-level data (Figure A3 in Appendix S1). 
The exclusion of the 16 taxa not found in the reference database made no apparent changes to the mean number of false positives and negatives before and after applying the thresholds (see Tables A7 and A8 in Appendix S1).

FIGURE 3. Congruence as detectability (top) and as similarity (bottom) with different minimum sequence copy thresholds applied. Both density and boxplots are shown with jittered individual sample points.

FIGURE 4. NMDS plot showing the difference in taxonomic composition (at genus level) of samples for each year for metabarcoding (META) and morphological (MOTA) datasets. The difference between methods varied in magnitude across time.

Overall, there was a significant positive correlation between the number of individuals in MOTA and the number of reads from META (R² = 0.64, p = 9.9 × 10⁻¹³; Table 1; Figure A4 in Appendix S1). Yet, neither measure of congruence varied significantly with time, sequencing depth, or the interaction between time and sequencing depth (Table 2).

TABLE 1 Results for the linear regression between morphological counts and read counts from metabarcoding (both log-transformed).

Coefficients	Estimate	Standard error	t-Value	Pr(>|t|)
Intercept	−5.86	0.38	−15.27	<2e-16***
Counts morphology	1.01	0.10	9.46	9.95e-13***

***Significance at the 0.0001 level. Adjusted R²: 0.64.
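The kind of log-log regression summarized in Table 1 can be sketched with ordinary least squares in plain Python. The counts and read numbers below are invented purely to show the calculation and do not reproduce the study's data or coefficients; the function name is hypothetical, and the natural logarithm is assumed.

```python
import math

def ols(x, y):
    """Simple least-squares fit of y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up per-taxon totals: individuals counted morphologically and reads recovered
aphid_counts = [5, 12, 30, 60, 150]
read_counts = [2000, 5500, 11000, 30000, 64000]

log_counts = [math.log(c) for c in aphid_counts]
log_reads = [math.log(r) for r in read_counts]

slope, intercept, r_squared = ols(log_counts, log_reads)
```

A slope close to 1 on the log-log scale would indicate that reads scale roughly proportionally with the number of individuals.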

TABLE 2 Results from the binomial generalized linear model for factors explaining congruence for detectability and Jaccard's similarity in parentheses.

Coefficients	Estimates (Jaccard)	Standard error (Jaccard)	z-Value (Jaccard)	Pr(>|z|) (Jaccard)
Intercept	583.56 (451.92)	837.89 (751.65)	0.69 (0.60)	0.48 (0.54)
Year	−0.29 (−0.22)	0.41 (0.37)	−0.69 (−0.60)	0.49 (0.54)
Sequence depth	−59.58 (−42.83)	80.56 (71.44)	−0.74 (−0.59)	0.46 (0.54)
Year:Sequence depth	0.02 (0.02)	0.04 (0.03)	0.74 (0.60)	0.45 (0.54)

Non-aphid taxa in sequences

BLAST assigned 8,450,770 reads (94% of total reads) to Hemiptera, which comprised the target taxa (excluding families: Lygaeidae and Miridae). A total of 126,100 (~1%) reads were assigned to other Arthropoda taxa, and 1100 reads were assigned to Chordata and Ascomycota. Reads that were assigned to non-target taxa included possible contaminants from RIS (human and bird DNA) and other arthropod taxa commonly found within the samples before aphids are separated. The most abundant non-aphid orders were Diptera (64 samples), Hymenoptera (61), and Diplostraca (10). In the case of Hymenoptera, over 30% of our samples had reads of aphid parasitoids. Certain samples were inspected for the presence of non-aphid taxa such as Diplostraca (Daphnia magna) because of the high number of reads attributed to these spurious non-target taxa, which were deemed unlikely to occur within RIS suction traps. These taxa were not found and therefore it is uncertain whether this is environmental, handling, or laboratory contamination. The trap, however, is situated near water bodies. Of the 10 samples that had Daphnia magna reads, one contained 97% (24,333) D. magna reads. Whilst other arthropods including Diptera, Thysanoptera, and Araneae are likely to be legitimate detections from the suction traps, this study was concerned with the identified target aphid taxa and the congruence between morphology and metabarcoding in detecting these aphids.

DISCUSSION

In this study, we have demonstrated that DNA metabarcoding can successfully identify insect species from long-term monitoring archive samples, and that this can be achieved non-destructively. We identified aphids (and other species) that have been archived for more than 18 years (albeit with varying success between years and taxonomic levels) despite the suboptimal storage conditions of the RIS collection which were not primarily intended for DNA preservation (see below). Morphological identification fared slightly better than DNA metabarcoding when assessing the congruence of metabarcoding with morphological identifications, which highlights the benefit of corroborating and combining data across the two approaches (Keck et al., 2022). We do, however, demonstrate that it is possible to recover over 76% of genera and 54% of species within our time series with DNA metabarcoding alone, with reduced reliance on taxonomic knowledge and expedited processing times. Our study further highlights the added value of non-destructive DNA-based approaches for analyzing archival samples from insect monitoring schemes and the importance of such collections, but also their limitations.

A non-destructive approach for collections

To obtain high-quality DNA, destructive methods are usually applied, which has often prohibited processing of monitoring scheme collection samples (Raxworthy & Smith, 2021). This is especially true for older specimens in which the DNA has degraded for many years after sampling, even under optimal preservation conditions. RIS samples are stored in 95:5 ethanol:glycerol solution at room temperature, which, while cost-effective, is not ideal for DNA preservation. We do, however, show that a non-destructive DNA extraction approach can still be used accurately for sample identification irrespective of sample age, with no observed detriments compared to a destructive approach. This is in line with other research advocating non-destructive methods as an alternative for DNA metabarcoding (Martoni et al., 2022).

The approach used here relies on “quick” digestion of samples without noticeable external morphological damage. The yield of DNA extracted is typically influenced by the time of digestion and sclerotization of the specimens themselves (Carew et al., 2018). Aphids are soft-bodied insects and we found that a 2-h digestion was sufficient for species detection and reduced sample processing time. For more complex communities such as RIS “bycatch” samples, where insect sclerotization will vary greatly, longer digestions might be needed. Reducing the time needed from collection to identification can be vital for DNA-based monitoring, for example, in the case of invasive species (Piper et al., 2019). Importantly, the non-destructive approach also safeguards the continuity and reusability of archival samples and allows retention of voucher specimens for post-hoc morphological confirmation of molecular identifications or collection of morphological trait data. There are other approaches that are even faster than that presented here (see Batovska et al., 2021), but their efficacy with degraded samples is unknown without further validation.

Looking back in time: Taxonomy versus metabarcoding

There was no clear relationship between time and the congruence of taxonomy and metabarcoding (Figure 2): relationships differed markedly between years, and the lack of pattern held for both congruence metrics used. This is important for collections-based research and particularly for RIS, which has been archiving samples since the 1960s. Here, we successfully analyzed samples from one trap across a 16-year period. The aphid fraction had already been identified, as is the case for all of the aphid fractions in the RIS archive. The daily catches of aerial insects from all 16 traps are collected and archived, and most of the archived insects remain unidentified due to the huge effort that morphological identification would require. Since long-term data are lacking despite the urgent need to monitor biodiversity loss in the wake of major global change, unlocking the potential of those samples via DNA metabarcoding would open new avenues for insect decline research (Petsopoulos et al., 2021) and fill gaps in insect population and distribution records.
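The two congruence metrics are not defined in this section. As a minimal sketch, assuming the first is the proportion of morphologically identified taxa that metabarcoding also recovered, and the second is a Jaccard-type similarity that additionally counts taxa detected only by metabarcoding, they could be computed per sample as follows. The function names and the example genera sets are hypothetical and for illustration only; they are not the values or code used in this study.

```python
def detection_rate(morph: set[str], meta: set[str]) -> float:
    """Share of morphologically identified taxa that metabarcoding also detected."""
    if not morph:
        raise ValueError("morphological taxon set is empty")
    return len(morph & meta) / len(morph)


def jaccard_similarity(morph: set[str], meta: set[str]) -> float:
    """Shared taxa divided by all taxa reported by either method (assumed metric)."""
    union = morph | meta
    if not union:
        raise ValueError("both taxon sets are empty")
    return len(morph & meta) / len(union)


# Hypothetical sample: genera recorded by each method (illustrative values only).
morph_genera = {"Drepanosiphum", "Rhopalosiphum", "Sitobion", "Myzus"}
meta_genera = {"Drepanosiphum", "Rhopalosiphum", "Sitobion", "Adelges", "Pineus"}

rate = detection_rate(morph_genera, meta_genera)
similarity = jaccard_similarity(morph_genera, meta_genera)
print(f"detection rate: {rate:.2f}")
print(f"Jaccard similarity: {similarity:.2f}")
```

In this toy sample the second value is lower than the first because taxa reported only by metabarcoding enlarge the denominator, which is one way the two kinds of metric can diverge.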

The morphological dataset contained 16 genera unidentified by metabarcoding. This could be due to database coverage and technical limitations such as PCR bias and primer-template mismatches (Alberdi et al., 2018). In this study, most genera (66 out of 68) were present in the database used, and 11 of those unidentified by metabarcoding were discarded during assignment, either because of the constraints imposed or, in most cases, because the taxa were not selected as the top hit. Although our study did not check barcode quality, the MIDORI2 database is an already curated database with strict quality controls that updates regularly (Leray et al., 2022). PCR bias and primer-template mismatches are known to be problematic for aphids using many conventional PCR primers (Batovska et al., 2021), but a full pairwise PCR primer comparison was beyond the scope of this study and various aphid-specific primer pairs are already available (Ammann et al., 2020). Unidentified genera in the metabarcoding dataset were mostly rare taxa (1–2 individuals in the samples in which they were found, accounting for 0.4% of total aphid counts throughout the sampled time series), which can be difficult to detect when the samples are dominated by other species. At the species level, congruence was lower with both of the metrics used. However, congruence remained largely unchanged after removing all species from the morphological dataset that were not found in the database, suggesting that, at least for this study, database coverage is unlikely to be a significant factor affecting congruence. We must note, however, that the resolution of the barcode region chosen here has not been thoroughly investigated for aphids with in silico validation, but it is considered to be one of the most common regions for identifying European aphids (Coeur d'acier et al., 2014). To validate further whether the incongruence is due to biases in the metabarcoding workflow or simply misidentifications in the morphological dataset, comparison of the methods along with single-specimen DNA barcoding is needed.

Surprisingly, sequencing depth and year were not found to affect congruence significantly, meaning other factors not assessed in this study are more influential. Approaches that partition bias throughout protocol steps (Martoni et al., 2022) could potentially identify influential factors such as primer mismatch, biomass or insect sclerotization. Our study was not, however, designed to address this. Finally, there was a strong correlation between sequencing reads and counts when compared across the whole dataset, with the two dominant taxa (Drepanosiphum and Rhopalosiphum) representing the most abundant taxa in both the metabarcoding and morphological datasets. We did show, however, that Drepanosiphum was the most abundant in META whilst Rhopalosiphum was the most abundant in MOTA. This might have been a result of differences in body size or amplification biases, which are known to affect the quantitative potential of metabarcoding datasets (Lamb et al., 2019) and might be even more apparent for the “bycatch” available at RIS (Petsopoulos et al., 2021). Overall, however, our study demonstrates that metabarcoding archived bulked samples shows considerable potential for unlocking insect time-series data.

False positives versus false negatives

The sample handling process at RIS, which pre-dates the advent of molecular ecology, means that some cross contamination is inevitable. Given its sensitivity due to the PCR amplification step, metabarcoding is usually very prone to this type of contamination. Since aphids were physically separated from other taxa, theoretically this should also have limited the detectability of non-target taxa. Even with this treatment, however, reads were assigned to other arthropod taxa and common laboratory contaminants (e.g., human). No reads were found in our negative controls, likely due to the stringent measures employed to prevent contamination during the metabarcoding workflow; therefore, we believe that this contamination arose prior to DNA-based analysis and can likely be attributed to sample handling in RIS. Some taxa found in our study besides aphids include insects commonly trapped in RIS and, in rare cases, re-examination of the samples under a microscope confirmed that detected insects such as chironomids and thrips were present in the aphid samples.

Of particular interest were braconid parasitoids, of which seven aphid parasitoid genera had reads in more than 30% of the samples. This could either be contamination from the “by-catch” fraction of the samples before the aphids were separated into different tubes, or may represent detection of parasitism of flying aphids which is known to occur (Walton et al., 2011). If parasitized aphids are present within the tubes then there is a unique opportunity to construct long-term host–parasitoid interaction networks (Petsopoulos et al., 2021). There are opportunities to identify likely interactions based on probabilistic species co-occurrence analysis, but co-occurrence is not strictly evidence of interaction (Blanchet et al., 2020). Confident validation of these interactions would require a different approach to that presented here, with single aphid individuals processed via high-throughput DNA barcoding to validate whether parasitism is detectable within individuals, or if these detections are simply contamination.

Perhaps the most limiting factor in this study is cross contamination between aphid species themselves. The DNA metabarcoding dataset generated did, in some cases, detect more aphid species than the morphological identification, or identified completely different species. In 125 samples, for example, DNA metabarcoding detected more unique genera than the morphological identification. Examples of genera not identified morphologically include Pachypappa, Pineus, Hyalopteroides, and Adelges. Some of these, such as Adelges, have been found at RIS traps but have not been identified from the samples analyzed here. In our case, because the samples have already been identified morphologically, we can use that information to categorize detections as false positives. By then applying minimum sequence copy thresholds (here applied as a percentage of reads within a sample), the prevalence of these false positives can be minimized (Drake et al., 2022). This process did reduce the taxa identified only by metabarcoding (likely false positives), but also significantly reduced congruence between the datasets as rare true positive taxa were also lost (false negatives, Figure A3 in Appendix S1). False negatives are a significant issue when applying filtering thresholds to metabarcoding datasets and can be just as problematic as false positives for ecological interpretation (Littleford-Colquhoun et al., 2022). Given this need for nuanced approaches to data filtration, measured approaches are emerging, the most important requirement of which is inclusion of stringent experimental controls (González et al., 2023). Identifying robust standardized methods for data filtration with an appropriate balancing of false positives and negatives is an urgent need for the establishment of rapid, repeatable and robust metabarcoding-based biomonitoring.
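The trade-off described above can be illustrated with a minimal sketch of a per-sample minimum sequence copy threshold expressed as a proportion of reads. The function name, the read counts, the threshold values and the use of the morphological list as the reference are all assumptions made for illustration; they do not reproduce the thresholds or data used in this study.

```python
def filter_by_read_proportion(sample_reads: dict[str, int], min_proportion: float) -> dict[str, int]:
    """Keep taxa whose reads make up at least min_proportion of the sample's reads."""
    total = sum(sample_reads.values())
    if total == 0:
        return {}
    return {taxon: n for taxon, n in sample_reads.items() if n / total >= min_proportion}


# Hypothetical read counts for one sample (illustrative values only).
sample = {"Drepanosiphum": 9000, "Rhopalosiphum": 900, "Sitobion": 60, "Adelges": 40}
# Hypothetical morphological identifications for the same sample, used as the reference.
morph = {"Drepanosiphum", "Rhopalosiphum", "Sitobion"}

for threshold in (0.0, 0.005, 0.01):
    kept = set(filter_by_read_proportion(sample, threshold))
    presumed_false_positives = kept - morph   # detected by metabarcoding only
    false_negatives = morph - kept            # identified morphologically but filtered out
    print(threshold, sorted(presumed_false_positives), sorted(false_negatives))
```

With these toy numbers, no threshold retains the presumed false positive, an intermediate threshold removes it while keeping every morphologically identified genus, and a stricter threshold also removes a rare genus that was genuinely present. In real data there is rarely a single value that separates the two cases cleanly, which is the difficulty the text describes.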

Further research could validate whether the presumed false positives detected in this study are actual contamination or if they represent true incongruence between morphological and metabarcoding datasets by processing individual aphids within samples. Although there are approaches that can further minimize or identify contamination, such as including technical replicates (e.g., PCR replicates; Yang et al., 2021), the contamination in RIS has likely been introduced during sample processing. This is a significant challenge in the application of metabarcoding to archival samples which have not been processed or stored with this application in mind. Another potential solution to this problem would be avoiding the PCR step altogether. Metagenomics, by circumventing the PCR amplification step, can achieve multi-taxon identification based on whole-genome sequencing. Ji et al. (2020) apply a metagenomic approach to another insect monitoring scheme that also “suffers” from the same type of contamination within archival samples. Whilst metagenomic approaches are state-of-the-art, they require high-quality DNA, which is difficult to obtain from highly degraded archive samples. We have, however, shown that DNA from even the oldest samples in this study is amplifiable (with a 313 bp amplicon), and metagenomics could therefore be possible for RIS samples and other archival material. This study highlights the overall potential of using HTS approaches on insect archival samples, with considerable applications for understanding insect responses to environmental change (Petsopoulos et al., 2021). RIS represents an archive of tens of thousands of daily bulk insect samples (with all “bycatch” samples stored in the same ethanol:glycerol solution) and therefore an unprecedented potential to construct time series for thousands of insect species. Whilst metagenomics may currently be prohibitively expensive to apply at that scale, the value of such data would be unquestionably high. With decreasing sequencing costs, these approaches could ultimately become a viable alternative with unprecedented potential for less biased and less contamination-prone molecular identification with vast improvements to taxonomic resolution.

CONCLUSIONS

Our study is the first attempt to assess the efficacy of DNA metabarcoding for determining species identities of mixed samples from long-term stored aerial suction-trapped insects. We showed high congruence between metabarcoding and morphological identification across years using non-destructive methods, demonstrating the massive potential of metabarcoding for enhancing our understanding of long-term insect trends using archival samples such as those of RIS. The greatest limitation of this approach is sensitivity to historical contamination which likely arises from handling and processing of samples prior to widespread adoption of molecular methodologies. With this understanding, we can better inform such processes and reduce contamination of future samples by applying best practices. The archival collection of RIS includes thousands of unidentified insect bulk samples (“bycatch”) that could be processed using metabarcoding, unlocking decades of unobserved insect population trend data. The temporal (50+ years of daily samples) and spatial (16 locations across the entire United Kingdom) characteristics of this archive make RIS a treasure vault for insect research. Perhaps RIS is unique in that sense, but other insect monitoring schemes exist globally. Our study highlights how samples from such schemes can be explored in greater depth and breadth via non-destructive DNA metabarcoding. In this pursuit, however, we must remain cognizant of the persistent need for morphological identification data to maintain contextual information not available via molecular analyses, prevent data loss to overly stringent filtering thresholds and ground-truth molecular data for rapid, robust, and repeatable biomonitoring.

AUTHOR CONTRIBUTIONS

This study was conceptualized and designed by D.P., J.R.B., J.N.K., L.C., R.M.H., and D.M.E. D.P. performed all the molecular work and analyses and wrote the first draft. J.N.K., J.P.C., and D.M.E. also contributed to the interpretation of the data. All authors contributed to revising the first draft before submission.

ACKNOWLEDGMENTS

Firstly, we would like to acknowledge Alex Greenslade from the Rothamsted Insect Survey (RIS) team for providing details on the morphological species list at Rothamsted and Chris Shortall for helping to streamline sample collection from RIS. Secondly, we would like to acknowledge all the entomological team (past and present) at Rothamsted, which has been identifying and archiving all these specimens for decades. Finally, we would like to acknowledge that the UK Rothamsted Insect Survey, a National Capability, is funded by the Biotechnology and Biological Sciences Research Council under the Core Capability Grant BBS/E/C/000J0200. Dimitrios Petsopoulos was supported by a studentship funded by the Institute for Agri-Food Research and Innovation at Newcastle University (Studentship reference number: EJU/180494123).

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

The data presented in this manuscript are available online via Zenodo: https://zenodo.org/records/10995475.

© 2024. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Insect populations are declining in many parts of the world, but a lack of long-term monitoring data is impeding our ability to understand and mitigate the causes of insect biodiversity loss. Whilst high-throughput sequencing (HTS) approaches, such as DNA metabarcoding, have the potential to revolutionize insect biomonitoring through rapid scalable identification, it is unclear to what extent HTS can be applied to long-term stored insect samples. Archived insect samples could inform forecasting and provide valuable information regarding past changes to biodiversity. Here, we assess the efficacy of DNA metabarcoding to identify archived samples from the longest passive monitoring scheme in the United Kingdom: the Rothamsted Insect Survey (RIS). With a focus on aphids as the target taxa of a national network of suction-traps, we analyze a 16-year time-series of stored samples (2003–2018) using DNA metabarcoding from one of the RIS suction traps as an exemplar. We achieved this by using a non-destructive DNA extraction protocol, ensuring the integrity of archival samples for further studies. We compared the identities of aphids determined by both metabarcoding (as inferred amplicon sequence variants [ASVs]) and morphological identification and found that metabarcoding detected most genera with varying success (mean > 76%). When comparing the two methods objectively (i.e., including taxa not detected morphologically), however, congruence decreased (51%). We show that minimum sequence copy thresholds can minimize metabarcoding false positives, but at the expense of introducing false negatives, highlighting the need for careful data curation. Detectability of taxa identified morphologically and similarity between the two methods did not significantly vary over time, demonstrating the viability of metabarcoding for screening archival samples. We discuss the advantages and challenges of metabarcoding for insect biomonitoring, particularly from archival samples, including improvements to sample handling, processing, and archiving. We highlight the wider potential of HTS approaches for stored samples from insect monitoring schemes, unlocking the immense potential of global historical time series.

Details

Title

Identifying archived insect bulk samples using DNA metabarcoding: A case study using the long-term Rothamsted Insect Survey

Author

Petsopoulos, Dimitrios¹; Cuff, Jordan P¹; Bell, James R²; Kitson, James J N³; Collins, Larissa⁴; Boonham, Neil¹; Morales-Hojas, Ramiro⁵; Evans, Darren M¹

¹ School of Natural and Environmental Sciences, Newcastle University, Newcastle-upon-Tyne, UK
² Rothamsted Insect Survey, Rothamsted Research, Harpenden, UK
³ School of Natural and Environmental Sciences, Newcastle University, Newcastle-upon-Tyne, UK; Fera Science Ltd, York, UK
⁴ Fera Science Ltd, York, UK
⁵ GenPax AG, England, UK

Section

ORIGINAL ARTICLES

Publication year

2024

Publication date

May 2024

Publisher

John Wiley & Sons, Inc.

e-ISSN

26374943

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/edn3.542

ProQuest document ID

3072688018
