Differences in DNA sequence and structure among individuals within species, referred to as genetic variation, serve as the basis for key evolutionary mechanisms such as speciation and local adaptation (Barrett & Schluter, 2008). Genetic variation can be described as a wide spectrum of variants of various sizes, ranging from single nucleotide polymorphisms (SNPs) to larger structural variants (SVs), which may span megabase-long stretches of DNA or even whole chromosomes (Feuk, Carson, & Scherer, 2006; Mérot et al., 2020; Wellenreuther & Bernatchez, 2018). SVs such as insertions, deletions, duplications and inversions are now recognized as the main component of genetic variation, as they affect at least two to eight times more bases in genomes than SNPs (Catanach et al., 2019; Hämälä et al., 2021; Mérot et al., 2020). This estimate tends to increase as our ability to detect SVs from high-throughput sequencing data is constantly improving (Ho et al., 2020; Mérot et al., 2020).
SVs are also known to have a broad range of consequences at various biological levels. At the molecular scale, they may influence gene dosage, gene expression, DNA interactions and tridimensional structure by altering genetic elements' proximity and copy number (Feuk, Marshall, et al., 2006; Gamazon & Stranger, 2015; Spielmann et al., 2018). SVs that disrupt collinearity between homologous chromosomes, especially large inversions, are also likely to restrict or suppress recombination (Crown et al., 2018; Rowan et al., 2019). This may result in an apparent reduced gene flow around SVs, which may link co-adapted alleles, thus promoting the formation of supergenes that may underlie complex and adaptive phenotypes (Kirkpatrick & Barton, 2006; Rieseberg, 2001; Thompson & Jiggins, 2014).
A growing body of evidence also suggests that SVs can be involved in evolutionary mechanisms in various species (reviewed in Wellenreuther & Bernatchez, 2018). For instance, supergenes arising from large inversions have been linked to adaptive variation in wing color patterns in Heliconius butterfly (Joron et al., 2011), to migratory behavior in rainbow trout (Oncorhynchus mykiss; Pearse et al., 2014; Pearse et al., 2019) and Atlantic cod (Gadus morhua; Kirubakaran et al., 2016; Berg et al., 2017), as well as to reproductive strategies in the ruff (Philomachus pugnax; Küpper et al., 2016) and the white-throated sparrow (Zonotrichia albicollis; Tuttle, 2003). Other key examples of inversion polymorphism involved in ecotype divergence and local adaptation have been documented in the seaweed fly (Coelopa frigida; Mérot et al., 2018) and in three-spined stickleback (Gasterosteus aculeatus; Jones et al., 2012). Besides large inversions, copy number variants (CNVs) have also been linked to adaptation to local temperature regimes in American lobster (Homarus americanus; Dorant et al., 2020) and to glacial lineage divergence in capelin (Mallotus villosus; Cayuela et al., 2021). Industrial melanism in peppered moths (Biston betularia), a textbook example of rapid adaptation to environmental change, has been associated with an intronic insertion in the cortex gene (Van't Hof et al., 2016). Similarly, a 2.25-kb intronic insertion would explain color pattern divergence among lineages in the Corvus genus, promoting reproductive isolation and thus leading to speciation (Weissensteiner et al., 2020).
Despite such well-documented cases of adaptive genomic rearrangements, most SVs other than large inversions remain understudied in a population genomics context. Relative to SNPs, very little is known about how such a large component of genetic variation is distributed within and between wild populations. Indeed, while standard procedures and pipelines are available for population-scale SNP calling, SV detection and genotyping, on the other hand, involve significant challenges for large, multisample datasets (Ho et al., 2020; Mahmoud et al., 2019).
Calling SVs requires specialized software due to their complexity and diversity in type and length. SV callers rely on various signals of discordance in read mapping relative to a reference genome in order to infer SVs in a given sample (Lin et al., 2015; Mahmoud et al., 2019). Because short-read sequencing is widely available and affordable, it is an appropriate technology for population-scale study of genetic variation. However, the performance of short-read-based SV callers is highly variable (Cameron et al., 2019). They are known to lack sensitivity, as true positive detection rates can be as low as 10% (Huddleston et al., 2017; Sedlazeck, Rescheneder, et al., 2018). They also show low precision, with high false discovery rates reaching 89% for some datasets (Mills et al., 2011), especially for calls near SNPs, indels, low-complexity regions and repeats (Cameron et al., 2019). In fact, short reads are hard to map to the reference genome owing to their small length, especially when they include numerous sequencing errors, repeats (Sedlazeck, Lee, et al., 2018) or sequences differing considerably from the reference, such as SVs. Spurious mapping may result in underreporting of variation, an issue known as reference allele bias (Brandt et al., 2015; Nielsen et al., 2011). Calling SVs in a given dataset using multiple callers (ensemble calling), may increase the range of SV types and sizes detected or reduce false discovery rate compared to single-tool SV calling. For instance, SV callsets may be merged across callers, then filtered for calls supported by a given minimum number of tools (Auton et al., 2015). However, the improvement in sensitivity and/or precision strongly depends on the callers used in combination (Kosugi et al., 2019; Mahmoud et al., 2019).
By contrast, recent advances in third generation sequencing platforms (Oxford Nanopore and Pacific Biosciences' technologies) have brought significant improvements regarding SV calling. Long reads span kilobase-long segments of DNA, thus fully overlapping SVs and their breakpoints, which considerably facilitates read mapping (Sedlazeck, Lee, et al., 2018) and improves sensitivity of SV detection (Mahmoud et al., 2019), especially for novel insertions (Ho et al., 2020). Specialized algorithms and pipelines have been developed to process long reads and account for their length and higher sequencing error rate (Delahaye & Nicolas, 2021; Rang et al., 2018). However, high costs prevent long-read sequencing from becoming a routine tool for population-scale SV studies for species with large genomes such as salmonid fishes (usually around 3 Gb), which requires the sequencing of many genomes, namely, for accurate estimates of allele frequency.
To provide an adequate balance between accurate SV characterization and genotyping in large datasets, emerging hybrid approaches can be considered, such as pairing affordable short-read sequencing for all samples with high performance third generation sequencing for a small subset of genomes only. Candidate SVs called from long reads can then be genotyped in all samples from short-read data using pangenome graphs, which offer considerable advantages over conventional, linear reference-based methods. Indeed, in a reference pangenome, the reference genome is represented as a base graph structure where known variants and alternate alleles are encoded as alternate paths, i.e., series of nodes and links (Paten et al., 2017). The integration of known genetic variation within the reference greatly facilitates mapping of reads that overlap such variants, thus improving both SV detection and genotyping, and reducing reference allele bias (Ameur, 2019). This approach has shown promising results for genome-wide population-scale SV detection in human (Homo sapiens; Yan et al., 2021), soybean (Glycine max; Lemay et al., 2022), lake whitefish (Coregonus clupeaformis; Mérot et al., 2023) and in kākāpō parrots (Strigops habroptilus; Wold et al., 2023).
Knowledge pertaining to SVs remains minimal in salmonid fishes, despite their genomes being extensively studied for aquaculture applications. The first comprehensive catalog of genome-wide SVs for Atlantic salmon (Salmo salar) was produced by Bertolotti et al. (2020) by calling putative SVs using short-read-based caller LUMPY (Layer et al., 2014) in 492 wild and domestic salmon from various populations in Europe and North America. SV calls were then manually curated with SV-plaudit (Belyeu et al., 2018) in order to eliminate false positives, yielding 15,483 high confidence SVs matching the expected population structure. This study also revealed a subset of outlier SVs overlapping genes enriched for brain expression, suggesting an implication in salmon domestication. Other population-scale SV catalogs were published for the rainbow trout (Oncorhynchus mykiss; Liu et al., 2021), and two sympatric sister species of lake whitefish (Coregonus sp.; Mérot et al., 2023).
Further work is required in order to fully appreciate SVs' relevance in the genomics and biology of Atlantic salmon, which could serve as an ideal candidate species for studying adaptive SVs and developing an efficient population-scale SV discovery pipeline. SVs and larger chromosomal rearrangements are likely a key feature of salmonid genomes, as they are critical to the reploidization process following the salmonid-specific fourth vertebrate whole-genome duplication that occurred at least 60 million years ago (Ss4R) (Allendorf & Thorgaard, 1984; Crête-Lafrenière et al., 2012; Lien et al., 2016). Sequence repeats, which account for 50 to 60% of the Atlantic salmon genome (de Boer et al., 2007), are also known to promote SV formation (Levy-Sakin et al., 2019). Moreover, Atlantic salmon display considerable life history trait variation both within and between wild populations (Klemetsen et al., 2003). Consequently, there is considerable interest in understanding the genetic architecture of traits such as growth rate and disease resistance for aquaculture (Gjedrem & Rye, 2018), but also in the context of local adaptation, which usually involves such life history trait variation (Fraser & Bernatchez, 2005; Lu & Bernatchez, 1999; Taylor, 1991). Local adaptation is expected to be a major driver of population structure in Atlantic salmon, given its homing behavior (Allendorf & Waples, 1996) and the variability in habitat conditions (Kawecki & Ebert, 2004; Taylor, 1991). Indeed, the association between the genetic structure of seven groups of local salmon populations in Eastern Canada and regional rivers' environmental parameters suggests adaptive SNP divergence among these groups (Bourret et al., 2013; Dionne et al., 2008). Previous studies have also highlighted a few adaptive large chromosomal rearrangements in wild Atlantic salmon populations (Watson et al., 2022; Wellband et al., 2019), as well as divergent SVs between domestic and wild populations (see Bertolotti et al., 2020). However, SVs' contribution to local adaptation remains poorly documented among North American populations, especially at a finer geographic scale (e.g., within neighboring rivers).
Two parapatric Atlantic salmon populations from the Romaine and Puyjalon rivers (Québec, Canada; 50.306337, −63.795602; Figure 1b) represent a prime case of putative fine-scale local adaptation. Indeed, admixture analysis and fixation index calculation (FST = 0.036) based on microsatellite markers showed moderate differentiation between Romaine (RO) and Puyjalon (PU) salmon, despite their geographical proximity and habitat connectivity (Albert & Bernatchez, 2006). Furthermore, they exhibit different trade-offs in major life history traits: earlier age at smoltification and sexual maturity have been reported among wild Romaine salmon (Belles-Isles et al., 2004; Fontaine et al., 2000; WSP Global, 2019), as well as in wild-born Romaine salmon reared in a hatchery environment at the LARSA (Laboratoire de Recherche en Sciences Aquatiques; Université Laval, Québec), whereas wild-born Puyjalon salmon have shown higher growth rates over several cohorts in the same hatchery conditions (T. Dion, Chayer, et al., 2020; T. Dion, Langlois-Parisé, & Proulx, 2020; Langlois-Parisé et al., 2018; Therrien et al., 2017). The persistence of such life history trait variation among cohorts in both wild and controlled environments strongly suggests heritable genetic variation likely linked to local adaptation, as the Romaine and Puyjalon rivers differ in spawning habitat quality, substrate and hydrological parameters (Belles-Isles et al., 2004; Fontaine et al., 2000; GENIVAR, 2002; Schieffer, 1975; WSP Global, 2019). However, the genetic basis of this putative local adaptation has yet to be investigated.
FIGURE 1. Overview of (a) polymorphism detection pipelines used for population-scale characterization of structural variants (SVs), single nucleotide polymorphisms (SNPs) and small indels within the genomes of Romaine (RO) and Puyjalon (PU) salmon (SR: short reads; LR: long reads), (b) location of the Romaine and Puyjalon rivers (red dot) in Québec, Canada, and (c) comparative genomics analyses performed on catalogued variants (FST, fixation index; GO, Gene Ontology; PCA, principal component analysis; RDA, redundancy analysis).
Here, we address this lack of knowledge by proposing a multiplatform, graph-based SV discovery pipeline across numerous genomes (Figure 1a) in order to catalog genetic polymorphism in Romaine and Puyjalon salmon, allowing us to investigate candidate adaptive variation within these populations. With this approach, we primarily targeted small (50–1000 bp) to intermediate-sized SVs (<5 kb), as direct SV calling based on short reads and long reads is more accurate and powerful in this range of length (Mahmoud et al., 2019). This study thus served as an unprecedented opportunity to characterize SVs, SNPs and small indels in North American Atlantic salmon, as well as to explore the relative contribution of various forms of genetic variation to fine-scale adaptation and population differentiation.
MATERIALS AND METHODS Sampling,Manipulations involving fish were authorized by the Comité de protection des animaux de l'Université Laval (permit number: 2021–783). Adipose fin clips were sampled from 60 wild-born adult salmon raised as broodstock at Université Laval's Laboratoire de Recherche en Sciences Aquatiques (LARSA) and stored in ethanol until use. The samples comprised 31 Puyjalon (16 males and 15 females) and 29 Romaine (14 males and 15 females) individuals.
Spin column DNA extractions were performed using Qiagen's DNeasy blood and tissue kit according to the manufacturer's protocol, with the exception of the elution step, which was done twice per sample with 50 μL of water. DNA quality was assessed by concentration measurement and migration on 1% agarose gel. DNA samples were then diluted to 10 ng/μl and sent to Génome Québec's Centre d'expertise et de services (Montréal, Canada) for library preparation and whole genome sequencing on an Illumina NovaSeq6000, using four S4 PE150 lanes for an anticipated depth of 16X per sample.
Long readsAmong the 60 fish sampled for whole genome short-read sequencing, four (one male and one female for each population) were used for Nanopore long-read sequencing. In order to provide intact high molecular weight DNA, whole blood was extracted from live fish using EDTA-prefilled syringes, followed by humane euthanasia by decapitation. Blood samples were flash-frozen in liquid nitrogen, transferred to storage tubes and stored at −80°C until use.
High molecular weight DNA extraction was performed twice for each fish using Circulomics' CBB protocol for nucleated blood (EXT-NBH-001; Circulomics, 2021), from 6 μL of blood mixed with 194 μL of ice-cold PBS. DNA quality was assessed by measuring concentration with Qubit and migrating DNA on a 0.5% agarose gel. DNA samples were then sent to the Centre for Integrative Genomics (CIGENE) at the Norwegian University of Life Science (NMBU) for sequencing. DNA fragments shorter than 25 kb were removed by size selection with Circulomics' Short Read Eliminator kit, and seven libraries were prepared for each sample using the SQK-LSK110 kit (Oxford Nanopore Technologies).
Sequencing was performed on a PromethION24 in short serial runs following protocol NFL_9076_v109_revA. Each sequencing run was terminated after a few hours, when the number of active pores dropped to below 10%, in order to recover pores by nuclease-flushing flow cells, which were then refilled with the same DNA preparation for a next short sequencing run. Two FLO-PRO002 flow cells were used for each sample, which were each filled with six and five loadings, respectively, in order to obtain an approximate coverage of 20X. Basecalling was done with Guppy version 5.0.13 (high-accuracy basecalling model) and raw reads were filtered for a minimum qscore of nine. The average yield for the four samples was 47.1 Gb of DNA, while the mean N50 was 39.5 kb.
Characterization of genetic variation Raw sequencing data preprocessing Short readsThe Ssal_Brian_v1.0 assembly, derived from a North American wild salmon from Newfoundland (Norwegian University of Life Sciences, 2022; GenBank assembly accession: GCA_923944775.1; project accession: CAKLZZ000000000.1), was used as the reference genome for all downstream analyses. This genome features 28 chromosomes with two known polymorphic rearrangements, i.e., the translocation of chromosome ssa01's p arm (ssa01p) to ssa23 (ssa01-23) (Lehnert et al., 2019), and the fusion of chromosomes ssa26 and ssa28 (Brenna-Hansen et al., 2012).
Raw Illumina data was processed using the wgs_sample_preparation pipeline (
Since each sequencing run produced multiple raw read files, all fastq files obtained for a given sample were first concatenated to yield a single fastq file per sample. Raw reads were filtered for an average minimum quality of 10 and a minimum read length of 1000 bp using NanoFilt 2.0.8 (De Coster et al., 2018). We mapped filtered reads to the Ssal_Brian_v1.0 assembly with Winnowmap version 2.03 (Jain et al., 2020, 2022) using default parameters and a k-mer size of 15 (−k 15). The complete preprocessing pipeline (ONT_data_processing v1.0.0) can be found at
SNPs and small indels were called exclusively from short-read data, as higher basecalling error rates in long-read data are likely to interfere with SNP detection (Ahsan et al., 2021; Rang et al., 2018). Variant calling was performed in all 60 samples at once and for each chromosome separately, using bcftools mpileup and call (version 1.16) and requiring a minimum mapping quality of five at a given site (−q 5). The 28 single chromosome VCF files were then concatenated with bcftools concat.
In order to apply the same filtering criteria as SVs (described below), samples without at least four supporting reads and a minimum genotype quality of five for a given variant, or that had more than five times the anticipated whole genome short-read sequencing coverage (80) or an exceedingly high genotype quality (GQ = 127), were assigned the genotype “missing” (“./.”) using bcftools +set-GT (version 1.15). Finally, we kept SNPs and small indels that had a minor allele frequency between 0.05 and 0.95 and that were genotyped in at least 50% of samples (i.e., population-scale filters), using bcftools filter (version 1.13). The full SNP and indel calling pipeline is available at
In order to alleviate some of the challenges inherent to SV detection (e.g., low precision), we proposed an ensemble approach where SVs were first called independently with three separate tools, then merged across tools in order to obtain a union callset, which we filtered for calls supported by at least two callers. We assumed that SVs confidently called by multiple tools are more likely to be true positives than SVs called by a single tool.
The three callers used in combination were chosen based on reported performance in previous studies and benchmarks (Cameron et al., 2019; Kosugi et al., 2019; Mérot et al., 2023; Stenløkk, 2023). Each caller was provided with the same 60 bam files, as well as the reference genome used for mapping reads. We first ran DELLY (version 1.1.6; Rausch et al., 2012) following guidelines for germline calling in high coverage genomes (
The three SV sets were then merged together using Jasmine version 1.1.5 (Kirsche et al., 2023), which integrates various information including chromosome, position, end, size and type to determine whether SV calls from different files or samples refer to the same SV or not. We ran Jasmine with parameters “--mutual_distance --max_dist_linear = 0.25”, so that the maximum allowed distance required between two SVs for them to be merged is correlated with their size. The merged VCF was then edited with a custom R script (R version 4.1.2; R Core Team, 2021) in order to convert symbolic alternate alleles to explicit sequences and to standardize VCF fields. The formatted merged VCF was finally filtered for calls supported by at least two callers, with bcftools filter. Moreover, in accordance with the most prevalent definition of SVs (Feuk, Carson, & Scherer, 2006), variants smaller than 50 bp were considered as small indels instead of SVs and were therefore filtered out. This first set of merged SVs will be referred to as the short-read SV set (SR SVs). The detailed short-read SV calling pipeline can be found at
The SV calling procedure for Oxford Nanopore data is equivalent to the pipeline described above for SV detection from short reads, i.e., independent SV detection with three different tools, merging of SV calls across callers, and filtering for calls supported by at least two tools. However, since most long-read-based SV callers do not support multisample calling, SVs were first called separately for each sample using all three chosen tools, then merged across samples (across-sample merge) to obtain a single VCF per caller. The three multisample VCFs were then merged together (across-caller merge) to obtain the long-read SV set (LR SVs). The pipeline is available at
We ran Sniffles 2.0.7 (Sedlazeck, Rescheneder, et al., 2018; Smolka et al., 2022) (default settings and “--output-rnames --combine-consensus” options) on each sample and filtered for PASS and PRECISE calls. We then refined alternate allele sequences and breakpoints for insertions, deletions and some duplications by running Iris (Kirsche et al., 2023): we first preprocessed each sample's VCF with Jasmine (“--dup_to_ins --preprocess_only”), and ran Iris with parameters “--keep_long_variants --also_deletions”. The four samples' refined VCFs were merged together using Jasmine --ignore_strand --mutual_distance --allow_intrasample --output_genotypes, and refined SVs were then converted back to their original type with Jasmine --dup_to_ins --postprocess_only. The multisample Sniffles VCF was finally filtered again for PASS and PRECISE insertions, deletions, duplications and inversions. SVs were also called with SVIM 2.0.0 (Heller & Vingron, 2019) using parameters “--insertion_sequences --read_names --max_consensus_length=50000 --interspersed_duplications_as_insertions”, following the same procedure as Sniffles to produce a multisample VCF with PASS calls. Last, we used NanoVar 1.4.1 (Tham et al., 2020) with default settings. Supporting reads' names were added manually using a custom R script in order to allow for the refinement of SV breakpoints by Iris. The three multisample VCFs were finally merged together with Jasmine, formatted with custom R scripts and filtered, as described for the SR SV set.
Combination ofA final merging step was performed for combining the short-read and long-read SV sets using Jasmine with parameters “--ignore_strand --ignore_merged_inputs --normalize_type --output_genotypes”, resulting in a large union set of candidate SVs to be genotyped in the 60 Atlantic salmon genomes (
We implemented a graph-based genotyping pipeline (
In an effort to better link the genomic context and the confidence in SV calling, we compared the frequency distributions of both high-quality SVs that passed all filtering steps and low-quality, filtered out SVs in two genomic features known to interfere with SV calling: highly similar regions resulting from whole-genome duplication events (e.g., syntenic regions) and repeated content (repeats and transposable elements). To identify syntenic regions, we followed the steps described by Dallaire et al. (2023): in summary, we aligned the genome to itself with nucmer (built-in mapper in MUMmer version 4.0.0; Marçais et al., 2018), then performed synteny analysis with SyMAP (Soderlund et al., 2011), and re-mapped syntenic blocks to the genome with LASTZ (version 1.04.15; Harris, 2007) to get the homology percentage. We identified repeats and transposable elements using RepeatMasker (version 4.0.8; Smit et al., 2013). To distinguish between probable false positive SVs and probable true positive SVs, we relied on the filtering criteria we applied on the sample level (read depth and genotype quality) and on the population level (on minor allele frequency and missing data proportion) during the genotyping procedure. Using bedtools window, we then extracted excluded and filtered SVs overlapping with either a syntenic region or a repeat region (within a 100-bp window), or both.
In addition, because information on putative SVs is lost at the genotyping step, we applied the procedure described in Methods S1 to match genotyped SVs with a known putative SV based on the correspondence of position and allele length, in order to retrieve information on variant type, length and platform support (e.g., short- and/or long-read). This information allowed us to perform additional analyses on the set of genotyped and matched SVs. Indeed, in order to see if long-read SVs could be reliably genotyped from short-read data and a pangenome, we examined the concordance between the genotypes called by vg for long-read SVs and the genotypes called by long-read-based SV callers prior to merging datasets across platforms. First, for the four samples sequenced with both short- and long-read platforms, we extracted the three genotypes (from Sniffles, SVIM and NanoVar) for each long-read SV. We then determined the consensus genotype when possible, e.g., if at least two callers called the same genotype in a given sample. Each SV was labelled as concordant when its consensus genotype matched the corresponding genotype outputted by vg, or as non-concordant if its consensus genotype differed from the vg genotype. Alternatively, when all three callers outputted different genotypes for a given SV in a given sample, no consensus genotype could be inferred, and therefore the concordance between callers and vg could not be determined. We also performed this procedure for short-read SVs for comparison purposes. This concordance analysis is detailed in the scripts compare_GTs_LR_vs_vg.sh and compare_GTs_LR_vs_vg.R from the genotype_SVs_SRLR pipeline.
Population genomics analyses Differentiation between the Romaine and Puyjalon populationsWe used ANGSD version 0.937 (Korneliussen et al., 2014) for performing various population and comparative genomics analyses on SVs, SNPs and small indels separately by adapting a previous pipeline designed for SNPs (
We employed two complementary approaches for identifying candidate variants likely involved in local adaptation (Figure 1c). We first extracted the most highly differentiated variants falling within the upper 97% per-variant FST quantile (FST outliers). We also performed Fisher's exact tests on per-population allelic counts at each site and identified outliers with a corrected p-value (q-value) lower than 0.01 (Benjamini & Hochberg, 1995). We then extracted common outliers between FST and Fisher exact tests to yield a set of strongly differentiated variants used for further analysis.
Second, we ran a redundancy analysis (RDA) on the imputed genotype matrix, with the population as the only explanatory variable using the R package vegan (Oksanen et al., 2022). While FST and Fisher's exact test are more likely to detect outlier loci of large effect, RDA allows identifying covarying markers with individually weak effect that may be involved in polygenic control of phenotypic expression (Forester, Lasky, et al., 2018; Rellstab et al., 2015), as previously documented for life history traits such as age at sexual maturity or growth rate (Debes et al., 2021; Sinclair-Waters et al., 2020). We defined RDA candidates as variants with loadings falling over the three standard deviations threshold (Forester, Laporte & Manel, 2018). We thus obtained a set of outlier variants and a set of candidate variants for each of the three variant types studied.
Functional analysis of candidate genomic variationIn order to assess the potential functional impact of candidate variants on life-history trait variation observed in Romaine salmon, we investigated the overlap between variants of interest and known genes. We first annotated the Ssal_Brian_v1.0 assembly using the pipeline GAWN v0.3.5 (
For each set of variants of interest, we ran bcftools window (version 2.30.0; Quinlan & Hall, 2010) to identify a set of overlapping genes located within 10 kb of at least one variant. We then performed Gene Ontology (GO) enrichment analysis on each gene set with goatools 1.2.3 (Klopfenstein et al., 2018), using the list of 36,697 of genes from GAWN as the background (population) set and the go-basic database version 1.2 (2022-07-01;
SVs were identified through our multistep calling procedure involving both short- and long-read data, where different callers and datasets showed high variability in the number, types and sizes of SVs detected. Indeed, short reads revealed mostly deletions, whereas long reads allowed the detection of many more SVs, especially deletions, insertions and duplications. Among short-read-based callers, Manta reported the most SVs of various types and sizes (151,103), while smoove called the least (28,164), almost exclusively deletions (Table 1; Figure S1). In total, 238,492 SV calls were merged across the three callers, of which only 15.5% (37,041) were shared by at least two tools and were longer than 50 bp: this short-read SV set primarily consisted of deletions (34,761) smaller than 100 bp, and very few duplications (318) and insertions (849) (Table 2; Figure S2).
TABLE 1 Number of SVs reported by each short-read-based caller, and number of SV calls merged across these callers.
Abbreviations: DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions.
TABLE 2 Number of SVs in the filtered short-read SV set (SR SVs), obtained by merging calls across the three short-reads-based callers, then filtering for a minimum of two supporting tools and a minimum length of 50 bp.
Abbreviations: DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions.
For the long-read pipeline, SVIM called over 3.5 million SVs, mostly insertions and deletions, more than the two other callers combined (Table 3; Figure S3). The NanoVar callset was the smallest (454,697 SVs) but featured the largest proportions of duplications (31.8%) and inversions (16.4%) (Table 3). The proportion of all merged long-read SV calls (3,832,032) supported by multiple tools was 12.5%, smaller than for the merged short-read SV set. The 345,695 long-read SVs supported by at least two tools and longer than 50 bp retained for subsequent steps were mostly deletions (57.7%) and insertions (40.0%) and included only 208 inversions (Table 4; Figure S4).
TABLE 3 Number of SVs reported by each long-read-based caller, and number of SV calls merged across these callers.
Abbreviations: DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions.
TABLE 4 Number of SVs in the filtered long-read SV set (LR SVs), obtained by merging calls across the three long-read-based callers, then filtering for a minimum of two supporting tools and a minimum length of 50 bp.
Abbreviations: DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions.
Merging both short- and long-read SVs yielded a set of 361,107 putative SVs, mainly deletions (59.1%) and insertions (38.3%) (Table 5 & Figure 2). The vast majority of these merged SVs, i.e., 89.7%, were exclusively called from long reads, including almost all (99.4%) of the 138,404 insertions identified (Table S2). Similarly, 96.0% of duplications and 83.7% of deletions were uniquely detected from long reads. By contrast, only 15,412 SVs, or 4.3% of the merged SV set (Table 5), were unique to the short-read SV callset, including nonetheless 83.3% of all inversions identified (Table S2). Moreover, the 21,629 SVs called from both short and long reads accounted for 5.9% of all merged SVs (Table 5), meaning that 58% of all short-read SVs were also supported by long-read data.
TABLE 5 Merged SV count by type and sequencing platform (SR: short-read; LR: long-read; SR + LR: short- and long-read). These represent putative SVs in the genomes of Romaine and Puyjalon salmon, prior to genotyping and filtering.
Abbreviations: DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions.
FIGURE 2. Merged SV count by SV type, length and sequencing platform (SR: short-read; LR: long-read; SR + LR: short- and long-read). These represent putative SVs in the genomes of Romaine and Puyjalon salmon, prior to genotyping (DELs, deletions; DUPs, duplications; INSs, insertions; INVs, inversions).
The merged SVs were represented in a variant-aware genome graph on which we mapped short-read data to genotype SVs in all 60 individuals. On average, 333,031 SVs were genotyped in each sample using the graph-based pipeline, for a total of 344,468 distinct SVs. As expected owing to the complex nature of SVs, the average proportion of missing data per site was over four times larger than for raw SNPs (Table S3). 40.8% of raw genotyped SVs did not meet the minimum coverage required in at least half of samples, while two thirds had a minor allele frequency under 5% (Table S4). Consequently, filtering on both proportion of missing data and minor allele frequency considerably reduced the SV set to 115,907 high-confidence variants (Table 6), or about 33.6% of the 344,468 raw genotyped SVs. These 115,907 SVs were used for subsequent population genomics analyses.
TABLE 6 Number of filtered SVs, SNPs and small indels used for population genomics analyses along with summary statistics for each variant set.
Variant type | Number of variants | Genome base pairs covered | F ST | |||
Number | Proportion | Maximum per-window proportion | Genome-wide weighted average | Per-variant maximum | ||
SVs | 115,907 | 41,858,002 | 0.0168 | 0.941 | 0.044 | 0.702 |
SNPs | 8,777,832 | 8,777,832 | 0.0035 | 0.029 | 0.065 | 0.981 |
Indels | 1,089,321 | 10,316,768 | 0.0041 | 0.102 | 0.079 | 0.949 |
Note: These variants met the population-level filters, i.e., had a minor allele frequency between 0.05 and 0.95 and were genotyped in at least 50% of the 60 samples. The proportion of base pairs covered by variants of a given type was estimated for the whole genome and by 100-kb windows.
While SNPs were the most frequent type of variants, SVs impacted a much higher proportion of genome base pairs, with large heterogeneity along the genome. Indeed, in addition to the final 115,907 SVs, we identified 8,777,832 SNPs and 1,089,321 short indels (Table 6) that met the same filters on minor allele frequency and proportion of missing data. SVs added up to over 41.8 Mb (including insertions), or about 1.68% of total genome length, which was approximately 4.8 times more than SNPs (total length: 8.7 Mb; Table 6) and four times more than small indels (total length: 10.3 Mb; Table 6). Similarly, the proportion of base pairs covered by a given variant type per 100-kb window was much more variable for SVs, reaching as much as 94.0% for some regions (e.g., around a 94.1-kb deletion on chromosome ssa09), whereas the maximum observed proportion of base pairs covered by SNPs was only 2.9% and 10.2% for indels (Table 6 & Figure 3).
FIGURE 3. Proportion of base pairs covered by filtered SVs (top), SNPs (middle) and short indels (bottom) per 100-kb windows along the genome. Each vertical panel represents one chromosome.
If we consider the occurrence of variants instead of the number of variable bases, there was no considerable difference in variant density along the genome, as SVs, SNPs and small indels all tended to be more frequent towards the extremities of chromosomes (Figure S5). The gap observed in the number of SVs by 100-kb window (SV density) on chromosome ssa10 could be attributable to a large 2.5-Mb deletion called from short reads, but not successfully genotyped by graphs (Figure S5); this gap was likely not as obvious with SNPs and indels, since numerous markers (over 61,000 SNPs and 8000 indels) could still be genotyped in samples that did not have this putative deletion.
All genetic variants underlay a congruent population structureDespite differences in amount and proportion of genome base pairs covered, SVs, SNPs and small indels displayed a consistent population structure and genetic differentiation between the Romaine and Puyjalon populations. Principal component analysis revealed an important differentiation between the two populations, with individual salmon clustering by river along PC1 while PC2 explained variation within Puyjalon samples (Figure 4). This pattern was strongly conserved across all variant types and confirmed anticipated population structure from a previous study (Albert & Bernatchez, 2006). Average genome-wide weighted FST values ranged from 0.044 for SVs to 0.065 for SNPs and 0.079 for small indels (Table 6). This observation, along with the strong clustering of samples in principal component analysis, indicates moderate levels of differentiation between Romaine and Puyjalon samples.
FIGURE 4. Population structure of Romaine (RO) and Puyjalon (PU) salmon based on principal component analysis from filtered and genotyped (a) SVs, (b) SNPs and (c) short indels. Each point represents one of the 60 salmon sampled for the study.
Differentiation along the genome was also highly variable, as regions of strong per-window FST (>0.2) were numerous and dispersed on all chromosomes (Figure 5). These high FST peaks were overall consistent across SVs, SNPs and short indels. However, the correlation was the greatest between SNPs and indels (R2 = 0.930, p < 0.001), which shared multiple peaks and tended to have higher FST than SVs. SVs displayed a weaker correlation with SNPs (R2 = 0.612, p < 0.001) and indels (R2 = 0.595, p < 0.001). Only a few per-window FST peaks were unique to SVs (e.g., on chromosomes ssa01q, 10, 29; Figure 5). Per-variant FST distribution also differed between variant types: values ranged from 0 to 0.981 for SNPs, whereas the maximum per-variant FST observed for SVs was 0.702 (Table 6).
FIGURE 5. FST along the genome between Romaine and Puyjalon salmon, estimated by 100-kb windows (with 10-kb steps) from SVs (top), SNPs (middle) and short indels (bottom). Dotted white lines correspond to the weighted mean genome wide FST for each variant type. Each vertical panel represents one chromosome.
Among all filtered variants of each type, we identified those that showed a possible relevance in the putative local adaptation of Romaine and Puyjalon salmon. For each variant type, we reported the most strongly differentiated variants between both populations, as these could include major-effect variants, as well as RDA candidates, which might instead reveal multiple, small-effect loci. We identified 1.62 times more outliers of differentiation than RDA candidates for SVs, whereas SNPs and small indels showed the opposite trend, with 1.59 and 2.23 times more RDA candidates than outliers for SNPs and indels, respectively (Table 7). These outlier and RDA candidate variants did not have more missing data than the others, non-outlier and non-candidate variants (Figure S6), meaning that the apparent differentiation and multilocus signals were not artificially driven by missing genotypes. For each of these six sets of variants of interest, we reported a set of overlapping genes within 10 kb, ranging from only 940 genes for RDA candidate SVs to 15,711 genes for RDA candidate SNPs (Table 8). GO enrichment analysis performed on each of these six gene sets revealed various biological processes, with a redundancy of terms associated with cellular structure and nervous system function. The 1407 genes located near SV outliers were enriched for 108 terms mostly related to cellular adhesion and junction, as well as to synapse organization (Table S5), whereas RDA candidate SV genes were enriched for only 27 GO terms clustered under “chemical synaptic transmission” (Table S6). Many of the GO terms that were enriched for outlier SV genes, such as cell adhesion and junction, synapse organization or developmental processes, were also among the 223 enriched terms for outlier SNP genes (Table S7). A much wider range of biological functions were overrepresented in the 528 GO terms associated with RDA candidate SNP genes, including growth and nervous system development (Table S8). Finally, genes that overlapped with outlier and RDA candidate indels were enriched for fewer GO terms (212 and 292, respectively), but enriched terms themselves were similar to those reported for the outlier and RDA candidate SNP gene sets (Tables S9 and S10). Raw GO enrichment result tables (prior to simplification with REVIGO) can be found in Data S1.
TABLE 7 Outlier and candidate variant sets for each type of polymorphism.
Note: The intersection outlier set corresponds to variants that were among the upper 3% quantile of per site FST and that had a Fisher test q-value lower than the 0.01 threshold, whereas RDA candidate variants were identified through RDA (redundancy analysis) using a threshold of three standard deviations.
TABLE 8 Number of genes overlapping intersection outlier variants and RDA candidate variants for each type of polymorphism, along with the number of significant enriched GO terms for each gene set. Variants and genes were considered as overlapping if they were located within 10 kb of each other, either from their start or their end positions.
Variant type | Outliers of differentiation | RDA candidates | ||||
Variants | Nearby genes | Significant enriched GO terms | Variants | Nearby genes | Significant enriched GO terms | |
SVs | 2079 | 1407 | 108 | 1280 | 940 | 27 |
SNPs | 71,953 | 11,578 | 223 | 114,637 | 15,711 | 528 |
Indels | 6223 | 3285 | 212 | 13,871 | 5872 | 292 |
Structural variants now appear as major components of genetic polymorphism with increasingly recognized implications for phenotype and adaptation, but they are inherently difficult to characterize, especially at the population level. By proposing a hybrid pipeline combining short- and long-read sequencing with graph-based genotyping, we aimed to alleviate some of the challenges hampering the study of SVs in population genomics. Applying this pipeline to a set of 60 Atlantic salmon genomes allowed us to catalog a wide spectrum of polymorphism, from SNPs to multi-kilobase SVs. From this catalog, we described the population structure of salmon populations inhabiting two adjacent tributaries, the Romaine and Puyjalon rivers, estimated the level of differentiation between them, and uncovered putative variants underlying local adaptation.
Multiplatform variant detection and pangenome-based approaches facilitate population-scale analysis ofThe combination of long- and short-read sequencing with graph-based genotyping is a promising, yet incomplete solution to the challenges raised by analyzing SVs at the population scale. Our results indicate that long reads are a highly valuable asset for SV characterization, because the vast majority of SVs were identified from the only four genomes sequenced in Oxford Nanopore long reads, despite high-coverage (16X) paired-end Illumina short-read data being available for all 60 samples. Long-read data was particularly crucial for detecting putative insertions and characterizing their sequence, as expected from previous studies (Ho et al., 2020). Since insertions and deletions are the most prevalent forms of SVs in genomes (Gamazon & Stranger, 2015), the incorporation of third generation sequencing enabled a considerable gain in the amount of genetic variation uncovered. However, elevated costs remain an obstacle for population-scale implementation of long-read sequencing, especially for species with large genomes such as salmonids, with prices ranging from 22 to 200 $ USD per gigabase depending on technology used (De Coster et al., 2021). Here, graph-based genotyping allowed us to exploit short reads to genotype putative SVs in a much higher number of samples whose genomes were only sequenced in short reads. Indeed, about 93.9% of the 115,907 final, filtered and genotyped SVs were initially called from long reads (or from both short and long reads; Table S11) and could be genotyped in more than half of all samples, indicating that even though short reads are suboptimal for SV detection, they are highly relevant for genotyping SVs in populations. The hybrid approach thus offered a major increase in overall SV genotyping power at a much lower cost, which therefore represents a reasonable compromise between cost and efficiency.
This multiplatform approach nevertheless does not address all issues pertaining to SV characterization. First, merging SV calls across tools relies on arbitrary thresholds and may be suboptimal, depending on the precision of SV characterization procedures. We made the arbitrary decision to retain only SVs that were supported by at least two callers, which might be an overly conservative filtering approach. Given the high false discovery rates previously reported for SV calling in the Atlantic salmon genome (Bertolotti et al., 2020), we chose to attempt improving precision over sensitivity. However, multitool calling is not guaranteed to improve precision nor sensitivity, depending on datasets and callers used in combination, and callers relying on the same evidence types are likely to call the same false positives (Chaisson et al., 2019; Kosugi et al., 2019; Mahmoud et al., 2019).
In addition, despite using Jasmine, an up-to-date graph-based minimum spanning forest algorithm that integrates numerous SV parameters, a surprisingly small number of all SVs called by multiple tools could be merged together. This is especially true for merged long-read SVs, for which the proportion of shared calls was even smaller than in the merged short-read SV set. We suggest that the lower basecalling accuracy inherent to third generation sequencing data might increase breakpoint imprecision (Lemay et al., 2022) and lead to undermerging of shared variants, especially for those that could not be refined by Iris. For example, long-read-based callers each reported a few thousand inversions, but only 208 were successfully merged across Sniffles and NanoVar. This likely results from imprecision of inversion breakpoints (Sudmant et al., 2015), as the breakpoints refinement tool (Iris) cannot process inversions at this time, or from differences in how inversions are identified or reported by different tools. All these issues highlight the need for more efficient validation methods for large SV datasets. Since visualization remains the most robust SV validation method (Spies et al., 2015), emerging machine learning-based and/or automated methods show promising applications for this purpose, such as samplot-ML (Belyeu et al., 2021), MAVIS (Reisle et al., 2019) and DeepSVFilter (Liu et al., 2020), but also for SV detection and genotyping (Cue; Popic et al., 2023). We can expect that such tools, which currently primarily target short-read data, will eventually allow automated long-read SV validation as well, thus further improving SV analysis in the upcoming years.
Although graph-based genotyping has been essential for genotyping long-read SVs from short-read data, it is not immune to bias in SV characterization. We explicitly treated genotypes with insufficient read support (or genotype quality) as missing data, and up to 40% of SVs were filtered out due to missing genotypes in at least half of the samples, meaning that very few reads could be mapped to these SV regions in the pangenome. Since long-read SVs had a consistently higher proportion of missing genotypes than short-read SVs, both before and after filtering on genotype quality, depth and minor allele frequency (Figure S7), we speculate that some short reads still cannot be accurately mapped to certain SV regions where long reads could be confidently aligned. This might also explain the lower concordance between the genotypes outputted by vg and the genotypes provided by caller genotypes for candidate long-read SVs than for candidate short-read SVs (Table S12). As stated above, the higher breakpoint imprecision for long-read SVs might also increase noise around SV positions and therefore contribute to the poorer mapping of short reads to the graph. Various features of the Atlantic salmon genome are known to promote spurious mapping of reads to the linear reference genome, such as residual tetrasomy (10 to 20%; Houston & Macqueen, 2019), highly similar duplicated regions (81–89%; Davidson et al., 2010) and a large proportion of repeats (50–60%; de Boer et al., 2007). We can expect that these features also impact pangenome-based mapping and genotyping to some extent, especially repeats (Chen et al., 2019; Outten & Warren, 2021). Indeed, low-confidence SVs that were filtered out after genotyping were more prevalent in regions with repeated contents (transposable elements and repeats) and/or in syntenic regions with elevated levels of homology following a past whole-genome duplication event in salmonids (Figure S8). Second, very large putative SVs spanning considerable chromosomal regions, such as the 2.5-Mb deletion on chromosome ssa10, could not be successfully genotyped using graphs, likely because such large rearrangements cannot be reliably represented by complex graph structures (Hübner, 2022). This limitation is particularly problematic for the study of SVs in the context of population genomics, as larger rearrangements were found to play a key role in adaptive processes (Wellenreuther et al., 2019; Wellenreuther & Bernatchez, 2018). Alternatively, some of the very large candidate SVs that were not successfully genotyped could have been false positives, as over half of putative SVs larger than 30 kb were inversions and deletions exclusively supported by short-read callers (Table S13). Our study would therefore benefit from the addition of complementary approaches, such as assembly comparison or chromatin conformation data, to identify, validate and genotype large SVs (Mérot et al., 2020), hence further expanding the range of SVs identified. Despite these limitations, the multiplatform strategy developed for this study represents a considerable improvement over short-read-only approaches, and the incorporation of novel automated curation approaches will undoubtedly lead to further advancements in population-scale SV characterization.
Our findings showed that the contribution of SVs to standing genetic polymorphism is important in Atlantic salmon. High-confidence, genotyped SVs accounted for 4.8 times more genome base pairs than SNPs. This proportion is in the same order of magnitude as previously estimated using an equivalent approach in lake whitefish, a closely related species (e.g., five times; Mérot et al., 2023). The number of SVs identified in our study is over seven times larger than previously documented in rainbow trout (almost 14,000 SVs; Liu et al., 2021) and in Atlantic salmon (over 15,000 SVs; Bertolotti et al., 2020) in previous studies involving more samples, but relying only on short-read data. Similarly, we reported between 20 and 30 SVs per 100-kb window, whereas the median per-megabase SV count reported by Bertolotti et al. (2020) is under 10. SV counts described in our study might be inflated due to a certain number of false positives in our dataset, since we did not exclude calls located in problematic regions (e.g., high coverage regions, assembly gaps and low complexity regions) nor performed manual curation of SVs. However, such an important difference in SV count can most likely be explained by the integration of long-read sequencing data. Indeed, in the human genome, over six times more high-confidence SVs were identified from long reads (27,662 SVs; Chaisson et al., 2019) than from short reads in another study (4442 SVs; Abel et al., 2020).
SVs also appear to reliably capture population structure and differentiation to the same extent as SNPs. The very high correspondence of population structure inferred from PCA across variant types was also observed in previous studies of SVs in soybean (Lemay et al., 2022), in cacao (Theobroma cacao; Hämälä et al., 2021), in grapevine (Vitis vinifera ssp. Sativa; Zhou et al., 2019), lake whitefish (Mérot et al., 2023) and Corvus genus species (Weissensteiner et al., 2020). Patterns of fluctuations in per-window variant density and FST along the genome were also strongly conserved, e.g., regions of high SV density were usually dense in SNPs and short indels as well. We reported a quick linkage disequilibrium decay in all pairs of variants, with minimal linkage between SNPs and SVs for distances greater than 250 bp (Figure S9). This suggests that the observed correspondence between all three types of variants might not be attributable to strong physical linkage between them, but rather that SVs, SNPs and short indels may be subject to similar evolutionary processes in the Romaine and Puyjalon system, despite them being very different in size. Per-variant and per-window FST was usually slightly lower for SVs than for SNPs and small indels, which was also reported in the lake whitefish study (Mérot et al., 2023). We suspect that this slight discrepancy is attributable to the fact that the FST calculation relies on fewer markers for SVs, thus introducing more noise in FST estimates than with SNPs and small indels.
By contrast, in American lobster, a non-related and less structured marine species (Benestan et al., 2015; Kenchington et al., 2009), copy number variants harbored stronger interpopulation differentiation than SNPs and a more defined population structure, correlated with environmental variables (Dorant et al., 2020). Similarly, deletions showed stronger spatial population structure and were under stronger selection than duplications in human populations (Sudmant, Mallick, et al., 2015). In European starling (Sturnus vulgaris), SVs and SNPs revealed different patterns of population structure, interpopulation genetic diversity and divergence across the genome (Stuart et al., 2023). Consequently, we cannot assume that SVs, SNPs and small indels are interchangeable and equally informative in all systems and species as we observed in the Romaine and Puyjalon system. We therefore argue that we ought to characterize SVs in population genomics studies to the same extent as SNPs, as they may display different signatures and provide relevant insights on evolutionary and adaptive processes shaping population structure.
Genetic divergence between the Romaine and Puyjalon populations is likely driven by divergent selectionGiven their spatial proximity and habitat overlap, gene flow is expected to occur between the Romaine and Puyjalon populations. The moderate average genome-wide FST values estimated from the different classes of variants, as well as a previous estimation based on microsatellites (Albert & Bernatchez, 2006), are consistent with a rate of two to seven migrants per generation, based on Wright's approximation (1984). However, we reported numerous outliers of differentiation and peaks of very strong FST dispersed throughout the genome that are seemingly resisting such gene flow. The persistence of pronounced divergence in these regions could be explained by a few alternative and not mutually exclusive mechanisms. Genetic drift could lead to random differences in variant allelic frequency in both populations. Alternatively, given that recombination rates are known to differ within species and even within populations (Kong et al., 2010; Ritz et al., 2017), some localized regions of low recombination could have emerged independently in both populations, capturing different alleles and being subject to increased genetic drift as a result of apparent reduced effective population size (Ne). Such “differentiation islands” can result from the interplay of variation in recombination rate due to genetic architecture (e.g., the presence of SVs or large rearrangements) and natural selection (Wolf & Ellegren, 2017). Finally, variants with a functional impact could be subject to divergent selection between habitats leading to differences in allelic frequencies.
Our findings tend to support the hypothesis of local adaptation. First, we reported a repeated enrichment for GO terms related to nervous system function for genes nearby outlier and RDA candidate variants, regardless of variant type. We initially expected enrichment for functions related to other observed phenotypic trait variation in the Romaine and Puyjalon system, such as growth, sexual maturation and reproduction. On the contrary, we observed enrichment mainly related to nervous functions. We hypothesize that enrichment for nervous functions could be linked to variation in these traits through their link with age at smoltification that differ between these two populations. Indeed, changes in photoperiod, which are recognized as the main factor triggering smoltification (Hoar, 1988), are perceived and processed through the light-brain-pituitary axis, inducing the hormonal cascade responsible for physiological, morphological and behavioral changes underlying smolt-to-parr transition (Stefansson et al., 2008). Smoltification itself causes reorganization of nervous connections, both at the structural and the chemical level (Ebbesson et al., 2003). Although empirical evidence is required to support this hypothesis, polymorphism around genes involved in nervous system function, development and plasticity could alter the expression or function of these genes and lead to physiological differences underlying variation in age at smoltification and other relevant life history traits in Romaine and Puyjalon salmon. This indirect, but plausible link between observed phenotypic variation and genetic polymorphism does not support the hypothesis of persistent genetic differentiation due to genetic drift alone. Moreover, peaks of differentiation are unlikely a result of low recombination alone because such peaks and outliers of differentiation are dispersed throughout chromosomes and across the whole genome, and not clustered into contiguous and localized regions. Along with preliminary knowledge of the Romaine and Puyjalon system, our results suggest that the persistence of localized regions of strong differentiation could at least partly be attributable to local adaptation in response to divergent selection. Indeed, both rivers differ in habitat quality, substrate and hydrological profiles (Schieffer,1975; Fontaine et al., 2000; GENIVAR, 2002; Belles-Isles et al., 2004; WSP Global, 2019), which may impose different constraints on salmon and thus favor alternate life strategies in both populations.
Besides showing variation in age at smoltification, Romaine and Puyjalon salmon also differ in regards to age at sexual maturity in a controlled hatchery environment (T. Dion, Chayer, et al., 2020; T. Dion, Langlois-Parisé, & Proulx, 2020; Langlois-Parisé et al., 2018; Therrien et al., 2017). Interestingly, we found no variant of interest overlapping with major-effect loci previously associated with life history variation in age at maturity in wild and domesticated European salmon populations, such as vgll3 (Ayllon et al., 2015; Barson et al., 2015; Czorlich et al., 2018) and six6 (Sinclair-Waters et al., 2020; Waters et al., 2021). While one study in North America highlighted a correlation between vgll3 polymorphism, sea age and sex in the Trinité river population (Kusche et al., 2017), located in the same geographic region as the Romaine and Puyjalon rivers, other studies did not reveal a significant association between polymorphism in these major-effect loci and age at maturity in other North American populations (Boulding et al., 2019; Mohamed et al., 2019). The genetic architecture of such complex life history traits is possibly variable across populations, especially between highly divergent populations from different continents. In addition, age at maturity was found to have a mixed genetic architecture in both North American-derived farmed salmon (Eisbrenner et al., 2014; Mohamed et al., 2019) and European-origin salmon (Sinclair-Waters et al., 2020), involving both major-effect loci and multiple small-effect loci. Since we identified numerous candidate small-effect variants through RDA, we propose that phenotype variation in age at maturity in Romaine and Puyjalon salmon might have a polygenic basis as well.
Further work is required to understand the genetic architecture of major life history trait variation in Atlantic salmon populations as well as other salmonid species. Such work would considerably benefit from an improved knowledge of the full spectrum of genetic variation segregating in populations, especially SVs. The pipelines developed and optimized for this study may therefore contribute this knowledge by facilitating population-scale characterization of SV, as well as serve as a basis for further refinement of variant calling and genotyping procedures in the near future. With ongoing and rapid developments in computational genomic approaches, such as pangenome-based tools or machine-learning-based variant detection and validation, SV analysis is bound to take a significant leap towards robust and reliable characterization in the upcoming years, which will foster their inclusion in evolutionary genomics.
ACKNOWLEDGMENTSWe are grateful to the staff of Laboratoire de Recherche en Sciences Aquatiques (LARSA) for fish maintenance and tissue sampling during this project. We also want to thank Éric Normandeau for bioinformatic support, Marc-André Lemay for advice on SV calling and genotyping, and research professionals who assisted with sampling and DNA extraction. We also thank the associate editor and three reviewers for their constructive and relevant comments on the manuscript. This project was funded by Société Saumon de la Rivière Romaine, Hydro-Québec, Ressources Aquatiques Québec and by a Collaborative Research and Development grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) to Louis Bernatchez. Laurie Lecomte was supported by a Canada Graduate Scholarship for Master's grant (NSERC) and a master's training scholarship from the Fonds de recherche du Québec – Nature and technologies (FRQNT). We dedicate this article to the memory of Louis Bernatchez, principal investigator in this project, who passed away during the revision process.
CONFLICT OF INTEREST STATEMENTAll authors have no conflicts of interest to disclose.
Dr. Louis Bernatchez is an Editorial Board member of Evolutionary Applications and a co-author of this article. To minimize bias, they were excluded from all editorial decision-making related to the acceptance of this article for publication.
DATA AVAILABILITY STATEMENTRaw sequencing data used for this study is available on NCBI's Sequence Read Archive (SRA) (BioProject accession PRJNA1066228;
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Genomic structural variants (SVs) are now recognized as an integral component of intraspecific polymorphism and are known to contribute to evolutionary processes in various organisms. However, they are inherently difficult to detect and genotype from readily available short-read sequencing data, and therefore remain poorly documented in wild populations. Salmonid species displaying strong interpopulation variability in both life history traits and habitat characteristics, such as Atlantic salmon (
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details


1 Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada; Département de Biologie, Université Laval, Québec, Canada
2 Department of Animal and Aquacultural Sciences (IHA), Faculty of Life Sciences (BIOVIT), Centre for Integrative Genetics (CIGENE), Norwegian University of Life Sciences (NMBU), Ås, Norway