Investigating structural variant, indel and

INTRODUCTION

Differences in DNA sequence and structure among individuals within species, referred to as genetic variation, serve as the basis for key evolutionary mechanisms such as speciation and local adaptation (Barrett & Schluter, 2008). Genetic variation can be described as a wide spectrum of variants of various sizes, ranging from single nucleotide polymorphisms (SNPs) to larger structural variants (SVs), which may span megabase-long stretches of DNA or even whole chromosomes (Feuk, Carson, & Scherer, 2006; Mérot et al., 2020; Wellenreuther & Bernatchez, 2018). SVs such as insertions, deletions, duplications and inversions are now recognized as the main component of genetic variation, as they affect at least two to eight times more bases in genomes than SNPs (Catanach et al., 2019; Hämälä et al., 2021; Mérot et al., 2020). This estimate tends to increase as our ability to detect SVs from high-throughput sequencing data is constantly improving (Ho et al., 2020; Mérot et al., 2020).

SVs are also known to have a broad range of consequences at various biological levels. At the molecular scale, they may influence gene dosage, gene expression, DNA interactions and tridimensional structure by altering genetic elements' proximity and copy number (Feuk, Marshall, et al., 2006; Gamazon & Stranger, 2015; Spielmann et al., 2018). SVs that disrupt collinearity between homologous chromosomes, especially large inversions, are also likely to restrict or suppress recombination (Crown et al., 2018; Rowan et al., 2019). This may result in an apparent reduced gene flow around SVs, which may link co-adapted alleles, thus promoting the formation of supergenes that may underlie complex and adaptive phenotypes (Kirkpatrick & Barton, 2006; Rieseberg, 2001; Thompson & Jiggins, 2014).

A growing body of evidence also suggests that SVs can be involved in evolutionary mechanisms in various species (reviewed in Wellenreuther & Bernatchez, 2018). For instance, supergenes arising from large inversions have been linked to adaptive variation in wing color patterns in Heliconius butterflies (Joron et al., 2011), to migratory behavior in rainbow trout (Oncorhynchus mykiss; Pearse et al., 2014; Pearse et al., 2019) and Atlantic cod (Gadus morhua; Kirubakaran et al., 2016; Berg et al., 2017), as well as to reproductive strategies in the ruff (Philomachus pugnax; Küpper et al., 2016) and the white-throated sparrow (Zonotrichia albicollis; Tuttle, 2003). Other key examples of inversion polymorphism involved in ecotype divergence and local adaptation have been documented in the seaweed fly (Coelopa frigida; Mérot et al., 2018) and in three-spined stickleback (Gasterosteus aculeatus; Jones et al., 2012). Besides large inversions, copy number variants (CNVs) have also been linked to adaptation to local temperature regimes in American lobster (Homarus americanus; Dorant et al., 2020) and to glacial lineage divergence in capelin (Mallotus villosus; Cayuela et al., 2021). Industrial melanism in peppered moths (Biston betularia), a textbook example of rapid adaptation to environmental change, has been associated with an intronic insertion in the cortex gene (Van't Hof et al., 2016). Similarly, a 2.25-kb intronic insertion is thought to explain color pattern divergence among lineages in the Corvus genus, promoting reproductive isolation and thus leading to speciation (Weissensteiner et al., 2020).

Despite such well-documented cases of adaptive genomic rearrangements, most SVs other than large inversions remain understudied in a population genomics context. Relative to SNPs, very little is known about how such a large component of genetic variation is distributed within and between wild populations. Indeed, while standard procedures and pipelines are available for population-scale SNP calling, SV detection and genotyping, on the other hand, involve significant challenges for large, multisample datasets (Ho et al., 2020; Mahmoud et al., 2019).

Calling SVs requires specialized software due to their complexity and diversity in type and length. SV callers rely on various signals of discordance in read mapping relative to a reference genome in order to infer SVs in a given sample (Lin et al., 2015; Mahmoud et al., 2019). Because short-read sequencing is widely available and affordable, it is an appropriate technology for population-scale study of genetic variation. However, the performance of short-read-based SV callers is highly variable (Cameron et al., 2019). They are known to lack sensitivity, as true positive detection rates can be as low as 10% (Huddleston et al., 2017; Sedlazeck, Rescheneder, et al., 2018). They also show low precision, with false discovery rates reaching 89% for some datasets (Mills et al., 2011), especially for calls near SNPs, indels, low-complexity regions and repeats (Cameron et al., 2019). In fact, short reads are hard to map to the reference genome owing to their short length, especially when they include numerous sequencing errors, repeats (Sedlazeck, Lee, et al., 2018) or sequences differing considerably from the reference, such as SVs. Spurious mapping may result in underreporting of variation, an issue known as reference allele bias (Brandt et al., 2015; Nielsen et al., 2011). Calling SVs in a given dataset using multiple callers (ensemble calling) may increase the range of SV types and sizes detected or reduce the false discovery rate compared to single-tool SV calling. For instance, SV callsets may be merged across callers, then filtered for calls supported by a given minimum number of tools (Auton et al., 2015). However, the improvement in sensitivity and/or precision strongly depends on the callers used in combination (Kosugi et al., 2019; Mahmoud et al., 2019).

By contrast, recent advances in third generation sequencing platforms (Oxford Nanopore and Pacific Biosciences' technologies) have brought significant improvements regarding SV calling. Long reads span kilobase-long segments of DNA, thus fully overlapping SVs and their breakpoints, which considerably facilitates read mapping (Sedlazeck, Lee, et al., 2018) and improves sensitivity of SV detection (Mahmoud et al., 2019), especially for novel insertions (Ho et al., 2020). Specialized algorithms and pipelines have been developed to process long reads and account for their length and higher sequencing error rate (Delahaye & Nicolas, 2021; Rang et al., 2018). However, high costs prevent long-read sequencing from becoming a routine tool for population-scale SV studies in species with large genomes such as salmonid fishes (usually around 3 Gb), because such studies require sequencing many genomes, notably to estimate allele frequencies accurately.

To provide an adequate balance between accurate SV characterization and genotyping in large datasets, emerging hybrid approaches can be considered, such as pairing affordable short-read sequencing for all samples with high performance third generation sequencing for a small subset of genomes only. Candidate SVs called from long reads can then be genotyped in all samples from short-read data using pangenome graphs, which offer considerable advantages over conventional, linear reference-based methods. Indeed, in a reference pangenome, the reference genome is represented as a base graph structure where known variants and alternate alleles are encoded as alternate paths, i.e., series of nodes and links (Paten et al., 2017). The integration of known genetic variation within the reference greatly facilitates mapping of reads that overlap such variants, thus improving both SV detection and genotyping, and reducing reference allele bias (Ameur, 2019). This approach has shown promising results for genome-wide population-scale SV detection in human (Homo sapiens; Yan et al., 2021), soybean (Glycine max; Lemay et al., 2022), lake whitefish (Coregonus clupeaformis; Mérot et al., 2023) and in kākāpō parrots (Strigops habroptilus; Wold et al., 2023).
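As a purely illustrative sketch of the graph representation described above, the Python snippet below encodes a toy reference segment as nodes and links, with a deletion stored as an alternate path. The node identifiers, sequences and path names are hypothetical and are not taken from any real graph or from the vg data model.

```python
# Toy variation graph: three nodes, with one deletion encoded as an alternate path.
# All node IDs, sequences and path names are hypothetical illustrations.
nodes = {1: "ACG", 2: "TTT", 3: "GA"}
edges = {(1, 2), (2, 3), (1, 3)}  # (1, 3) is the link that skips node 2
paths = {"reference": [1, 2, 3], "deletion": [1, 3]}


def spell(path, nodes, edges):
    """Return the sequence spelled by a path, checking that each step follows a link."""
    for left, right in zip(path, path[1:]):
        if (left, right) not in edges:
            raise ValueError(f"no link between nodes {left} and {right}")
    return "".join(nodes[node] for node in path)


reference_allele = spell(paths["reference"], nodes, edges)
deletion_allele = spell(paths["deletion"], nodes, edges)
```

A read carrying the deletion can align end to end along the alternate path instead of being clipped or mismapped against the linear reference, which is the intuition behind the reduction in reference allele bias.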

Knowledge pertaining to SVs remains minimal in salmonid fishes, despite their genomes being extensively studied for aquaculture applications. The first comprehensive catalog of genome-wide SVs for Atlantic salmon (Salmo salar) was produced by Bertolotti et al. (2020) by calling putative SVs using short-read-based caller LUMPY (Layer et al., 2014) in 492 wild and domestic salmon from various populations in Europe and North America. SV calls were then manually curated with SV-plaudit (Belyeu et al., 2018) in order to eliminate false positives, yielding 15,483 high confidence SVs matching the expected population structure. This study also revealed a subset of outlier SVs overlapping genes enriched for brain expression, suggesting an implication in salmon domestication. Other population-scale SV catalogs were published for the rainbow trout (Oncorhynchus mykiss; Liu et al., 2021), and two sympatric sister species of lake whitefish (Coregonus sp.; Mérot et al., 2023).

Further work is required in order to fully appreciate SVs' relevance in the genomics and biology of Atlantic salmon, which could serve as an ideal candidate species for studying adaptive SVs and developing an efficient population-scale SV discovery pipeline. SVs and larger chromosomal rearrangements are likely a key feature of salmonid genomes, as they are critical to the reploidization process following the salmonid-specific fourth vertebrate whole-genome duplication that occurred at least 60 million years ago (Ss4R) (Allendorf & Thorgaard, 1984; Crête-Lafrenière et al., 2012; Lien et al., 2016). Sequence repeats, which account for 50 to 60% of the Atlantic salmon genome (de Boer et al., 2007), are also known to promote SV formation (Levy-Sakin et al., 2019). Moreover, Atlantic salmon display considerable life history trait variation both within and between wild populations (Klemetsen et al., 2003). Consequently, there is considerable interest in understanding the genetic architecture of traits such as growth rate and disease resistance for aquaculture (Gjedrem & Rye, 2018), but also in the context of local adaptation, which usually involves such life history trait variation (Fraser & Bernatchez, 2005; Lu & Bernatchez, 1999; Taylor, 1991). Local adaptation is expected to be a major driver of population structure in Atlantic salmon, given its homing behavior (Allendorf & Waples, 1996) and the variability in habitat conditions (Kawecki & Ebert, 2004; Taylor, 1991). Indeed, the association between the genetic structure of seven groups of local salmon populations in Eastern Canada and regional rivers' environmental parameters suggests adaptive SNP divergence among these groups (Bourret et al., 2013; Dionne et al., 2008). 
Previous studies have also highlighted a few adaptive large chromosomal rearrangements in wild Atlantic salmon populations (Watson et al., 2022; Wellband et al., 2019), as well as divergent SVs between domestic and wild populations (see Bertolotti et al., 2020). However, SVs' contribution to local adaptation remains poorly documented among North American populations, especially at a finer geographic scale (e.g., within neighboring rivers).

Two parapatric Atlantic salmon populations from the Romaine and Puyjalon rivers (Québec, Canada; 50.306337, −63.795602; Figure 1b) represent a prime case of putative fine-scale local adaptation. Indeed, admixture analysis and fixation index calculation (F_ST = 0.036) based on microsatellite markers showed moderate differentiation between Romaine (RO) and Puyjalon (PU) salmon, despite their geographical proximity and habitat connectivity (Albert & Bernatchez, 2006). Furthermore, they exhibit different trade-offs in major life history traits: earlier age at smoltification and sexual maturity have been reported among wild Romaine salmon (Belles-Isles et al., 2004; Fontaine et al., 2000; WSP Global, 2019), as well as in wild-born Romaine salmon reared in a hatchery environment at the LARSA (Laboratoire de Recherche en Sciences Aquatiques; Université Laval, Québec), whereas wild-born Puyjalon salmon have shown higher growth rates over several cohorts in the same hatchery conditions (T. Dion, Chayer, et al., 2020; T. Dion, Langlois-Parisé, & Proulx, 2020; Langlois-Parisé et al., 2018; Therrien et al., 2017). The persistence of such life history trait variation among cohorts in both wild and controlled environments strongly suggests heritable genetic variation likely linked to local adaptation, as the Romaine and Puyjalon rivers differ in spawning habitat quality, substrate and hydrological parameters (Belles-Isles et al., 2004; Fontaine et al., 2000; GENIVAR, 2002; Schieffer, 1975; WSP Global, 2019). However, the genetic basis of this putative local adaptation has yet to be investigated.

FIGURE 1. Overview of (a) polymorphism detection pipelines used for population-scale characterization of structural variants (SVs), single nucleotide polymorphisms (SNPs) and small indels within the genomes of Romaine (RO) and Puyjalon (PU) salmon (SR: short reads; LR: long reads), (b) location of the Romaine and Puyjalon rivers (red dot) in Québec, Canada, and (c) comparative genomics analyses performed on catalogued variants (FST, fixation index; GO, Gene Ontology; PCA, principal component analysis; RDA, redundancy analysis).

Here, we address this lack of knowledge by proposing a multiplatform, graph-based SV discovery pipeline across numerous genomes (Figure 1a) in order to catalog genetic polymorphism in Romaine and Puyjalon salmon, allowing us to investigate candidate adaptive variation within these populations. With this approach, we primarily targeted small (50–1000 bp) to intermediate-sized SVs (<5 kb), as direct SV calling based on short reads and long reads is more accurate and powerful in this range of length (Mahmoud et al., 2019). This study thus served as an unprecedented opportunity to characterize SVs, SNPs and small indels in North American Atlantic salmon, as well as to explore the relative contribution of various forms of genetic variation to fine-scale adaptation and population differentiation.

MATERIALS AND METHODS

Sampling, DNA extraction and sequencing

Short reads

Manipulations involving fish were authorized by the Comité de protection des animaux de l'Université Laval (permit number: 2021–783). Adipose fin clips were sampled from 60 wild-born adult salmon raised as broodstock at Université Laval's Laboratoire de Recherche en Sciences Aquatiques (LARSA) and stored in ethanol until use. The samples comprised 31 Puyjalon (16 males and 15 females) and 29 Romaine (14 males and 15 females) individuals.

Spin column DNA extractions were performed using Qiagen's DNeasy blood and tissue kit according to the manufacturer's protocol, with the exception of the elution step, which was done twice per sample with 50 μL of water. DNA quality was assessed by concentration measurement and migration on 1% agarose gel. DNA samples were then diluted to 10 ng/μL and sent to Génome Québec's Centre d'expertise et de services (Montréal, Canada) for library preparation and whole genome sequencing on an Illumina NovaSeq6000, using four S4 PE150 lanes for an anticipated depth of 16X per sample.

Long reads

Among the 60 fish sampled for whole genome short-read sequencing, four (one male and one female for each population) were used for Nanopore long-read sequencing. In order to provide intact high molecular weight DNA, whole blood was extracted from live fish using EDTA-prefilled syringes, followed by humane euthanasia by decapitation. Blood samples were flash-frozen in liquid nitrogen, transferred to storage tubes and stored at −80°C until use.

High molecular weight DNA extraction was performed twice for each fish using Circulomics' CBB protocol for nucleated blood (EXT-NBH-001; Circulomics, 2021), from 6 μL of blood mixed with 194 μL of ice-cold PBS. DNA quality was assessed by measuring concentration with Qubit and migrating DNA on a 0.5% agarose gel. DNA samples were then sent to the Centre for Integrative Genomics (CIGENE) at the Norwegian University of Life Sciences (NMBU) for sequencing. DNA fragments shorter than 25 kb were removed by size selection with Circulomics' Short Read Eliminator kit, and seven libraries were prepared for each sample using the SQK-LSK110 kit (Oxford Nanopore Technologies).

Sequencing was performed on a PromethION24 in short serial runs following protocol NFL_9076_v109_revA. Each sequencing run was terminated after a few hours, when the proportion of active pores dropped below 10%, in order to recover pores by nuclease-flushing the flow cells, which were then reloaded with the same DNA preparation for the next short sequencing run. Two FLO-PRO002 flow cells were used for each sample and were filled with six and five loadings, respectively, in order to obtain an approximate coverage of 20X. Basecalling was done with Guppy version 5.0.13 (high-accuracy basecalling model) and raw reads were filtered for a minimum qscore of nine. The average yield for the four samples was 47.1 Gb of DNA, while the mean N50 was 39.5 kb.

Characterization of genetic variation

Raw sequencing data preprocessing

Short reads

The Ssal_Brian_v1.0 assembly, derived from a North American wild salmon from Newfoundland (Norwegian University of Life Sciences, 2022; GenBank assembly accession: GCA_923944775.1; project accession: CAKLZZ000000000.1), was used as the reference genome for all downstream analyses. This genome features 28 chromosomes with two known polymorphic rearrangements, i.e., the translocation of chromosome ssa01's p arm (ssa01p) to ssa23 (ssa01-23) (Lehnert et al., 2019), and the fusion of chromosomes ssa26 and ssa28 (Brenna-Hansen et al., 2012).

Raw Illumina data were processed using the wgs_sample_preparation pipeline (https://github.com/enormandeau/wgs_sample_preparation). Adapters and low-quality ends were first trimmed from raw reads by running fastp 0.20.0 (Chen et al., 2018) with default parameters. Trimmed reads were then mapped to the indexed reference genome (samtools faidx command, version 1.8; Danecek et al., 2021) using BWA MEM (Li, 2013), requiring a minimum mapping quality of 10 (-q 10). Duplicate reads were filtered out of the alignment with MarkDuplicates (Picard 1.119; Broad Institute, 2019). After indexing the resulting bam files with Picard BuildBamIndex, mapping was refined around candidate indels using GATK 3.6-0 RealignerTargetCreator and IndelRealigner (McKenna et al., 2010), and overlapping read pairs were clipped to preserve read regions with the highest average quality using bamUtil 1.0.14 clipOverlap (Jun et al., 2015). Finally, we used samtools addreplacerg to add unique read group names to each sample's bam file, which is a requirement for some of the variant calling tools we used.

Long reads

Since each sequencing run produced multiple raw read files, all fastq files obtained for a given sample were first concatenated to yield a single fastq file per sample. Raw reads were filtered for a minimum average quality of 10 and a minimum read length of 1000 bp using NanoFilt 2.0.8 (De Coster et al., 2018). We mapped filtered reads to the Ssal_Brian_v1.0 assembly with Winnowmap version 2.03 (Jain et al., 2020, 2022) using default parameters and a k-mer size of 15 (-k 15). The complete preprocessing pipeline (ONT_data_processing v1.0.0) can be found at https://github.com/LaurieLecomte/ONT_data_processing.
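The read-level filter described above can be sketched in Python as follows. This is a minimal illustration rather than NanoFilt's implementation: the function names are hypothetical, Phred+33 encoding is assumed, and the mean quality is assumed to be computed on the error-probability scale rather than as an arithmetic mean of Phred scores.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged on the error-probability scale (assumption)."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_quality=10, min_length=1000):
    """Apply the two thresholds from the text: mean quality >= 10 and length >= 1000 bp."""
    return len(sequence) >= min_length and mean_phred(quality_string) >= min_quality


# "5" encodes Q20 and "&" encodes Q5 under Phred+33.
good_read = keep_read("A" * 1000, "5" * 1000)
short_read = keep_read("A" * 999, "5" * 999)
mixed_read = keep_read("A" * 1000, "5" * 500 + "&" * 500)
```

Under this assumption, a read that is half Q20 and half Q5 has a mean quality of roughly 7.9 and is discarded, even though the arithmetic mean of its Phred scores (12.5) would pass the threshold.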

SNP and short indel (1–50 bp) calling

SNPs and small indels were called exclusively from short-read data, as higher basecalling error rates in long-read data are likely to interfere with SNP detection (Ahsan et al., 2021; Rang et al., 2018). Variant calling was performed in all 60 samples at once and for each chromosome separately, using bcftools mpileup and call (version 1.16) and requiring a minimum mapping quality of five at a given site (-q 5). The 28 single chromosome VCF files were then concatenated with bcftools concat.

In order to apply the same filtering criteria as for SVs (described below), genotype calls without at least four supporting reads and a minimum genotype quality of five, or with a depth exceeding five times the anticipated whole genome short-read sequencing coverage (i.e., above 80X) or an exceedingly high genotype quality (GQ = 127), were set to missing (“./.”) using bcftools +setGT (version 1.15). Finally, we kept SNPs and small indels that had a minor allele frequency between 0.05 and 0.95 and that were genotyped in at least 50% of samples (i.e., population-scale filters), using bcftools filter (version 1.13). The full SNP and indel calling pipeline is available at https://github.com/LaurieLecomte/SNPs_indels_SR (version v1.0.0).
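The logic of these sample-level and population-scale filters can be sketched in Python as follows. This is not the bcftools implementation: the function names are hypothetical, genotypes are assumed to be diploid and coded as tuples of allele indices, and a call is assumed to be masked as soon as it fails any single criterion.

```python
def mask_genotype(genotype, depth, gq, min_depth=4, min_gq=5, max_depth=80, max_gq=127):
    """Return the genotype, or None (missing) when a sample-level criterion fails."""
    if depth < min_depth or gq < min_gq or depth > max_depth or gq >= max_gq:
        return None
    return genotype


def passes_population_filters(genotypes, min_af=0.05, max_af=0.95, min_call_rate=0.5):
    """Keep a variant genotyped in enough samples with an intermediate allele frequency."""
    called = [gt for gt in genotypes if gt is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    # Alternate allele frequency among called diploid genotypes (assumption: biallelic site).
    alt_frequency = sum(sum(gt) for gt in called) / (2 * len(called))
    return min_af <= alt_frequency <= max_af


common_variant = [(0, 1), (0, 0), None, (1, 1)]
rare_variant = [(0, 0)] * 19 + [(0, 1)]
poorly_genotyped = [(0, 1), None, None, None]
```

Masking individual genotypes before applying the population-scale filters means that the allele frequency and the call rate are both computed from confident calls only.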

Structural variant calling

Short reads

In order to alleviate some of the challenges inherent to SV detection (e.g., low precision), we proposed an ensemble approach where SVs were first called independently with three separate tools, then merged across tools in order to obtain a union callset, which we filtered for calls supported by at least two callers. We assumed that SVs confidently called by multiple tools are more likely to be true positives than SVs called by a single tool.

The three callers used in combination were chosen based on reported performance in previous studies and benchmarks (Cameron et al., 2019; Kosugi et al., 2019; Mérot et al., 2023; Stenløkk, 2023). Each caller was provided with the same 60 bam files, as well as the reference genome used for mapping reads. We first ran DELLY (version 1.1.6; Rausch et al., 2012) following guidelines for germline calling in high coverage genomes (https://github.com/dellytools/delly#germline-sv-calling). Putative SVs were first called separately in all samples and in each of the 28 chromosomes, then merged together in order to obtain a list of known SV sites to be genotyped by DELLY in each sample. Genotyped SVs were then merged across all samples into a unified, multisample VCF. We filtered for deletions (DEL), insertions (INS), duplications (DUP) and inversions (INV) labelled as PASS and PRECISE using bcftools 1.13 filter. Next, we used Manta version 1.6.0 (Chen et al., 2016) according to instructions for germline joint samples analysis (https://github.com/Illumina/manta/blob/master/docs/userGuide/README.md#germline-configuration-examples). We parallelized SV calling across the 28 chromosomes instead of across samples, since Manta has no built-in procedure for merging calls across samples. SVs tagged as BND (breakends) were converted into explicit inversions using the script convertInversion.py provided in Manta's installation directory. The 28 chromosome-specific VCFs were then concatenated into a single multisample file, which was filtered for PASS and PRECISE calls as well. The last short-read-based caller included in the pipeline was LUMPY (Layer et al., 2014), through smoove (version 0.2.7; Pedersen et al., 2020). Following recommendations for population-level calling (https://github.com/brentp/smoove#population-calling), we called SVs in the same manner as with DELLY. Only DEL, DUP, INV calls labelled as PRECISE were retained.

The three SV sets were then merged together using Jasmine version 1.1.5 (Kirsche et al., 2023), which integrates various information including chromosome, position, end, size and type to determine whether SV calls from different files or samples refer to the same SV or not. We ran Jasmine with parameters “--mutual_distance --max_dist_linear = 0.25”, so that the maximum distance allowed between two SVs for them to be merged scales with their size. The merged VCF was then edited with a custom R script (R version 4.1.2; R Core Team, 2021) in order to convert symbolic alternate alleles to explicit sequences and to standardize VCF fields. The formatted merged VCF was finally filtered for calls supported by at least two callers, with bcftools filter. Moreover, in accordance with the most prevalent definition of SVs (Feuk, Carson, & Scherer, 2006), variants smaller than 50 bp were considered as small indels instead of SVs and were therefore filtered out. This first set of merged SVs will be referred to as the short-read SV set (SR SVs). The detailed short-read SV calling pipeline can be found at https://github.com/LaurieLecomte/SVs_short_reads (version v1.0.0).
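To illustrate the principle of the ensemble step, the Python sketch below groups calls of the same type whose positions differ by at most a fraction of the smaller SV's length, then keeps groups supported by at least two callers and at least 50 bp long. It is a deliberately simplified greedy stand-in, not Jasmine's algorithm; the record layout, function names and example calls are hypothetical.

```python
from collections import namedtuple

SVCall = namedtuple("SVCall", "chrom pos svtype length caller")


def merge_calls(calls, linear=0.25):
    """Greedily cluster calls of the same type using a size-scaled distance threshold."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos)):
        for cluster in clusters:
            rep = cluster[0]  # first call of the cluster acts as its representative
            threshold = linear * min(rep.length, call.length)
            same_site = (rep.chrom, rep.svtype) == (call.chrom, call.svtype)
            if same_site and abs(rep.pos - call.pos) <= threshold:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


def supported_clusters(clusters, min_callers=2, min_length=50):
    """Keep clusters called by enough distinct tools and long enough to count as SVs."""
    return [
        cluster for cluster in clusters
        if len({call.caller for call in cluster}) >= min_callers
        and cluster[0].length >= min_length
    ]


calls = [
    SVCall("ssa01", 1000, "DEL", 400, "delly"),
    SVCall("ssa01", 1050, "DEL", 420, "manta"),
    SVCall("ssa01", 2000, "DEL", 60, "delly"),
    SVCall("ssa01", 2030, "DEL", 60, "smoove"),
    SVCall("ssa01", 9000, "INV", 30, "delly"),
    SVCall("ssa01", 9000, "INV", 30, "manta"),
]
clusters = merge_calls(calls)
consensus = supported_clusters(clusters)
```

In this toy example, the two 400-bp deletions 50 bp apart are merged (threshold of 100 bp), whereas the two 60-bp deletions 30 bp apart are not (threshold of 15 bp), which shows how a size-scaled threshold is stricter for small variants.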

Long reads

The SV calling procedure for Oxford Nanopore data is equivalent to the pipeline described above for SV detection from short reads, i.e., independent SV detection with three different tools, merging of SV calls across callers, and filtering for calls supported by at least two tools. However, since most long-read-based SV callers do not support multisample calling, SVs were first called separately for each sample using all three chosen tools, then merged across samples (across-sample merge) to obtain a single VCF per caller. The three multisample VCFs were then merged together (across-caller merge) to obtain the long-read SV set (LR SVs). The pipeline is available at https://github.com/LaurieLecomte/SVs_long_reads (version v1.0.0).

We ran Sniffles 2.0.7 (Sedlazeck, Rescheneder, et al., 2018; Smolka et al., 2022) (default settings and “--output-rnames --combine-consensus” options) on each sample and filtered for PASS and PRECISE calls. We then refined alternate allele sequences and breakpoints for insertions, deletions and some duplications by running Iris (Kirsche et al., 2023): we first preprocessed each sample's VCF with Jasmine (“--dup_to_ins --preprocess_only”), and ran Iris with parameters “--keep_long_variants --also_deletions”. The four samples' refined VCFs were merged together using Jasmine --ignore_strand --mutual_distance --allow_intrasample --output_genotypes, and refined SVs were then converted back to their original type with Jasmine --dup_to_ins --postprocess_only. The multisample Sniffles VCF was finally filtered again for PASS and PRECISE insertions, deletions, duplications and inversions. SVs were also called with SVIM 2.0.0 (Heller & Vingron, 2019) using parameters “--insertion_sequences --read_names --max_consensus_length=50000 --interspersed_duplications_as_insertions”, following the same procedure as Sniffles to produce a multisample VCF with PASS calls. Last, we used NanoVar 1.4.1 (Tham et al., 2020) with default settings. Supporting reads' names were added manually using a custom R script in order to allow for the refinement of SV breakpoints by Iris. The three multisample VCFs were finally merged together with Jasmine, formatted with custom R scripts and filtered, as described for the SR SV set.

Combination of SV datasets

A final merging step was performed for combining the short-read and long-read SV sets using Jasmine with parameters “--ignore_strand --ignore_merged_inputs --normalize_type --output_genotypes”, resulting in a large union set of candidate SVs to be genotyped in the 60 Atlantic salmon genomes (https://github.com/LaurieLecomte/merge_SVs_SRLR; version v1.0.0).

SV genotyping

We implemented a graph-based genotyping pipeline (https://github.com/LaurieLecomte/genotype_SVs_SRLR, version v1.0.0) using the vg toolkit version 1.46.0 (Hickey et al., 2020), following recommendations from https://github.com/vgteam/vg/wiki/SV-genotyping-with-vg#sv-genotyping-with-vg-call. We first built an indexed reference graph structure from the reference genome fasta and the SV VCF file using vg autoindex, then computed snarls, i.e., sites of known variation in the genome graph, with the vg snarls command. We then remapped short reads to the variant-aware reference graph for all samples separately using vg giraffe (Sirén et al., 2021), computed read support for variation sites (vg pack), then genotyped these sites (vg call). We used bcftools +setGT on each sample's VCF to set the genotype as missing (“./.”) for calls that were not supported by at least four reads and that had a quality score lower than five, or that had an extreme quality score (GQ = 256) or an extreme depth (DP > 80, i.e., more than five times the anticipated 16X coverage), as such calls tend to be false positives (Cameron et al., 2019). All 60 sample VCFs were merged together with bcftools merge. The genotyped SV set was finally filtered for variants with a minor allele frequency between 0.05 and 0.95, and less than 50% missing data. The 50% missingness threshold was arbitrary and based on Mérot et al. (2023). Comparison with both less and more stringent missing data proportion thresholds showed that the choice of threshold did not impact the post-filtering variant count differently for SVs than for SNPs or indels (Table S1).

In an effort to better link the genomic context and the confidence in SV calling, we compared the frequency distributions of both high-quality SVs that passed all filtering steps and low-quality, filtered out SVs in two genomic features known to interfere with SV calling: highly similar regions resulting from whole-genome duplication events (e.g., syntenic regions) and repeated content (repeats and transposable elements). To identify syntenic regions, we followed the steps described by Dallaire et al. (2023): in summary, we aligned the genome to itself with nucmer (built-in mapper in MUMmer version 4.0.0; Marçais et al., 2018), then performed synteny analysis with SyMAP (Soderlund et al., 2011), and re-mapped syntenic blocks to the genome with LASTZ (version 1.04.15; Harris, 2007) to get the homology percentage. We identified repeats and transposable elements using RepeatMasker (version 4.0.8; Smit et al., 2013). To distinguish between probable false positive SVs and probable true positive SVs, we relied on the filtering criteria we applied at the sample level (read depth and genotype quality) and at the population level (minor allele frequency and missing data proportion) during the genotyping procedure. Using bedtools window, we then extracted, from both the retained and the filtered out SV sets, the SVs overlapping with either a syntenic region or a repeat region (within a 100-bp window), or both.
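The window-based overlap test can be sketched in Python as follows, assuming BED-style intervals (0-based, half-open) and a symmetric 100-bp extension of each SV. The function names and example coordinates are hypothetical, and this naive pairwise comparison is meant only to show the logic, not to replace bedtools.

```python
def within_window(sv, feature, window=100):
    """True if the feature overlaps the SV once the SV is extended by `window` bp per side.

    Intervals are (chrom, start, end) tuples, 0-based and half-open as in BED files.
    """
    return (
        sv[0] == feature[0]
        and sv[1] - window < feature[2]
        and feature[1] < sv[2] + window
    )


def genomic_context(sv, syntenic_regions, repeats, window=100):
    """Label an SV by the feature classes found within the window around it."""
    in_synteny = any(within_window(sv, region, window) for region in syntenic_regions)
    in_repeat = any(within_window(sv, region, window) for region in repeats)
    if in_synteny and in_repeat:
        return "both"
    if in_synteny:
        return "syntenic"
    if in_repeat:
        return "repeat"
    return "neither"


sv = ("ssa01", 1000, 1200)
repeats = [("ssa01", 1250, 1300)]
```

Applying the same labelling to the retained and the filtered out SV sets yields the two frequency distributions that are compared in the text.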

In addition, because information on putative SVs is lost at the genotyping step, we applied the procedure described in Methods S1 to match genotyped SVs with a known putative SV based on the correspondence of position and allele length, in order to retrieve information on variant type, length and platform support (e.g., short- and/or long-read). This information allowed us to perform additional analyses on the set of genotyped and matched SVs. Indeed, in order to see if long-read SVs could be reliably genotyped from short-read data and a pangenome, we examined the concordance between the genotypes called by vg for long-read SVs and the genotypes called by long-read-based SV callers prior to merging datasets across platforms. First, for the four samples sequenced with both short- and long-read platforms, we extracted the three genotypes (from Sniffles, SVIM and NanoVar) for each long-read SV. We then determined the consensus genotype when possible, i.e., when at least two callers called the same genotype in a given sample. Each SV was labelled as concordant when its consensus genotype matched the corresponding genotype output by vg, or as non-concordant if its consensus genotype differed from the vg genotype. Alternatively, when all three callers output different genotypes for a given SV in a given sample, no consensus genotype could be inferred, and therefore the concordance between callers and vg could not be determined. We also performed this procedure for short-read SVs for comparison purposes. This concordance analysis is detailed in the scripts compare_GTs_LR_vs_vg.sh and compare_GTs_LR_vs_vg.R from the genotype_SVs_SRLR pipeline.
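The consensus and concordance rules described above can be sketched in Python as follows. This is an illustration only and does not reproduce the compare_GTs_LR_vs_vg scripts: the function names are hypothetical, genotypes are assumed to be coded as strings such as "0/1", and the treatment of missing genotypes (None) as undetermined is an assumption.

```python
from collections import Counter


def consensus_genotype(caller_genotypes, min_support=2):
    """Return the genotype shared by at least `min_support` callers, else None."""
    counts = Counter(gt for gt in caller_genotypes if gt is not None)
    if not counts:
        return None
    genotype, support = counts.most_common(1)[0]
    return genotype if support >= min_support else None


def concordance(caller_genotypes, vg_genotype):
    """Compare the callers' consensus genotype with the genotype reported by vg."""
    consensus = consensus_genotype(caller_genotypes)
    if consensus is None or vg_genotype is None:
        return "undetermined"
    return "concordant" if consensus == vg_genotype else "non-concordant"


# Genotypes from three callers for one SV in one sample, followed by the vg genotype.
agreeing = concordance(["0/1", "0/1", "1/1"], "0/1")
disagreeing = concordance(["0/1", "0/1", "1/1"], "1/1")
no_consensus = concordance(["0/0", "0/1", "1/1"], "0/1")
```

With three callers and a minimum support of two, at most one genotype can reach consensus, so the most frequent genotype is the only candidate to check.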

Population genomics analyses

Differentiation between the Romaine and Puyjalon populations

We used ANGSD version 0.937 (Korneliussen et al., 2014) to perform various population and comparative genomics analyses on SVs, SNPs and small indels separately by adapting a previous pipeline designed for SNPs (https://github.com/clairemerot/angsd_pipeline). To investigate population structure, we first performed principal component analysis (PCA) on a normalized covariance matrix produced from input VCF files using VCFtools (version 0.1.16; Danecek et al., 2011), pcangsd (Meisner & Albrechtsen, 2018) and custom R scripts. From input VCF files, we then estimated the average genome-wide fixation index (F_ST; Weir & Cockerham, 1984) from each population's allele frequency spectrum using ANGSD's -doSaf and realSFS functions. We also computed F_ST along the genome in sliding windows of 100 kb (per-window F_ST), as well as for each variant (per-variant F_ST).
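As an illustration of per-window F_ST, the sketch below aggregates per-site values into 100-kb windows. It assumes that each site comes with a numerator and a denominator component of F_ST and that the window value is the ratio of their sums, which is one common convention; whether this matches the exact computation used here is an assumption. The step size, the function name `windowed_fst` and the toy values are hypothetical.

```python
from typing import List, Tuple

# (position, numerator, denominator) for sites on one chromosome, 1-based positions
Site = Tuple[int, float, float]


def windowed_fst(sites: List[Site], window: int = 100_000,
                 step: int = 100_000) -> List[Tuple[int, int, float]]:
    """Return (start, end, F_ST) per window, as a ratio of summed components."""
    if not sites:
        return []
    last = max(pos for pos, _, _ in sites)
    out = []
    start = 1
    while start <= last:
        end = start + window - 1
        in_win = [(a, b) for pos, a, b in sites if start <= pos <= end]
        den = sum(b for _, b in in_win)
        if den > 0:  # skip windows without informative sites
            out.append((start, end, sum(a for a, _ in in_win) / den))
        start += step
    return out


toy_sites = [(10, 0.1, 1.0), (50_000, 0.3, 1.0), (150_000, 0.5, 1.0)]
for start, end, fst in windowed_fst(toy_sites):
    print(start, end, round(fst, 3))
```

Setting `step` smaller than `window` gives overlapping (sliding) windows; the per-variant value corresponds to the same ratio computed for a single site.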

We employed two complementary approaches for identifying candidate variants likely involved in local adaptation (Figure 1c). First, we extracted the most highly differentiated variants, i.e., those falling above the 97% quantile of the per-variant F_ST distribution (F_ST outliers). We also performed Fisher's exact tests on per-population allelic counts at each site and identified outliers with a corrected p-value (q-value) lower than 0.01 (Benjamini & Hochberg, 1995). We then extracted the outliers common to the F_ST and Fisher's exact test approaches to yield a set of strongly differentiated variants used for further analysis.
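The outlier intersection can be sketched with the Python standard library alone. This is an illustrative reimplementation under stated assumptions, not the scripts used for the analysis: the two-sided Fisher's exact test sums the probabilities of all tables no more likely than the observed one, the quantile uses a simple rank-based convention, and all function names are hypothetical.

```python
from math import comb
from typing import List, Sequence, Set, Tuple

# Allele counts per variant: (ref_pop1, alt_pop1, ref_pop2, alt_pop2)
Table = Tuple[int, int, int, int]


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    probs = [comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(max(0, col1 - row2), min(row1, col1) + 1)]
    p_obs = comb(row1, a) * comb(row2, col1 - a) / denom
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def fst_outliers(fst: Sequence[float], percent: int = 97) -> Set[int]:
    """Indices of variants above the given percentile (rank-based convention)."""
    if not fst:
        return set()
    ordered = sorted(fst)
    k = max(1, -(-percent * len(ordered) // 100))  # integer ceiling
    threshold = ordered[k - 1]
    return {i for i, v in enumerate(fst) if v > threshold}


def common_outliers(fst: Sequence[float], tables: Sequence[Table],
                    percent: int = 97, q_cutoff: float = 0.01) -> List[int]:
    """Variants that are both F_ST outliers and Fisher's exact test outliers."""
    qvalues = benjamini_hochberg([fisher_exact_two_sided(*t) for t in tables])
    fisher_hits = {i for i, q in enumerate(qvalues) if q < q_cutoff}
    return sorted(fst_outliers(fst, percent) & fisher_hits)
```

Calling `common_outliers(fst_values, count_tables)` on per-variant F_ST values and allele-count tables in matching order returns the indices of variants retained by both criteria.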

Second, we ran a redundancy analysis (RDA) on the imputed genotype matrix, with the population as the only explanatory variable, using the R package vegan (Oksanen et al., 2022). While F_ST and Fisher's exact test are more likely to detect outlier loci of large effect, RDA can identify covarying markers with individually weak effects that may be involved in polygenic control of phenotypic expression (Forester, Lasky, et al., 2018; Rellstab et al., 2015), as previously documented for life history traits such as age at sexual maturity or growth rate (Debes et al., 2021; Sinclair-Waters et al., 2020). We defined RDA candidates as variants whose loadings deviated from the mean loading by more than three standard deviations (Forester, Laporte & Manel, 2018). We thus obtained a set of outlier variants and a set of candidate variants for each of the three variant types studied.
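The three-standard-deviation rule for RDA candidates can be illustrated as follows. The sketch assumes a two-tailed cutoff around the mean loading on a single constrained axis and uses the sample standard deviation; the function name `rda_candidates` and the toy loadings are hypothetical, and the RDA itself (run with vegan in R) is not reproduced.

```python
from statistics import mean, stdev
from typing import List, Sequence


def rda_candidates(loadings: Sequence[float], z: float = 3.0) -> List[int]:
    """Indices of variants whose loading lies more than z SDs from the mean."""
    centre = mean(loadings)
    spread = stdev(loadings)  # sample standard deviation
    lower, upper = centre - z * spread, centre + z * spread
    return [i for i, x in enumerate(loadings) if x < lower or x > upper]


# Toy loadings: 50 variants with no signal and one with a strong loading
toy_loadings = [0.0] * 50 + [1.0]
print(rda_candidates(toy_loadings))
```

With population as the only explanatory variable there is a single constrained axis, so one vector of loadings per variant type is enough for this rule.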

Functional analysis of candidate genomic variation

In order to assess the potential functional impact of candidate variants on life-history trait variation observed in Romaine salmon, we investigated the overlap between variants of interest and known genes. We first annotated the Ssal_Brian_v1.0 assembly using the pipeline GAWN v0.3.5 (https://github.com/enormandeau/gawn) based on the transcriptome of the Ssal_v3.1 assembly (GenBank assembly accession: GCF_905237065.1) and filtered out possible duplicate annotations, which produced a list of 36,697 known genes.

For each set of variants of interest, we ran bedtools window (version 2.30.0; Quinlan & Hall, 2010) to identify a set of overlapping genes located within 10 kb of at least one variant. We then performed Gene Ontology (GO) enrichment analysis on each gene set with goatools 1.2.3 (Klopfenstein et al., 2018), using the list of 36,697 genes from GAWN as the background (population) set and the go-basic database version 1.2 (2022-07-01; http://release.geneontology.org/2022-07-01/ontology/go-basic.obo). Only enriched terms that referred to a biological process (BP) and had a corrected p-value (Benjamini & Hochberg, 1995) under 0.1 were considered significant and retained. We then used REVIGO (Supek et al., 2011) to cluster significant GO terms by semantic similarity with a cutoff value of 0.5 (“small list”), for easier interpretation. All scripts used for population genomics analyses can be found at https://github.com/LaurieLecomte/SVs_SNPs_indels_compgen (version v1.0.0).
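The logic of a GO enrichment test for a single term can be sketched as a hypergeometric tail probability. This is a simplified, one-sided over-representation test written for illustration; it is not the goatools implementation, and the function name `enrichment_pvalue` and the toy counts are hypothetical. Only the background size of 36,697 genes comes from the text.

```python
from math import comb


def enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the study set (e.g., genes within 10 kb of a candidate variant)
    k: study-set genes annotated with the GO term
    """
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total


# Toy example: 5 of 200 study genes carry a term annotated to 100 background genes
print(enrichment_pvalue(k=5, n=200, K=100, N=36_697))
```

In practice one such p-value is computed per GO term and the whole set is then corrected for multiple testing, here with the Benjamini-Hochberg procedure and a 0.1 threshold.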

RESULTS

Long reads revealed more variants while short reads allowed population-scale SV genotyping

SVs were identified through our multistep calling procedure involving both short- and long-read data, where different callers and datasets showed high variability in the number, types and sizes of SVs detected. Short reads revealed mostly deletions, whereas long reads allowed the detection of many more SVs, especially deletions, insertions and duplications. Among short-read-based callers, Manta reported the most SVs of various types and sizes (151,103), while smoove called the fewest (28,164), almost exclusively deletions (Table 1; Figure S1). In total, 238,492 SV calls were merged across the three callers, of which only 15.5% (37,041) were shared by at least two tools and were longer than 50 bp. This short-read SV set primarily consisted of deletions (34,761) smaller than 100 bp, with very few duplications (318) and insertions (849) (Table 2; Figure S2).

TABLE 1 Number of SVs reported by each short-read-based caller, and number of SV calls merged across these callers.

[Table omitted. See PDF]