Abstract
The evolutionary divergence of freshwater and marine fish reflects their adaptation to distinct ecological environments, with differences evident in their morphological traits, physiological functions, and genomic structures. Traditional molecular methods often fail to uncover the intricate regulatory relationships among genes under environmental stress. This study proposes the weighted attention gene analysis (WAGA) model, a novel approach that integrates natural language processing (NLP) for protein-coding gene feature representation with deep learning and self-attention (SA) mechanisms. WAGA effectively identifies key genes associated with sensory functions, osmoregulation, and growth and development on the basis of attention weights. The experimental results highlight its effectiveness in revealing genes crucial for ecological adaptation and evolution. This approach is essential for elucidating the mechanisms of ecological adaptability and evolutionary processes, while also offering novel insights and tools to support targeted breeding in aquaculture and fish genomics research.
Introduction
The evolution of jaws represents a pivotal event in the history of vertebrates and is characterized by the transformation of gill arches into mandibular arches [1]. Fish constitute the largest group of vertebrates, comprising more than one-half of the world’s living vertebrates [2]. Presently, more than 37,000 fish species have been identified globally [3]. Their common ancestors appeared in the ocean approximately 500 million years ago in the Devonian period [4, 5], and the fossil record can be traced back to the early Cambrian [6, 7]. Geographical changes and climate change have isolated specific fish species within particular aquatic ecosystems, resulting in divergence between marine and freshwater habitats [8, 9]. Concurrently, freshwater and marine fish have evolved different salinity regulatory mechanisms, such as the gill barrier, to adapt to their habitats [10]. The food chains and ecological niches in different environments have further driven the diverse evolution of fish, enabling them to adapt to various ecosystems and develop unique physiological traits [11, 12]. These studies have revealed the intricate evolutionary processes of fish in freshwater and marine environments, reflecting the important role of their ecological adaptation and physiological mechanisms.
Owing to their remarkable species diversity and morphological variation, fish have become a central focus of scientific research [2]. They are crucial for advancing scientific knowledge and fostering the growth of the aquaculture industry [13]. In parallel, the sequencing of fish genomes and transcriptomes has provided invaluable resources for ichthyological research [14]. However, the evolutionary processes of fish are shaped by a complex array of physiological and molecular mechanisms. These factors not only influence their physiological functions but also regulate gene expression and enable adaptive changes at the cellular level. Previous studies have shown that growth hormone (GH) [15, 16], insulin-like growth factor I (IGF-I) [15, 17], glucocorticoid receptor (GR) [18, 19], 11-deoxycorticosterone (DOC) [20], and thyroid hormone (TH) [21] support fish adaptation to marine environments, whereas prolactin (PRL) [15, 16, 18, 22] plays a role in their adaptation to freshwater habitats.
Furthermore, quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS) have achieved remarkable success in the molecular breeding of salt-tolerant, high-quality Nile tilapia [23]. At the same time, genomic technologies and breeding strategies, including SNP arrays, genomic selection, and genome editing, have greatly accelerated genetic improvement by integrating functional genomic insights directly into breeding activities [13]. However, the vast diversity of fish species and the wide range of their habitats lead to pronounced genomic structural variation, which continues to hinder in-depth functional gene analysis and the cross-species deployment of these breeding technologies.
Recent advancements in biotechnology and artificial intelligence have led to the widespread use of traditional methodologies for biological data mining and analysis, which play crucial roles in bioinformatics processing and biological knowledge discovery [24]. Key techniques in biological data mining include sequence analysis [25], data clustering [26], and association rule mining [27]. In contrast, data analysis approaches primarily involve gene expression profiling [28], gene and protein structure prediction [29], and weighted gene coexpression network analysis (WGCNA) [30]. These methods have been extensively applied in genomics research [31], disease studies [32], and evolutionary biology [33]. However, traditional data mining and statistical analysis methods often fail to handle the complexity of large-scale genomic data and encounter difficulties in accurately modelling nonlinear relationships. Additionally, the constructed networks do not fully capture the true relationships between genes, hindering the identification of key genes.
In comparison, deep learning algorithms have demonstrated exceptional performance in various bioinformatics tasks [34], including gene expression inference [35], protein structure prediction [36], protein classification [37], enhancer prediction [38], and RNA-Seq gene expression profile analysis [39]. These successes are attributed to the algorithms’ ability to automatically extract abstract features from raw data, eliminating the need for manual feature engineering. The self-attention mechanism (SA) [40], a key component of deep learning, allocates attention dynamically to different positions in the input data on the basis of learned weights. This allows the model to prioritize relevant information for the task by assigning higher weights to important data and lower weights to less significant data. Consequently, the attention mechanism enables the model to focus on the most relevant parts of the input sequence, enhancing its capacity to process and utilize key information for subsequent network computations. Despite these advancements, leveraging artificial intelligence, particularly deep learning, to connect environmental adaptation and evolutionary mechanisms with the underlying key genes remains a significant scientific challenge [41, 42].
In this study, we developed WAGA, a deep learning-based method with an attention mechanism designed to identify key genes involved in freshwater and seawater adaptation in fish. WAGA assigns different weights to genes, reflecting their relative importance in adaptive evolutionary processes. Using 128 species of Actinopterygii as a case study, we applied WAGA to identify high-weight genes and performed KEGG and GO enrichment analyses based on these genes. The results provide valuable insights into the molecular mechanisms underlying adaptive evolution in fish and have practical relevance for aquaculture.
Results
Data preprocessing results
Based on the original genomic data of 233 fish species, 153 species with genome completeness exceeding 90% were selected following BUSCO evaluation and sequence length analysis, with an average completeness of 94.9% (Supplementary File S1). Orthologous gene clustering identified a total of 14,206 representative orthogroups (Supplementary File S2). Habitat information for these 153 species was obtained from the FishBase database, comprising 66 freshwater species, 62 marine species, and 22 species inhabiting both freshwater and marine environments. An additional 3 species could not be classified due to the absence of habitat records in FishBase (Supplementary File S3). Consequently, 128 species with clearly defined habitat labels were selected as the final subjects for analysis (Supplementary File S4 and Table S1).
Subsequently, we performed a statistical analysis of k-mer frequencies across the 153 species (Supplementary File S5). Under the condition of retaining 75% of the corpus, the frequency threshold was set to 289.5, resulting in a vocabulary size of 8,417. For the BPE tokenization method, a minimum frequency of 300 was set, corresponding to a vocabulary size of 9,000. According to the BPE tokenization results (Supplementary File S6), the same minimum frequency of 300 was used in Word2Vec training, thereby preserving 99.98% of the corpus. During the data augmentation stage, 16 samples were generated for each species, expanding the original 128 samples to a total of 2,048 training samples (Supplementary File S7). The length distribution of sequences generated from the three sampling batches is shown in Fig. 1.
[IMAGE OMITTED: SEE PDF]
WAGA model results
In this study, the dataset was split in a 7:2:1 ratio, resulting in 1,433 samples for the training set, 410 samples for the validation set, and 205 samples for the test set. For each sample, 21,000 sequences were extracted and encoded into 100-dimensional vectors as input to the model for training. As shown in Fig. 2, during the early stages of training, the model quickly learned effective features, leading to a significant improvement in validation accuracy. However, as training continued, the model gradually overfitted the training data, causing a decline in validation accuracy. By applying optimization strategies such as learning rate adjustment, the model progressively learned more generalized features in subsequent training, resulting in a recovery and convergence of validation accuracy, demonstrating strong learning capability and generalization performance.
[IMAGE OMITTED: SEE PDF]
In the test set, as shown in Table 1, the three WAGA models (WAGA1, WAGA2, and WAGA3) all demonstrated excellent performance across four evaluation metrics: accuracy, recall, precision, and F1 score, each approaching 99%. Among them, WAGA1 achieved the highest scores in all metrics, with an F1 score of 99.51%, indicating the best overall performance. WAGA2 and WAGA3 also showed stable results, with F1 scores of 99.01% and 99.02%, respectively. Overall, the WAGA models maintained a high level across all evaluation metrics, confirming their effectiveness and robustness for this task.
[IMAGE OMITTED: SEE PDF]
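For readers less familiar with the four metrics named above, the following is a minimal sketch of how they follow from binary predictions. The function name `binary_metrics`, the label coding (1 = marine, 0 = freshwater), and the toy labels are assumptions made for illustration; they are not the study's data, and the averaging scheme actually used for Table 1 is not stated in the text.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels only (assumed coding: 1 = marine, 0 = freshwater).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With one false negative and one false positive among eight toy samples, all four values coincide; on real data they generally differ.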
Among the 128 real species, as shown in Table 2, all four evaluation metrics exceed 90%, indicating that the model can effectively distinguish between freshwater and saltwater fish based on genomic data. The model’s ROC curve is positioned near the top-left corner, indicating a strong discriminative ability. The area under the curve (AUC) is consistently above 0.98, approaching 1, which further validates the model’s excellent classification performance on this task. Detailed recognition results are presented in the confusion matrix (Supplementary Fig S4).
[IMAGE OMITTED: SEE PDF]
Hyperparameter sensitivity analysis
To evaluate the robustness and stability of the model under different training configurations, we performed a hyperparameter sensitivity analysis using WAGA1 as the baseline. Specifically, we varied one hyperparameter at a time—learning rate, optimizer type, number of GRU layers, and hidden dimension—while keeping all other settings fixed. The results on the test set are summarized in Supplementary Table S4. The learning rate was reduced by a factor of 10 every 10 epochs. Due to GPU memory constraints, the batch size was fixed at 4. Early stopping was employed based on validation accuracy, terminating training if no improvement was observed for 5 consecutive epochs. Since we only used one layer of the GRU network, and dropout needs to be used in two or more layers to be effective, we did not analyze dropout.
The results show that the model achieved the best performance when the learning rate was set to 0.01, the optimizer was Adam, the GRU consisted of a single layer, and the hidden dimension was set to 8, yielding an accuracy and F1-score of 0.9951. In contrast, reducing the learning rate to 0.001 or switching to the SGD optimizer led to a substantial drop in performance. Similarly, increasing the number of GRU layers or the hidden dimension degraded generalization, possibly due to overfitting or training instability caused by higher model complexity.
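The training schedule described in this subsection (learning rate divided by 10 every 10 epochs, early stopping after 5 epochs without improvement in validation accuracy) can be sketched in plain Python as follows. The names `step_lr` and `EarlyStopping` and the validation accuracies are hypothetical; the study's actual training code is not shown in the text and may rely on framework utilities instead.

```python
def step_lr(initial_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate in effect at `epoch` (0-based) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation accuracies, for illustration only.
val_history = [0.80, 0.90, 0.95, 0.94, 0.95, 0.93, 0.94, 0.95, 0.92, 0.96]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, acc in enumerate(val_history):
    lr = step_lr(0.01, epoch)  # 0.01 is the best-performing rate reported above
    if stopper.step(acc):
        stopped_at = epoch
        break
```

In this toy run the best accuracy is reached at epoch 2 and never strictly exceeded in the next five epochs, so training stops at epoch 7, before the later improvement would have been seen. This illustrates the trade-off that a short patience introduces.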
Model comparison results
As shown in Table 3 and Table 1, WAGA1 significantly outperforms all baseline approaches across all four evaluation metrics on the test set. In particular, WAGA1 achieves an F1-score of 0.9951, while the best-performing baseline (GRU) reaches 0.9705, and LSTM reaches 0.7744. Traditional machine learning models perform less competitively, with Logistic Regression, SVM, and BiLSTM achieving F1-scores below 0.60. These results confirm the superior performance of WAGA in this task, particularly in terms of maintaining high values across all key metrics.
[IMAGE OMITTED: SEE PDF]
Enrichment analysis results
Based on the three weight files, \(GWS_4\) contains 43 groups (Supplementary File S8). GO enrichment analysis was performed on the high-weight genes within these 43 groups, and the top 15 terms were selected for further analysis. The results showed that these genes were significantly enriched (\(p < 0.05\)) across all three major GO categories: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP) (Fig. 3). Detailed information can be found in Supplementary File S9.
[IMAGE OMITTED: SEE PDF]
In the CC category, these genes are mainly enriched in membrane-related structures, such as the integral component of the membrane, membrane part, cell surface, and cell-cell junction. This result indicates significant differences between saltwater and freshwater fish in terms of transmembrane transport, cell communication, and barrier functions, which are closely related to their physiological needs for osmoregulation, salt exchange, and adaptation to aquatic environments. In the BP category, significantly enriched processes include signal transduction, neurological system process, visual perception, response to stimulus, and sensory perception of light stimulus. These results suggest that the identified genes are extensively involved in the perception and transduction of external signals, the development of the visual system, and the response to environmental stimuli. This indicates that variations in light conditions, chemical signals, and ion concentrations between marine and freshwater environments may have exerted selective pressure and driven adaptive evolution of these pathways in fish inhabiting different habitats. In the MF category, the genes were mainly enriched in functions such as structural constituent of eye lens, olfactory receptor activity, signaling receptor activity, transmembrane signaling receptor activity, and gamma-aminobutyric acid:sodium symporter activity. This further indicates that these protein-coding genes play crucial roles in the functioning of sensory organs (such as olfaction and vision), signal recognition, and transmembrane ion transport. These functions may influence the ability of fish to detect and respond to food sources, mating cues, and danger signals in different aquatic environments.
The KEGG enrichment analysis revealed that the high-weight genes were mainly enriched in pathways such as Neuroactive ligand-receptor interaction (\(-\log_{10} P = 11.80\)), Cell adhesion molecules (\(-\log_{10} P = 10.30\)), Tight junction (\(-\log_{10} P = 7.37\)), Amino sugar and nucleotide sugar metabolism (\(-\log_{10} P = 5.01\)), and Lysosome (\(-\log_{10} P = 1.29\)) (Fig. 4).
[IMAGE OMITTED: SEE PDF]
The Neuroactive ligand-receptor interaction pathway is one of the most significantly enriched pathways in this study. It includes multiple olfactory receptor-related genes, such as members of the taar gene family. These receptors are primarily involved in neural signal transmission and chemosensation. In aquatic environments, they may help fish perceive changes in salinity, temperature, and chemical composition, thereby regulating behavior and physiological functions to enhance environmental perception and adaptation to either seawater or freshwater conditions. The co-enrichment of the Cell adhesion molecules and Tight junction pathways highlights the critical role of adhesion molecules and tight junction proteins (such as the cldn gene family) in environmental adaptation. Tight junctions play a vital role in maintaining the epithelial barrier function of fish gills and intestines, contributing to the regulation of ion exchange. Multiple chia genes were enriched in the amino sugar and nucleotide sugar metabolism pathway. The proteins encoded by these genes can degrade chitin and utilize its metabolic products for glycosylation, energy supply, and immune defense. Genes involved in the lysosome pathway, such as litaf and other unknown genes (e.g., si:ch211-202h22), are associated with autophagy and immune responses. The differential expression of this pathway between freshwater and marine fish reflects differences in immune system regulation as they respond to environmental stressors such as osmotic pressure changes and pathogen challenges. Detailed information can be found in Supplementary File S10.
Materials and methods
Method details
The methodology of this study, which includes (a) datasets, (b) data preprocessing, (c) WAGA model construction, (d) weight fusion, and (e) assessment of key genes, is outlined in Fig. 5. Supplementary Tables S2 and S3 provide the complete model configuration parameters and installation instructions for the required dependencies. All experiments were conducted on a platform equipped with an NVIDIA A100 80GB GPU and two AMD EPYC 7402 24-core processors.
[IMAGE OMITTED: SEE PDF]
This study utilizes protein-coding genes from 153 fish species as raw data and expands the BPE tokenization corpus with processed data to comprehensively explore the implicit relationships between amino acids. Owing to the inability to determine labels for 25 species, model training focuses on 128 fish species, classified by their respective freshwater and saltwater habitats. The task of identifying key genes in freshwater and saltwater fish is framed as a text classification problem within NLP. Using the WAGA model, protein sequences are transformed into vector representations that capture hierarchical relationships, which are subsequently input into a neural network with an attention mechanism for classification. During this process, the weight of each gene is calculated.
Data collection
In this study, we obtained information on a total of 2,267 Actinopterygii species from the National Center for Biotechnology Information (NCBI) public genome database, of which 233 species have complete genomic data available for download. The downloaded datasets included protein sequences (FASTA format) and corresponding gene annotation files (GFF format) for each species. The habitats associated with these fishes were retrieved from the FishBase. Based on habitat type, these species were classified into two groups: freshwater and marine fishes.
Data preprocessing
The data preprocessing workflow is shown in Fig S1. Technical sequencing quality metrics can be complemented by quantifying the completeness of a genomic dataset in terms of its expected gene content, as done by Benchmarking Universal Single-Copy Orthologs (BUSCO). To assess the completeness and redundancy of the genomic data for the 233 fish species, we used BUSCO to identify conserved single-copy orthologues in each genome, with the Actinopterygii_odb10 database as the reference.
Because the protein sequences downloaded from NCBI include pseudogenes, we extracted protein-coding genes using the relevant information in the GFF files. The quality of the protein sequence data directly affects the identification of orthologous gene clusters and the accuracy of orthologue inference. We therefore also removed duplicates (only the longest sequence was retained for each gene ID) and performed a sequence length analysis for each species to remove redundant sequences, which facilitates the subsequent construction of word hierarchies.
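The de-duplication rule (keep only the longest sequence per gene ID) can be sketched as below. This is an illustrative sketch, not the study's pipeline: the record layout `(gene_id, protein_id, sequence)`, the function name `keep_longest_per_gene`, and the toy records are assumptions, and the step that links proteins to gene IDs through the GFF file is taken as already done.

```python
def keep_longest_per_gene(records):
    """Keep the longest protein sequence for each gene ID.

    records: iterable of (gene_id, protein_id, sequence) tuples
             (hypothetical layout, assumed to be derived from FASTA + GFF).
    Returns a dict mapping gene_id -> (protein_id, sequence).
    """
    longest = {}
    for gene_id, protein_id, seq in records:
        current = longest.get(gene_id)
        # On ties, the first sequence encountered is kept.
        if current is None or len(seq) > len(current[1]):
            longest[gene_id] = (protein_id, seq)
    return longest


# Toy records for illustration only.
records = [
    ("geneA", "protA.1", "MKTLLV"),
    ("geneA", "protA.2", "MKTLLVAGSW"),
    ("geneB", "protB.1", "MSD"),
]
selected = keep_longest_per_gene(records)
```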
In addition, to make the later data augmentation more convincing, we used OrthoFinder to perform orthologous gene analysis on the fish genome data screened by the above process to identify orthogroups (groups) between different fish species and infer orthologous gene relationships. Some groups were thought to have contributed little to the evolutionary difference between freshwater and marine fish since they included only homologous genes from a small number of species.
WAGA model construction
This study uses fish genomic data to establish the WAGA classification model, which identifies key genes associated with the aquatic environment of fish species. The model improves the data’s representation capabilities by performing a semantic understanding of species gene data at several semantic levels. During model training, we adjust the hyperparameters of the model and add an adaptive learning rate adjustment strategy and an early stopping strategy to ensure efficient model training. The structure of the WAGA classification model is illustrated in Fig. 6. The entire model consists of five main components: word embedding, semantic feature representation, contextual feature extraction, sentence-level attention, and Softmax classification.
[IMAGE OMITTED: SEE PDF]
Word embedding
Byte Pair Encoding (BPE) and k-mer are two popular tokenization methods. In bioinformatics, k-mer tokenization was first widely used to process biological sequences such as DNA and RNA [43]. However, as k increases, the k-mer feature space grows exponentially, which increases computational complexity and makes it harder to capture long-range relationships. In contrast, BPE is a more flexible tokenization method that addresses typical problems of conventional tokenization strategies, including vocabulary sparsity in multilingual processing and out-of-vocabulary (OOV) terms. BPE has been increasingly used in bioinformatics in recent years, especially in protein sequence analysis. In this study, we trained a BPE model on the protein sequence data of 153 fish species. Starting from single amino acids, a hierarchical vocabulary is formed by iteratively merging the most frequent pairs of adjacent units in the protein sequences, which gradually creates longer subword units. Tokenizers based on the BPE method are then used to break down lengthy protein sequences into the smallest units that have a pattern or semantic meaning (Fig S2a).
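The iterative pair-merging idea behind BPE can be illustrated with a short, self-contained sketch. It is a simplified teaching version written for this explanation, not the tokenizer used in the study: the function names, the toy sequences, and the way `min_frequency` is applied (as a cutoff on pair counts) are all assumptions.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return ((left, right), count) for the most common adjacent pair."""
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0] if counts else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` in `tokens` by the merged unit."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges, starting from single amino acids."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(corpus)
        if best is None or best[1] < min_frequency:
            break
        pair = best[0]
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return merges, corpus


# Toy protein fragments, for illustration only.
toy = ["MKTMKT", "MKTAA", "AAMK"]
merges, tokenized = train_bpe(toy, num_merges=3)
```

On the toy input the learned merges are `M+K`, then `MK+T`, then `A+A`, which shows how longer subword units are built from shorter ones.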
Word embedding and one-hot encoding are two distinct word representation methods in NLP. Word embedding maps words in text into a continuous vector space, preserving the original sequential information and capturing contextual relationships between sentences. In contrast, one-hot encoding generates one-dimensional vectors in which a single value is 1 and all others are 0. This approach leads to high dimensionality and sparsity, with zero similarity between words and sentences. Word2vec, a neural network-based word embedding model introduced by Mikolov et al. (2013) [44], provides two distinct approaches for generating word representations: the skip-gram (Fig S2c) and continuous bag of words (CBOW, Fig S2b) models. The CBOW architecture is used in this study to learn word vector representations because of its benefits, which include its capacity to facilitate context learning, capture semantic links, and train efficiently. In the output layer, negative sampling with the sigmoid function replaces the softmax function, which lowers computing costs while enhancing the quality of the vector representations. The subword units segmented by BPE (one or more amino acids) are converted into dense vector representations using CBOW.
In CBOW word embedding, assuming a window size of 2k, the average vector of context words (hidden layer vector) h is calculated as shown in Eq. 1 for the target word with context words \(W_{(t-k)}, \ldots , W_{(t-1)}, W_{(t+1)}, \ldots , W_{(t+k)}\). The goal of the output layer is to compute the score using the dot product of the embedding vectors of each word in the vocabulary with the average vector of the context word as in Eq. 2.
$$\begin{aligned} h = \frac{1}{2k} \sum _{\substack{i=-k \\ i \ne 0}}^{k} W^{T} x_{t+i} \end{aligned}$$
(1)
$$\begin{aligned} u_{t} = W^{\prime T} h \end{aligned}$$
(2)
where \(W^T\) is the transpose of the word embedding matrix, \(x_{(t+i)}\) is the one-hot encoding of each context word, and \({W^T}x_{(t+i)}\) denotes the embedding vector of each context word. \(W^{\prime T}\) is the transpose of another embedding matrix denoting the weights of the output layer.
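Eqs. 1 and 2 can be made concrete with a small numerical sketch. Here the embedding matrices are stored row-wise (one row per vocabulary word), so multiplying \(W^T\) by a one-hot vector reduces to a row lookup. The function name `cbow_forward` and all numbers are hypothetical and chosen only so the result is easy to check by hand.

```python
def cbow_forward(context_ids, W, W_out):
    """CBOW forward pass.

    W:     |V| x d input embeddings; row i equals W^T x for one-hot x of word i.
    W_out: |V| x d output embeddings; row j is column j of W' in Eq. 2.
    """
    d = len(W[0])
    # Eq. 1: average the embeddings of the 2k context words.
    h = [sum(W[i][j] for i in context_ids) / len(context_ids)
         for j in range(d)]
    # Eq. 2: one score per vocabulary word, the dot product with h.
    u = [sum(w * x for w, x in zip(row, h)) for row in W_out]
    return h, u


# Toy vocabulary of 4 tokens with 2-dimensional embeddings (illustrative).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
W_out = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]]
# Window with k = 1: the context of the target word is tokens 0 and 2.
h, u = cbow_forward([0, 2], W, W_out)
```

For this input, `h` is `[1.0, 0.5]` and the scores `u` are `[1.0, 1.0, 1.5, -1.0]`, so token 2 receives the highest score.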
The core idea of negative sampling is to shift the optimization objective of the loss function from "predicting the distribution of the entire vocabulary" to "accurately distinguishing between positive and negative samples." Positive samples consist of pairs formed by the target word and its context word, denoted as \((W_t, W_{content})\), where \(W_{content}\) represents the context word. For each positive sample, M words unrelated to the context are randomly sampled from the vocabulary as negative samples. The sampling process is guided by the word frequency distribution, with the sampling probability calculated via Eq. 3. The loss function for negative sampling is defined in Eq. 4.
$$\begin{aligned} P(\omega ) = \frac{f(\omega )^{\frac{3}{4}}}{ {\textstyle \sum _{\mu \in V} {f(\mu )^\frac{3}{4} }}} \end{aligned}$$
(3)
$$\begin{aligned} L = -\left( \log \sigma \left( V_{W_{content}}\cdot V_{W_{t}}\right) + \sum _{i=1}^{M} \log \sigma \left( -V_{W_{content}}\cdot V_{W_{neg,i}}\right) \right) \end{aligned}$$
(4)
where \(P(\omega )\) represents the probability of selecting word \(\omega\) as a negative sample, \(f(\omega )\) denotes the frequency of word \(\omega\), \(V\) is the vocabulary, and \(\textstyle \sum _{\mu \in V} {f(\mu )^\frac{3}{4} }\) is the sum over the vocabulary of all word frequencies raised to the \(\frac{3}{4}\) power. The exponent \(\frac{3}{4}\) is an empirically determined value that effectively balances the sampling probabilities of high-frequency and low-frequency words. \(V_{W_{content}}\) is the embedding vector of the context word, \(V_{W_{t}}\) is the embedding vector of the target word, and \(V_{W_{neg, i}}\) is the embedding vector of the i-th negative sample. Finally, \(\sigma (x)\) refers to the sigmoid function.
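A minimal sketch of Eqs. 3 and 4 follows. The token frequencies, vectors, and function names are hypothetical and serve only to show how the \(\frac{3}{4}\) exponent flattens the sampling distribution and how the loss is assembled from one positive pair and M negative samples.

```python
import math
import random


def negative_sampling_probs(freqs, power=0.75):
    """Eq. 3: sampling probability proportional to frequency ** (3/4)."""
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_context, v_target, v_negatives):
    """Eq. 4: one positive pair plus M negative samples."""
    loss = -math.log(sigmoid(dot(v_context, v_target)))
    for v_neg in v_negatives:
        loss -= math.log(sigmoid(-dot(v_context, v_neg)))
    return loss


# Hypothetical token frequencies: raw proportions would be 16/17 and 1/17.
freqs = {"MKT": 16, "AA": 1}
probs = negative_sampling_probs(freqs)

# Draw M = 2 negatives according to Eq. 3 (seeded for repeatability).
rng = random.Random(0)
negatives = rng.choices(list(probs), weights=list(probs.values()), k=2)

# With all-zero vectors every sigmoid equals 0.5, so the loss is (M + 1) ln 2.
loss = negative_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]] * 2)
```

With the exponent applied, the rare token is sampled with probability 1/9 rather than 1/17, which is the balancing effect described above.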
Data augmentation
After BPE tokenization, the species data were randomly sampled to expand the dataset (Fig. 7), and we ensured that the number of samples remained balanced between freshwater and marine fish after expansion, as required for deep learning training. The sampling method randomly selects 14,000 groups at a time and then samples each species on the basis of its homologous genes in those groups. This process is repeated until the desired number of samples is obtained. This method allows all sequences of each species to be input uniformly into the model for weight calculation, which preserves the integrity of the information while working within limited GPU memory.
[IMAGE OMITTED: SEE PDF]
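The group-based sampling described in this subsection can be sketched as follows. The text does not specify how a sequence is chosen within a group or how groups missing from a species are handled, so the sketch makes two labelled assumptions: one randomly chosen sequence per selected group, and missing groups are skipped. The function name, data layout, and toy identifiers are hypothetical.

```python
import random


def augment_species(species_genes, all_groups, groups_per_sample,
                    n_samples, seed=0):
    """Generate `n_samples` augmented samples for one species.

    species_genes: dict orthogroup ID -> list of tokenized sequences that
                   this species has in that group (hypothetical layout).
    all_groups:    list of all orthogroup IDs to draw from.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Randomly select a subset of groups (14,000 in the study).
        chosen = rng.sample(all_groups, groups_per_sample)
        sample = []
        for group in chosen:
            members = species_genes.get(group)
            if members:  # assumption: skip groups absent from this species
                # Assumption: take one random sequence per selected group.
                sample.append((group, rng.choice(members)))
        samples.append(sample)
    return samples


# Toy data: 5 groups, 3 selected per sample, 4 samples (the study uses 16).
all_groups = ["OG1", "OG2", "OG3", "OG4", "OG5"]
species_genes = {
    "OG1": [["MK", "T"]],
    "OG2": [["AA"], ["A", "MK"]],
    "OG4": [["MKT"]],
}
samples = augment_species(species_genes, all_groups,
                          groups_per_sample=3, n_samples=4)
```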
Word hierarchy construction and gene vector representation
Words are the smallest semantic units in natural language; phrases are made up of two or more words, and sentences are created by combining phrases according to specific grammatical rules. Multiple sentences with more complex semantics form paragraphs, articles, or dialogues. Following this NLP view, the protein sequences of a single species are treated in this study as an article, with each sequence treated as a sentence. Because the augmented data have already been tokenized, we can directly construct word hierarchies (Fig S3a). After determining the hierarchical links, we apply the word embedding model trained with word2vec to generate sequence vectors (Fig S3b).
Contextual feature extraction and classification
The gated recurrent unit (GRU) is a variant of the recurrent neural network (RNN) that mitigates the long-distance dependence problem and the vanishing gradient phenomenon of the traditional RNN by introducing a gating mechanism. The GRU consists of an update gate and a reset gate, which select and control how information is transmitted and updated. The specific calculation process is shown in Eqs. 5-8.
$$\begin{aligned} z_{t} = \sigma \left( W_{z} \cdot x_{t}+U_{z} \cdot h_{t-1}+b_{z}\right) \end{aligned}$$
(5)
$$\begin{aligned} r_{t} = \sigma \left( W_{r} \cdot x_{t}+U_{r} \cdot h_{t-1}+b_{r}\right) \end{aligned}$$
(6)
$$\begin{aligned} \widetilde{h_{t}} = \tanh \left( W_{h} \cdot x_{t}+r_{t} \odot \left( U_{h} \cdot h_{t-1}\right) +b_{h}\right) \end{aligned}$$
(7)
$$\begin{aligned} h_{t} = \left( 1-z_{t}\right) \odot h_{t-1}+z_{t} \odot \widetilde{h_{t}} \end{aligned}$$
(8)
where \(W_z\),\(W_r\),\(W_h\), \(U_z\), \(U_r\) and \(U_h\) represent the weight parameters of the network, and \(b_z\), \(b_r\) and \(b_h\) denote the bias terms. \(z_t\) represents the update gate, and \(r_t\) is the reset gate, which controls the proportion of information passing through. The reset gate \(r_t\) determines the amount of past information used for computing the candidate state; if \(r_t\)=0, it indicates that all previous states are forgotten. The update gate \(z_t\) is used to control the retention of past information and the incorporation of new information. \(\sigma\) represents the sigmoid activation function, \(h_{t-1}\) denotes the hidden state at the previous time step, \(\widetilde{h_{t}}\) represents the candidate hidden state, and \(h_t\) denotes the hidden state at time t.
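A single GRU time step following Eqs. 5-8 can be written out in plain Python as below. This is an illustrative sketch rather than the study's implementation, which presumably relies on a deep learning framework; the parameter dictionary `p` and the helper names are assumptions. Note that Eq. 8 weights the candidate state by \(z_t\); some libraries use the opposite convention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def gru_step(x_t, h_prev, p):
    """One GRU step; p holds Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh."""
    # Eq. 5: update gate.
    z = [sigmoid(v) for v in
         add(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev), p["bz"])]
    # Eq. 6: reset gate.
    r = [sigmoid(v) for v in
         add(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev), p["br"])]
    # Eq. 7: candidate state; the reset gate scales the recurrent term.
    gated = [r_i * u_i for r_i, u_i in zip(r, matvec(p["Uh"], h_prev))]
    h_cand = [math.tanh(v) for v in
              add(matvec(p["Wh"], x_t), gated, p["bh"])]
    # Eq. 8: interpolate between the previous and the candidate state.
    return [(1 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


# Toy check: with all-zero parameters both gates equal 0.5 and the candidate
# state is 0, so the new hidden state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({"bz": [0.0, 0.0], "br": [0.0, 0.0], "bh": [0.0, 0.0]})
h_new = gru_step([1.0, 2.0], [1.0, -2.0], params)
```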
Because a single unidirectional GRU has difficulty capturing complex relationships between genes, and because the sequence data differ in length between species, this paper adopts a bidirectional gated recurrent unit [45] (BiGRU) with two parallel channels to model the features between the extracted species genes and to handle unequal-length sequences. One GRU models semantics from the beginning of a sentence to the end, and the other from the end of a sentence to the beginning; the hidden states of the two GRUs are then concatenated as the representation of each position. In this way, the output at the current moment is related not only to the previous state but also to the future state, so that context in both directions is considered. The process of bidirectional computation is shown in Eqs. 9-11.
$$\begin{aligned} \overrightarrow{h_{t}}=GRU\left( \overrightarrow{h_{t-1}},x_{t}\right) \end{aligned}$$
(9)
$$\begin{aligned} \overleftarrow{h_{t}}=GRU\left( \overleftarrow{h_{t+1}},x_{t}\right) \end{aligned}$$
(10)
$$\begin{aligned} h_{t}=\left[ \overrightarrow{h_{t}}, \overleftarrow{h_{t}}\right] \end{aligned}$$
(11)
where \(\overrightarrow{h_{t}}\) represents the forward hidden state output, \(\overleftarrow{h_{t}}\) represents the backward hidden state output, and \(x_t\) denotes the semantic feature input (gene features). Multi-head self-attention [45] (MHSA) is incorporated into the BiGRU model to dynamically assign different weights to the time steps of the input sequence, enabling a more flexible capture of useful information in long sequences. The MHSA consists of multiple SA modules whose outputs are concatenated along the channel dimension. Each SA module contains three learnable weight matrices: the query (\(W^Q\)), key (\(W^K\)), and value (\(W^V\)) matrices. Applying the input features X to these weight matrices yields the three matrices Q, K, and V (\(Q=XW^Q\), \(K=XW^K\), and \(V=XW^V\)), and the SA operation is calculated as follows:
$$\begin{aligned} Attention(Q,K,V) = softmax(\frac{Q\cdot {K}^{T}}{\sqrt{d_{k}}})V \end{aligned}$$
(12)
where \(\frac{Q\cdot K^{T}}{\sqrt{d_{k}}}\) represents the attention score and \(d_{k}\) is the dimension of the query and key vectors, which is used to scale the attention scores. The MHSA contains several parallel self-attention heads, defined as follows:
$$\begin{aligned} head_i = Attention(QW_i^{Q},KW_i^{K},VW_i^{V}) \end{aligned}$$
(13)
$$\begin{aligned} MHSA(Q,K,V) = Concat(head_1,\cdots , head_h)W^{O} \end{aligned}$$
(14)
where \(head_i\) denotes the ith attention head, \(h\) is the number of heads, and \(W_i^{Q},W_i^{K},W_i^{V}\), and \(W^O\) are the projection matrices learned by the MHSA.
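As a rough illustration of Eqs. 9–14, the following NumPy sketch runs two GRUs in opposite directions, concatenates their hidden states, and applies multi-head self-attention to the result. It is a minimal sketch under stated assumptions, not the WAGA implementation: the helper names (`init_gru`, `gru_step`, `bigru`, `mhsa`), the toy dimensions, the random weights, the omission of biases, and the particular GRU gate convention are all assumptions introduced here, and the BiGRU output is used directly as the query, key, and value input of every head.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_gru(rng, d_in, d_hid):
    """Hypothetical helper: random weights for one GRU direction (no biases)."""
    return {k: rng.normal(0, 0.1, size=(d_hid + d_in, d_hid)) for k in ("z", "r", "n")}

def gru_step(params, h_prev, x_t):
    """One update h_t = GRU(h_prev, x_t), using one common gate convention."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(hx @ params["z"])                                  # update gate
    r = sigmoid(hx @ params["r"])                                  # reset gate
    n = np.tanh(np.concatenate([r * h_prev, x_t]) @ params["n"])   # candidate state
    return (1.0 - z) * h_prev + z * n

def bigru(fwd, bwd, X, d_hid):
    """Eqs. 9-11: forward and backward passes, concatenated per position."""
    T = X.shape[0]
    h_f, h_b = np.zeros((T, d_hid)), np.zeros((T, d_hid))
    h = np.zeros(d_hid)
    for t in range(T):                  # Eq. 9: state flows from t-1 to t
        h = gru_step(fwd, h, X[t])
        h_f[t] = h
    h = np.zeros(d_hid)
    for t in reversed(range(T)):        # Eq. 10: state flows from t+1 to t
        h = gru_step(bwd, h, X[t])
        h_b[t] = h
    return np.concatenate([h_f, h_b], axis=1)   # Eq. 11, shape (T, 2 * d_hid)

def attention(Q, K, V):
    """Eq. 12: scaled dot-product attention; also returns the weight matrix."""
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

def mhsa(H, heads, W_o):
    """Eqs. 13-14: parallel heads, concatenated and projected by W_o."""
    outs, weights = zip(*(attention(H @ W_q, H @ W_k, H @ W_v) for W_q, W_k, W_v in heads))
    return np.concatenate(outs, axis=1) @ W_o, np.stack(weights)

rng = np.random.default_rng(0)
T, d_in, d_hid, n_heads = 6, 8, 4, 2            # toy sizes, not the paper's settings
X = rng.normal(size=(T, d_in))                  # stand-in for T gene-feature vectors
H = bigru(init_gru(rng, d_in, d_hid), init_gru(rng, d_in, d_hid), X, d_hid)
d_model, d_k = 2 * d_hid, d_hid
heads = [tuple(rng.normal(0, 0.1, size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(0, 0.1, size=(n_heads * d_k, d_model))
out, attn = mhsa(H, heads, W_o)
print(out.shape, attn.shape)                    # (6, 8) (2, 6, 6)
```

Each row of `attn[i]` sums to one, so it can be read as how strongly one position attends to every other position in head `i`; weights of this kind are what an attention-based ranking of genes would draw on.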
Weight fusion and genetic evaluation
This study considers the impact of random sampling on the model and aims to ensure its stability. Using the random sampling method described for data augmentation, we sampled three batches of data as training datasets. Each batch was transformed into feature vectors and used to train a separate classification model (WAGA1, WAGA2, and WAGA3). For the trained models, we performed three additional rounds of sampling on the species to generate the data for the weight files, and each sample was tested individually with the models. Provided that the three models classified the samples correctly, the weight of each gene was obtained through the attention mechanism, normalised, and summed to give three weight files (\(GWS_1, GWS_2\) and \(GWS_3\)). The groups in each weight file were ranked, the top 3% were selected, and the intersection of the three selections formed \(GWS_4\) (Eq. 15), which was designated the high-weight gene set.
$$\begin{aligned} GWS_4 = GWS_1\cap GWS_2\cap GWS_3 \end{aligned}$$
(15)
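The ranking and intersection step can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the group IDs, and the toy weights are hypothetical, and each weight file is assumed to be a mapping from group ID to its summed, normalised attention weight.

```python
import math

def top_percent(weights, percent=3):
    """Return the IDs of the top `percent`% of groups by weight (at least one)."""
    k = max(1, math.ceil(len(weights) * percent / 100))
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])

def fuse(gws_1, gws_2, gws_3, percent=3):
    """Eq. 15 applied to the top-ranked groups of each weight file."""
    return (top_percent(gws_1, percent)
            & top_percent(gws_2, percent)
            & top_percent(gws_3, percent))

# Toy weight files over 100 hypothetical group IDs (top 3% = 3 groups each).
gws_1 = {f"g{i}": 100.0 - i for i in range(100)}            # top 3: g0, g1, g2
gws_2 = {f"g{i}": 100.0 - abs(i - 2) for i in range(100)}   # top 3: g1, g2, g3
gws_3 = {f"g{i}": 1.0 for i in range(100)}
gws_3.update({"g1": 9.0, "g2": 8.0, "g50": 7.0})            # top 3: g1, g2, g50

gws_4 = fuse(gws_1, gws_2, gws_3)
print(sorted(gws_4))   # ['g1', 'g2']
```

Only groups that rank highly in all three weight files survive the intersection, which is how the procedure reduces the influence of any single random sample.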
To further investigate the biological processes and molecular functions of the genes associated with \(GWS_4\) in freshwater and marine fish, enrichment analysis was conducted using the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO). These analyses revealed the biological roles of the selected high-weight genes in the two groups. First, zebrafish annotations were generated using the KEGG database and used as a background file (Supplementary Data S11) on the GeneCloud platform to annotate the high-weight genes. ID conversion (Supplementary File S12) was then carried out using the BioDBnet platform, and GO enrichment analysis was performed using the OmicShare platform.
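For readers unfamiliar with enrichment analysis, the sketch below shows the one-sided hypergeometric test that is commonly used to score over-representation of a term in a gene set. Whether the platforms named above use exactly this test, and which multiple-testing correction they apply, is not stated in the text, so this is an assumption; the function name and the numbers in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: annotated genes in the background
    K: background genes carrying the term
    n: genes in the selected (high-weight) set
    k: selected genes carrying the term
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Hypothetical example: 1000 background genes, 50 carry the term,
# and 6 of the 30 selected genes carry it (1.5 expected by chance).
p = enrichment_p_value(N=1000, K=50, n=30, k=6)
print(f"{p:.2e}")
```

In practice one p-value is computed per term and then corrected for multiple testing before terms are reported as enriched.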
Quantification and statistical analysis
We tested the models and recorded the accuracy (Acc), recall (R), precision (P), and F1-score (F1) of the three models on protein sequences from the test set and from 128 real species. We also plotted ROC curves and confusion matrices to evaluate the performance of the WAGA model, using marine fish as the positive class (label = 1). The metrics were calculated as shown in Eqs. 16–19.
$$\begin{aligned} Acc = \frac{TP+TN}{TP+TN+FN+FP} \end{aligned}$$
(16)
$$\begin{aligned} P = \frac{TP}{TP+FP} \end{aligned}$$
(17)
$$\begin{aligned} R = \frac{TP}{TP+FN} \end{aligned}$$
(18)
$$\begin{aligned} F1 = \frac{2 \times P \times R}{P+R} \end{aligned}$$
(19)
where TP denotes the number of samples correctly predicted to be in the positive category, TN denotes the number of samples correctly predicted to be in the negative category, FP denotes the number of samples incorrectly predicted to be in the positive category (false positives), and FN denotes the number of samples incorrectly predicted to be in the negative category (missed positives).
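Eqs. 16–19 translate directly into code. The sketch below counts TP, TN, FP, and FN from label lists with marine fish as the positive class (label 1) and then computes the four metrics; the function names and the example counts are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not one taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN, treating `positive` (marine = 1) as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Eqs. 16-19; a zero denominator yields 0.0 (a convention assumed here)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"Acc": acc, "P": p, "R": r, "F1": f1}

counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(counts)   # (2, 1, 1, 1)

m = classification_metrics(tp=45, tn=40, fp=10, fn=5)
print({k: round(v, 3) for k, v in m.items()})
# {'Acc': 0.85, 'P': 0.818, 'R': 0.9, 'F1': 0.857}
```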
Model comparison with deep learning and machine learning
In addition to evaluating the performance of WAGA, we compared it with several widely used baseline models, including traditional machine learning algorithms such as Support Vector Machine (SVM) and Logistic Regression, as well as deep learning models such as GRU, BiLSTM, and long short-term memory [46] (LSTM). All models were trained on WAGA1’s BPE-tokenized protein-coding gene sequences, using identical training, validation, and test sets and consistent hyperparameter settings to ensure a fair comparison. The evaluation metrics used for comparison were accuracy, precision, recall, and F1-score.
Discussion
Reliability of the sampling method
The quality and diversity of gene data determine the performance and reliability of deep learning models. Mao et al. (2023) used a sampling method that divides a species’ sequence data into N equal parts and then randomly samples half of the sequence data from each part; the data drawn from each part serve as a sampling section [47]. This method is suitable for improving sample diversity within a single-species dataset, but not for capturing evolutionary signals among the genes of multiple species. Therefore, in this study, homologous gene clusters were sampled from multiple species to generate a sufficient number of samples, enhancing the learning ability and generalization performance of the model while maintaining the integrity of gene sequence information.
First, the genomic data of the different fish were tokenized with BPE, so that at this stage the gene information of each fish was represented as tokenized data with consistent features. Then, groups were randomly sampled following the procedure shown in Fig. 7 (Data augmentation section), with each sampled group representing a cluster of genes with similar functions in different fish. This sampling method helps the model identify key genes that are conserved or have changed significantly during evolution. Finally, the BPE-tokenized data of the 128 fish species were randomly sampled at 14,000 groups per sample, thus expanding the data for each species. Sampling homologous genes captures both the diversity of genes and the conserved regions among different species, enriching the dimensions of the training data. Compared with sampling from partitions of a single species, this approach allows the model to better capture the macroscopic trends of genetic change during evolution, and the WAGA model results demonstrate the reliability of the homologous gene cluster-based sampling approach.
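The group-level sampling described above can be sketched as follows. The details are assumptions made for illustration: each species is taken to be a mapping from a homologous-group ID to its tokenized sequence, groups are drawn without replacement within a sample, and the function name, group IDs, and toy sizes are hypothetical (the paper samples 14,000 groups per sample).

```python
import random

def sample_groups(species_groups, n_groups, n_rounds, seed=0):
    """Draw `n_rounds` samples, each a random subset of `n_groups` homologous groups.

    species_groups: dict mapping group ID -> list of BPE tokens for one species.
    Returns a list of samples; each sample is a list of (group ID, tokens) pairs.
    """
    rng = random.Random(seed)
    group_ids = sorted(species_groups)
    samples = []
    for _ in range(n_rounds):
        chosen = rng.sample(group_ids, n_groups)    # without replacement
        samples.append([(g, species_groups[g]) for g in sorted(chosen)])
    return samples

# Toy species with 20 hypothetical groups, sampled 14 groups at a time.
toy_species = {f"grp{i:02d}": [f"tok{i}_a", f"tok{i}_b"] for i in range(20)}
samples = sample_groups(toy_species, n_groups=14, n_rounds=3)
print(len(samples), [len(s) for s in samples])   # 3 [14, 14, 14]
```

Because every sample keeps whole groups rather than fragments of sequences, each sampled group still carries the complete tokenized gene, which matches the stated goal of preserving the integrity of gene sequence information.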
Reliability of the key genes obtained by the WAGA
Previous studies have shown that freshwater and marine fishes have evolved significant adaptive differences in response to different aquatic environments, involving mechanisms such as ion transport, osmoregulation, cellular stress response, and immune response. These differences are reflected not only at the level of organ function but, more profoundly, at the levels of gene expression regulation and pathway functionality [5, 48, 49]. In addition, by comparing the deep-sea fish Pseudoliparis swirei with its shallow-water relative Liparis tanakae, Jiang et al. (2019) found that certain subfamilies of OR (olfactory receptor) genes were significantly reduced in the deep-sea species. This suggests that environmental pressure may drive the loss of gene function or adaptive modification [50].
Based on the results of the WAGA model, this study identified several key genes closely related to aquatic environmental adaptation in fish. Some of these genes have been confirmed in previous studies, validating the effectiveness and reliability of the proposed method. In terms of osmoregulation and ion balance, the cldn (claudins) gene family [51, 52] is primarily enriched in pathways such as response to stimulus (BP), cell adhesion molecules, and tight junction (KEGG). Tight junctions play a crucial role in regulating water and salt permeability, indicating their core function in maintaining salinity balance. The slc6a gene family (solute carrier 6) [53, 54] is significantly enriched in gamma-aminobutyric acid: sodium symporter activity (MF) and membrane part (CC), participating in neurotransmitter transport coupled with sodium ions, highlighting its key role in regulating intracellular and extracellular ion homeostasis. The anxa11 gene is enriched in calcium ion binding (MF) and cytokinetic process (BP), supporting its important role in calcium ion homeostasis regulation and cellular stress responses [5, 55]. In terms of visual adaptation, significant differences in light intensity, spectral composition, and water transparency between freshwater and seawater have imposed long-term selective pressure on the visual systems of fish [56]. The cry gene family [57] is significantly enriched in structural constituents of eye lens (MF) and visual perception (BP), indicating its important role in maintaining lens structure and visual function, and it may also be involved in regulating the circadian rhythms of fish.
In terms of olfactory function, the diversity of ions and chemical signals in different aquatic environments requires fish to have a refined chemical sensing system. The TAAR gene family (trace amine-associated receptors) [58, 59] is enriched in Neuroactive ligand-receptor interaction (KEGG) and cellular response to stimulus (BP), participating in the recognition of environmental chemical signals; the v2r gene family (vomeronasal type-2 receptors) [60] is enriched in cellular response to stimulus (BP) and signal transducer activity (MF), suggesting its role in mediating perception through signal transduction pathways; the OR gene family (olfactory receptors) [61] is enriched in odorant binding and olfactory receptor activity (MF), demonstrating its central role in odor recognition and olfactory regulation. In terms of nutrient metabolism and dietary adaptation, the chia gene family (chitinases) is enriched in GO terms such as organic substance catabolic process (BP), chitin binding, and chitinase activity (MF), indicating its important role in chitin degradation and digestive functions. Studies have found that fish stomachs contain various chitinases with differences in substrate specificity, suggesting that the chitinase genes have evolved to adapt to the dietary habits and ecological niches of different species [62, 63].
In addition to the known adaptation-related genes, this study also identified a group of potential novel genes that may be closely associated with fish physiological functions and their environmental adaptation mechanisms. These genes are predominantly enriched in key pathways such as immune regulation, muscle structure, signal transduction, and organ development, suggesting they may play important roles in adaptive physiological processes. For example, in terms of immune response and cellular homeostasis regulation, the litaf gene (lipopolysaccharide-induced TNF\(\alpha\) factor) [64], epdl1 (GO: lysosome), and tnfrsf14 (GO: cytokine production involved in immune response) may assist fish in maintaining cellular homeostasis and immune balance by regulating lysosomal function and inflammatory responses when facing varying microbial loads, salinity stress, or oxidative challenges in different aquatic environments. In terms of muscle structure and function, the csrp gene family (csrp1a, csrp2, csrp1b, csrp3) is primarily enriched in GO terms related to sarcomeric structures such as the Z disc, I band, and myofibril, participating in muscle fiber assembly and mechanical signal transduction. Previous studies have shown that csrp3 (MLP), through regulating the activation of tcap and synergizing with ILK, is essential for skeletal muscle mechanical stability [65]. However, it remains unclear whether the csrp family is related to adaptation to freshwater or marine environments; this gene family may be involved in fish adaptation to different water flow dynamics or mechanical loads. In terms of tissue development and regulation of cellular behavior, the slit gene is enriched in regulation of axonogenesis (BP) and regulates axon guidance of retinal ganglion cells through interaction with ROBO proteins, with slit2 playing a central role [66, 67]; the megf gene family is mainly enriched in the GO functions cation binding and metal ion binding. The proteins encoded by these genes contain EGF-like domains that can bind calcium and other metal ions, playing roles in regulating protein stability and transmembrane signal transduction, which may be crucial in sensing and responding to changes in ion concentrations in aquatic environments.
Additionally, genes associated with cardiovascular system development were significantly enriched. The tmem88a gene was enriched in the canonical Wnt signaling pathway involved in heart development (BP), playing a crucial regulatory role in cardiomyocyte differentiation [68]; her4 (Hairy-related 4, ErbB4) was enriched in protein dimerization activity (MF); as a transcriptional regulator, its dimerization function is vital for neural and cardiovascular development. Given the cardiovascular system’s essential role in ion transport, osmotic pressure regulation, and oxygen delivery in aquatic environments, these genes likely underpin the molecular basis for structural and functional adaptations of the circulatory system in fish inhabiting diverse water conditions. Furthermore, the Hamp gene was significantly enriched in pathways related to cellular iron ion homeostasis and overall iron balance, underscoring its role in maintaining iron equilibrium. Previous studies indicate that iron is critical for cell growth and hematopoiesis, regulated by hepcidin (Hamp) [69]. Notably, hepcidin expression is upregulated in response to infections by marine pathogens such as Streptococcus iniae and Aeromonas salmonicida [70]. These results not only deepen our understanding of the molecular mechanisms underlying aquatic adaptation in fish but also provide valuable genetic resources for subsequent functional validation and ecological adaptation studies.
Limitations and future directions
Despite the strong performance of the WAGA model in identifying key genes associated with environmental adaptation in freshwater and marine fish species, certain limitations remain. Notably, although the dataset included 22 euryhaline species, these were excluded from model training due to the lack of explicit environmental labels. Euryhaline fish, which are capable of tolerating a wide range of salinities, often exhibit distinct physiological and genetic adaptations. As such, the model’s generalizability to ecologically plastic species remains uncertain.
Future work will seek to incorporate euryhaline species into the model training and evaluation pipeline to assess WAGA’s ability to identify adaptive gene expression patterns across salinity gradients. In addition, we plan to explore domain adaptation and multi-domain learning strategies to improve the model’s robustness and generalization across diverse aquatic environments. These advancements will further expand the applicability of WAGA and, when integrated with gene-editing and behavioral ecology approaches, help elucidate the functional roles of key genes in adaptive evolution. Ultimately, this work will provide a scientific foundation for fish conservation and sustainable utilization.
Conclusion
In this study, we designed a gene mining method called WAGA, which combines the molecular mechanisms of fish with their ecological adaptations and models them as a computational classification problem. WAGA integrates BPE segmentation, word2vec word embedding, and deep learning, and applies deep learning to mine protein-coding genes in freshwater and marine fish for the first time, successfully identifying key genes related to sensory function, osmotic regulation, and growth and development. This method reveals the environmental adaptation mechanisms of fish and supports targeted breeding to enhance economic, social, and ecological benefits. These results are highly important for the advancement of aquaculture science and the development of the aquaculture industry, and provide key clues for an in-depth understanding of the adaptive evolution of fish in response to different environmental pressures.
Data availability
The dataset used in this study consists of protein-coding gene data for Actinopterygii, downloaded from the National Center for Biotechnology Information (NCBI) database. Specific information numbers and download details are provided in Supplementary Material 1. All source code used in this study is publicly available on GitHub: https://github.com/songpingqian/WAGA. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Materials availability
Materials are available upon request to Y.C.([email protected]).
References
DeLaurier A. Evolution and development of the fish jaw skeleton. WIREs Dev Biol. 2019;8(2):e337. https://doi.org/10.1002/wdev.337.
Nelson JS, Grande TC, Wilson MV. Fishes of the World. Hoboken: Wiley; 2016.
Fricke R, Eschmeyer WN, Van der Laan R, editors. Eschmeyer’s catalog of fishes: genera, species, references. California Academy of Sciences: San Francisco; 2025.
Friedman M, Sallan LC. Five hundred million years of extinction and recovery: a phanerozoic survey of large-scale diversity patterns in fishes. Palaeontology. 2012;55(4):707–42. https://doi.org/10.1111/j.1475-4983.2012.01165.x.
Evans DH, Piermarini PM, Choe KP. The multifunctional fish gill: dominant site of gas exchange, osmoregulation, acid-base regulation, and excretion of nitrogenous waste. Physiol Rev. 2005;85(1):97–177. https://doi.org/10.1152/physrev.00050.2003.
Shu D, Morris S, Han J, et al. Head and backbone of the early Cambrian vertebrate Haikouichthys. Nature. 2003;421:526–9. https://doi.org/10.1038/nature01264.
Shu D, Luo HL, Conway Morris S, et al. Lower Cambrian vertebrates from south China. Nature. 1999;402:42–6. https://doi.org/10.1038/46965.
Pinheiro H, Bernardi G, Simon T, et al. Island biogeography of marine organisms. Nature. 2017;549:82–5. https://doi.org/10.1038/nature23680.
Xu L, Feiner ZS, Frater P, et al. Asymmetric impacts of climate change on thermal habitat suitability for inland lake fishes. Nat Commun. 2024;15:10273. https://doi.org/10.1038/s41467-024-54533-2.
Chen X, Liu S, Ding Q, Teame T, Yang Y, Ran C, et al. Research advances in the structure, function, and regulation of the gill barrier in teleost fish. Water Biology and Security. 2023;2(2):100139. https://doi.org/10.1016/j.watbs.2023.100139.
Lobato FL, Barneche DR, Siqueira AC, Liedke AMR, Lindner A, Pie MR, et al. Diet and Diversification in the Evolution of Coral Reef Fishes. PLoS ONE. 2014;9(7):1–11. https://doi.org/10.1371/journal.pone.0102094.
Helfman GS, Collette BB, Facey DE. The Diversity of Fishes: Biology, Evolution, and Ecology. Hoboken: Wiley; 2009.
Zhou Q, Wang J, Li J, et al. Decoding the fish genome opens a new era in important trait research and molecular breeding in China. Sci China Life Sci. 2024;67:2064–83. https://doi.org/10.1007/s11427-023-2670-5.
Ahmad SF, Jehangir M, Srikulnath K, et al. Fish genomics and its impact on fundamental and applied research of vertebrate biology. Rev Fish Biol Fisheries. 2022;32:357–85. https://doi.org/10.1007/s11160-021-09691-7.
Riley LG, Hirano T, Grau EG. Effects of transfer from seawater to fresh water on the growth hormone/insulin-like growth factor-I axis and prolactin in the Tilapia, Oreochromis mossambicus. Comp Biochem Physiol B: Biochem Mol Biol. 2003;136(4):647–55. https://doi.org/10.1016/S1096-4959(03)00246-X.
Pierce AL, Fox BK, Davis LK, Visitacion N, Kitahashi T, Hirano T, et al. Prolactin receptor, growth hormone receptor, and putative somatolactin receptor in Mozambique tilapia: tissue specific expression and differential regulation by salinity and fasting. Gen Comp Endocrinol. 2007;154(1):31–40. https://doi.org/10.1016/j.ygcen.2007.06.023.
Sakamoto T, Hirano T. Expression of insulin-like growth factor I gene in osmoregulatory organs during seawater adaptation of the salmonid fish: possible mode of osmoregulatory action of growth hormone. Proc Natl Acad Sci. 1993;90(5):1912–6. https://doi.org/10.1073/pnas.90.5.1912.
Tomy S, Chang YM, Chen YH, Cao JC, Wang TP, Chang CF. Salinity effects on the expression of osmoregulatory genes in the euryhaline black porgy Acanthopagrus schlegeli. Gen Comp Endocrinol. 2009;161(1):123–32. https://doi.org/10.1016/j.ygcen.2008.12.003.
Takahashi H, Sakamoto T, Hyodo S, Shepherd BS, Kaneko T, Grau EG. Expression of glucocorticoid receptor in the intestine of a euryhaline teleost, the Mozambique tilapia (Oreochromis mossambicus): effect of seawater exposure and cortisol treatment. Life Sci. 2006;78(20):2329–35. https://doi.org/10.1016/j.lfs.2005.09.050.
Kiilerich P, Tipsmark CK, Borski RJ, Madsen SS. Differential effects of cortisol and 11-deoxycorticosterone on ion transport protein mRNA levels in gills of two euryhaline teleosts, Mozambique tilapia (Oreochromis mossambicus) and striped bass (Morone saxatilis). J Endocrinol. 2011;209(1):115–26. https://doi.org/10.1530/JOE-10-0326.
Deal CK, Volkoff H. The Role of the Thyroid Axis in Fish. Front Endocrinol. 2020;11. https://doi.org/10.3389/fendo.2020.596585.
Manzon LA. The role of prolactin in fish osmoregulation: a review. Gen Comp Endocrinol. 2002;125(2):291–310. https://doi.org/10.1006/gcen.2001.7746.
Yue GH, Ma KY, Xia JH. Status of conventional and molecular breeding of salinity-tolerant tilapia. Rev Aquacult. 2024;16(1):271–86. https://doi.org/10.1111/raq.12838.
Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(8):389–403. https://doi.org/10.1038/s41576-019-0122-6.
Li H, Chen W, Qi W, et al. Molecular characterization of a novel Spiruromorpha species in wild Chinese pangolin by mitogenome sequence analysis. Parasitol Res. 2024;123:137. https://doi.org/10.1007/s00436-024-08143-y.
Pavithra M, Parvathi RMS. Optimizing minimum spanning tree using stochastic-Variable neighborhood search for efficient clustering of cancer gene data. Concurr Comput Pract Experience. 2023;35(5):e7573. https://doi.org/10.1002/cpe.7573.
Pirmoradi S, Hosseiniyan Khatibi SM, Zununi Vahed S, et al. Unraveling the link between PTBP1 and severe asthma through machine learning and association rule mining method. Sci Rep. 2023;13:15399. https://doi.org/10.1038/s41598-023-42581-5.
Wang X, Li J, Zhou Y, Chen Y, Liu T, Chen S. Tumor origin identification through machine learning and gene expression profiling. J Clin Oncol. 2024;42(16_suppl):e13597. https://doi.org/10.1200/JCO.2024.42.16_suppl.e13597.
Avramouli A, Krokidis MG, Exarchos TP, Vlamos P. Protein Structure Prediction for Disease-Related Insertions/Deletions in Presenilin 1 Gene. In: Vlamos P, editor. Cham: Springer International Publishing; 2023.
Manouchehri L, Zinati Z, Nazari L. Population-specific gene expression profiles in prostate cancer: insights from weighted gene co-expression network analysis (WGCNA). World J Surg Oncol. 2024. https://doi.org/10.1186/s12957-024-03459-6.
Abbas Q, Wilhelm M, Kuster B, et al. Exploring crop genomes: assembly features, gene prediction accuracy, and implications for proteomics studies. BMC Genomics. 2024;25:619. https://doi.org/10.1186/s12864-024-10521-w.
Hayman GT, Laulederkind SJF, Smith JR, Wang SJ, Petri V, Nigam R, et al. The disease portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database. Database. 2016;2016:baw034. https://doi.org/10.1093/database/baw034.
Davidson PL, Moczek AP. Genome evolution and divergence in cis-regulatory architecture is associated with condition-responsive development in horned dung beetles. PLoS Genet. 2024;20(3):1–21. https://doi.org/10.1371/journal.pgen.1011165.
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, et al. Deep learning for genomics: A concise overview. 2018. arXiv preprint arXiv:1802.00810. https://doi.org/10.48550/arXiv.1802.00810
Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–9. https://doi.org/10.1093/bioinformatics/btw074.
Bryant P. Deep learning for protein complex structure prediction. Curr Opin Struct Biol. 2023;79:102529. https://doi.org/10.1016/j.sbi.2023.102529.
Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE. 2015;10(11):1–15. https://doi.org/10.1371/journal.pone.0141287.
Geng Q, Yang R, Zhang L. A deep learning framework for enhancer prediction using word embedding and sequence generation. Biophys Chem. 2022;286:106822. https://doi.org/10.1016/j.bpc.2022.106822.
Urda D, Montes-Torres J, Moreno F, Franco L, Jerez JM. Deep Learning to Analyze RNA-Seq Gene Expression Data. In: Rojas I, Joya G, Catala A, editors. Advances in Computational Intelligence. Cham: Springer International Publishing; 2017. p. 50–9.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. NIPS’17. Red Hook: Curran Associates Inc.; 2017. pp. 6000–10.
Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinforma. 2016;18(5):851–69. https://doi.org/10.1093/bib/bbw068.
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics. 2022;16:26. https://doi.org/10.1186/s40246-022-00396-x.
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, et al. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024. https://doi.org/10.1016/j.csbj.2024.05.025.
Mikolov T. Efficient estimation of word representations in vector space. 2013;3781. arXiv preprint arXiv:1301.3781.
Zhang T, Jia J, Chen C, Zhang Y, Yu B. BiGRUD-SA: protein S-sulfenylation sites prediction based on BiGRU and self-attention. Comput Biol Med. 2023;163:107145. https://doi.org/10.1016/j.compbiomed.2023.107145.
Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9(8):1735–80.
Mao J, Cao Y, Zhang Y, et al. A novel method for identifying key genes in macroevolution based on deep learning with attention mechanism. Sci Rep. 2023;13:19727. https://doi.org/10.1038/s41598-023-47113-9.
Hwang PP, Lee TH. New insights into fish ion regulation and mitochondrion-rich cells. Comp Biochem Physiol A Mol Integr Physiol. 2007;148(3):479–97. https://doi.org/10.1016/j.cbpa.2007.06.416.
Kültz D, Podrabsky JE, Stillman JH, Tomanek L. Physiological mechanisms used by fish to cope with salinity stress. J Exp Biol. 2015;218(12):1907–14.
Jiang H, Du K, Gan X, Yang L, He S. Massive loss of olfactory receptors but not trace amine-associated receptors in the world’s deepest-living fish (Pseudoliparis swirei). Genes. 2019. https://doi.org/10.3390/genes10110910.
Marshall WS, Breves JP, Doohan EM, Tipsmark CK, Kelly SP, Robertson GN, et al. claudin-10 isoform expression and cation selectivity change with salinity in salt-secreting epithelia of Fundulus heteroclitus. J Exp Biol. 2018;222(1):jeb168906. https://doi.org/10.1242/jeb.168906.
Kolosov D, Bui P, Wilkie MP, Kelly SP. Claudins of sea lamprey (Petromyzon marinus) – organ-specific expression and transcriptional responses to water of varying ion content. J Fish Biol. 2020;96(3):768–81. https://doi.org/10.1111/jfb.14274.
Jomura R, Akanuma S-I, Tachikawa M, Hosoya K-I. SLC6A and SLC16A family of transporters: contribution to transport of creatine and creatine precursors in creatine biosynthesis and distribution. Biochimica et Biophysica Acta (BBA). 2022;1864(3):183840. https://doi.org/10.1016/j.bbamem.2021.183840.
Pramod AB, Foster J, Carvelli L, Henry LK. Slc6 transporters: structure, function, regulation, disease association and therapeutics. Mol Aspects Med. 2013;34(2):197–219. https://doi.org/10.1016/j.mam.2012.07.002.
Hubbard PC, Ingleton PM, Bendell LA, Barata EN, Canário AVM. Olfactory sensitivity to changes in environmental [Ca2+] in the freshwater teleost Carassius auratus: an olfactory role for the Ca2+-sensing receptor? J Exp Biol. 2002;205(18):2755–64. https://doi.org/10.1242/jeb.205.18.2755.
Eilertsen M, Davies WIL, Patel D, Barnes JE, Karlsen R, Mountford JK, et al. An evodevo study of salmonid visual opsin dynamics and photopigment spectral sensitivity. Front Neuroanat. 2022. https://doi.org/10.3389/fnana.2022.945344.
Mei Q, Sadovy Y, Dvornyk V. Molecular evolution of cryptochromes in fishes. Gene. 2015;574(1):112–20. https://doi.org/10.1016/j.gene.2015.07.086.
Hussain A, Saraiva LR, Korsching SI. Positive Darwinian selection and the birth of an olfactory receptor clade in teleosts. Proc Natl Acad Sci. 2009;106(11):4313–8. https://doi.org/10.1073/pnas.0803229106.
Policarpo M, Bemis KE, Laurenti P, et al. Coevolution of the olfactory organ and its receptor repertoire in ray-finned fishes. BMC Biol. 2022;20:195. https://doi.org/10.1186/s12915-022-01397-x.
Johnstone KA, Lubieniecki KP, Koop BF, Davidson WS. Expression of olfactory receptors in different life stages and life histories of wild Atlantic salmon (Salmo salar). Mol Ecol. 2011;20(19):4059–69. https://doi.org/10.1111/j.1365-294X.2011.05251.x.
Hu J, Wang Y, Le Q, Yu N, Cao X, Kuang S, et al. Transcriptome sequencing of olfactory-related genes in olfactory transduction of large yellow croaker (Larimichthys crocea) in response to bile salts. PeerJ. 2019;7:e6627. https://doi.org/10.7717/peerj.6627.
Ikeda M, Kakizaki H, Matsumiya M. Biochemistry of fish stomach chitinase. Int J Biol Macromol. 2017;104:1672–81. https://doi.org/10.1016/j.ijbiomac.2017.03.118.
Holen MM, Vaaje-Kolstad G, Kent MP, Sandve SR. Gene family expansion and functional diversification of chitinase and chitin synthase genes in Atlantic salmon (Salmo salar). G3: Genes, Genomes, Genetics. 2023;13(6):jkad069. https://doi.org/10.1093/g3journal/jkad069.
Stefani C, Bruchez AM, Rosasco MG, Yoshida AE, Fasano KJ, Levan PF, et al. LITAF protects against pore-forming protein-induced cell death by promoting membrane repair. Sci Immunol. 2024;9(91):eabq6541.
Chang Y, Geng F, Hu Y, Ding Y, Zhang R. Zebrafish cysteine and glycine-rich protein 3 is essential for mechanical stability in skeletal muscles. Biochem Biophys Res Commun. 2019;511(3):604–11. https://doi.org/10.1016/j.bbrc.2019.02.115.
Davison C, Zolessi FR. Slit2 is necessary for optic axon organization in the zebrafish ventral midline. Cells Dev. 2021;166:203677. https://doi.org/10.1016/j.cdev.2021.203677.
Dickinson RE, Duncan WC. The SLIT–ROBO pathway: a regulator of cell function with implications for the reproductive system. Reproduction. 2010;139(4):697–704. https://doi.org/10.1530/REP-10-0017.
Palpant NJ, Pabon L, Rabinowitz JS, Hadland BK, Stoick-Cooper CL, Paige SL, et al. Transmembrane protein 88: a Wnt regulatory protein that specifies cardiomyocyte development. Development. 2013;140(18):3799–808. https://doi.org/10.1242/dev.094789.
Yang W, Peng M, Wang Y, Zhang X, Li W, Zhai X, et al. Deletion of hepcidin disrupts iron homeostasis and hematopoiesis in zebrafish embryogenesis. Development. 2025;152(7):dev204307. https://doi.org/10.1242/dev.204307.
Chen SL, Li W, Meng L, Sha ZX, Wang ZJ, Ren GC. Molecular cloning and expression analysis of a hepcidin antimicrobial peptide gene from turbot (Scophthalmus maximus). Fish Shellfish Immunol. 2007;22(3):172–81. https://doi.org/10.1016/j.fsi.2006.04.004.
© 2025. This work is licensed under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).