Definer: A computational method for accurate identification of RNA pseudouridine sites based on deep learning

Abstract

Pseudouridine is an important RNA modification that is widely present in a variety of non-coding RNAs and is involved in many biological processes. Studies have shown that pseudouridine plays a role in gene expression, RNA structural stability, and various diseases. Accurate identification of pseudouridine sites can therefore help explain the functional mechanism of this modification. With the rapid growth of genomics data, traditional experimental methods for identifying RNA modification sites can no longer meet practical needs, and computational methods are required to accurately identify pseudouridine sites from high-throughput RNA sequence data. In this study, we propose a deep learning-based computational method, Definer, to accurately identify RNA pseudouridine sites in three species: Homo sapiens, Saccharomyces cerevisiae, and Mus musculus. The method combines two sequence encoding schemes, NCP and One-hot, and feeds the extracted RNA sequence features into a deep learning model built from CNN, GRU, and Attention components. On the benchmark datasets for the three species, 10-fold cross-validation shows that Definer significantly outperforms existing methods. Independent tests on the H. sapiens and S. cerevisiae datasets further demonstrate the predictive ability of the model. In summary, Definer can accurately identify pseudouridine modification sites in RNA.

1. Introduction

RNA modification is an important component of gene regulation and is involved in various biological processes [1,2]. To date, over 150 types of RNA modifications have been discovered [3,4]. Among them, pseudouridine (Ψ) is the earliest discovered and most abundant RNA modification, found in various types of RNA, including mRNA, tRNA, and snRNA [5,6]. The most common RNA modification processes are pseudouridylation and methylation [7]. Studies have shown that pseudouridine can change the secondary and tertiary structure of RNA and affect the rate of gene expression, and that mutations affecting pseudouridine modification are associated with various diseases, such as Parkinson’s disease, congenital keratinization disorder, and myelodysplastic syndrome keratosis [8,9]. Therefore, studying pseudouridine modification sites is of great significance for both biology and medicine [10,11]. With the advent of the post-genomic era, the amount of genomic data has increased rapidly, and traditional biological experiments can no longer meet actual research needs [12]. It is therefore necessary to develop more convenient computational models that extract information on pseudouridine sites and identify them accurately [13,14].

Many computational methods based on machine learning and deep learning have been developed for predicting pseudouridine sites in three species: H. sapiens, S. cerevisiae, and M. musculus. Li et al. [15] constructed the first prediction model for pseudouridine sites, PPUS, an SVM-based web server for predicting sites in H. sapiens and S. cerevisiae. Chen et al. [16] constructed datasets for H. sapiens, S. cerevisiae, and M. musculus and combined the PseDNC encoding method with SVM to build the iRNA-PseU prediction model. These datasets were reused in subsequent research. He et al. [17] proposed PseUI, which combines five encoding methods with the SVM algorithm and further improves the accuracy of pseudouridine site recognition by applying sequential forward feature selection. Subsequently, Tahir et al. [18] used the One-hot encoding method and a two-layer convolutional neural network to develop iPseU-CNN. Liu et al. [19] developed XG-PseU, a prediction method based on eXtreme Gradient Boosting (XGBoost). Bi et al. [20] adopted an ensemble learning approach that integrates five different machine learning classifiers to build the EnsemPseU predictor. Lv et al. [21] combined the random forest algorithm with the light gradient boosting machine algorithm and an incremental feature selection strategy to build a new predictor, RF-PseU, which improved prediction performance. Khan et al. [22] constructed MU-PseUDeep by combining the original sequence and secondary structure with a convolutional neural network, further improving the performance of pseudouridine site prediction. Song et al. [23,24] constructed the PIANO and PSI-MOUSE predictors based on genomic and sequence features. This was the first time that genome-derived features were introduced, and they achieved good performance in predicting pseudouridine sites. Li et al. [25] developed the Porpoise predictor based on a stacked ensemble learning method and four feature selection methods. Table 1 summarizes the existing pseudouridine site predictors, including their benchmark datasets, feature extraction methods, classifiers, performance evaluation, and web servers. Most of these computational methods cover all three species, H. sapiens, S. cerevisiae, and M. musculus; only PPUS is limited to H. sapiens and S. cerevisiae.

[Figure omitted. See PDF.]

Although previous studies have made significant contributions and provided a foundation for subsequent research, there is still considerable room for improvement in the prediction performance of existing methods [26,27]. Developing better prediction methods will enable a more comprehensive understanding of the relationship between RNA sequences and life activities. Existing prediction tools mostly rely on a single feature extraction algorithm and traditional machine learning algorithms [28,29]. Because biological sequences exhibit extremely complex sequence features, traditional machine learning methods struggle to achieve better prediction performance. Deep learning algorithms have strong learning and generalization abilities and good modeling capabilities [30]. Therefore, it is necessary to improve prediction performance by enriching the sequence features and developing more suitable classification algorithms [31].

Based on the above issues, this paper proposes a deep learning-based computational method, Definer, to identify pseudouridine (Ψ) sites in three species, including H. sapiens, S. cerevisiae, and M. musculus. Firstly, we combined the One-hot and NCP feature encoding schemes to extract RNA sequence information. Secondly, we constructed Ψ site prediction models based on three deep learning models: convolutional neural network (CNN), gated recurrent unit (GRU), and attention mechanism. Finally, ten-fold cross-validation and independent testing showed that, compared with state-of-the-art methods, Definer significantly improved the prediction performance on Ψ site identification in all three species.

2. Materials and methods

2.1. Overall framework

The experimental design and performance evaluation process of this study is shown in Fig 1. It includes five main steps: data collection, feature extraction, model construction, performance evaluation, and visualization software development [32]. First, benchmark datasets and independent test sets for three species, H. sapiens, S. cerevisiae, and M. musculus, were collected from relevant literature and public databases [24]. Two feature extraction methods were then employed to extract sequence information from the datasets. Subsequently, a deep learning-based predictor, Definer, was constructed, which achieved good performance on all three species. Furthermore, we evaluated Definer against several existing methods and found that its prediction performance was significantly improved. Finally, we developed a software tool and made it publicly available for users online.

[Figure omitted. See PDF.]

2.2. Benchmark data sets

To facilitate comparison with existing methods, we used the datasets constructed by Chen et al. [16], which are used by most existing prediction methods, such as iRNA-PseU [16], PseUI [17], iPseU-CNN [18], XG-PseU [19], and Porpoise [25]. The datasets for the three species were obtained from the RMBase database [33] and comprise three benchmark datasets, H. sapiens (H_990), S. cerevisiae (S_628), and M. musculus (M_944), which were used for model training, and two independent test sets, which cover only H. sapiens (H_200) and S. cerevisiae (S_200). The details of the datasets are shown in Table 2.

[Figure omitted. See PDF.]

2.3. Feature extraction

Feature extraction is an important step in building a prediction model, which aims to encode RNA sequence fragments containing only four nucleotides, adenine (A), cytosine (C), guanine (G), and uracil (U), into digitized feature vectors [34]. The way of extracting input data has a great impact on the model. Only by choosing a suitable feature extraction method according to specific conditions can better training results be achieved. Efficient feature extraction methods can effectively extract more representative feature vectors and provide strong support for subsequent model construction [35]. In this study, we used two feature extraction methods, including One-hot encoding and nucleotide chemical properties (NCP). Brief introductions of these two feature extraction methods are presented below.

2.3.1. One-hot encoding.

One-hot encoding is a binary encoding method and one of the basic feature representation methods for RNA sequences. The basic idea of one-hot encoding is to convert each base in the sequence into a four-dimensional binary vector, where only one dimension is 1 and the others are 0. The four nucleotides A, U, C, and G will be respectively converted into vectors (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1) [36].
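As a minimal sketch of the mapping just described, the following Python snippet converts an RNA sequence into a list of four-dimensional binary vectors. The function name `one_hot_encode` and the error handling for unexpected characters are illustrative assumptions, not part of the published method.

```python
# Mapping stated in the text: A, U, C, G map to unit vectors in that order.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "U": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Return one 4-dimensional binary vector per nucleotide.

    Hypothetical helper: rejecting non-AUCG characters is an assumption
    made here for safety; the paper does not discuss such cases.
    """
    encoded = []
    for base in sequence.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        encoded.append(ONE_HOT[base])
    return encoded


if __name__ == "__main__":
    for base, vector in zip("GAUC", one_hot_encode("GAUC")):
        print(base, vector)
```

A sequence of length L therefore becomes an L × 4 matrix in which every row contains exactly one 1.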

2.3.2. NCP.

An RNA sequence is composed of four nucleotides: adenine (A), cytosine (C), guanine (G), and uracil (U). These nucleotides have different structures and chemical properties. Nucleotide Chemical Property (NCP) encodes RNA sequences by three different chemical properties: cyclic structure, hydrogen bonding, and chemical functionality [37]. Regarding cyclic structure, A and G are purines with two rings, while C and U are pyrimidines with one ring. Concerning hydrogen bonding, A and U form two hydrogen bonds during hybridization, while G and C form three hydrogen bonds [38]. Regarding chemical functionality, A and C contain an amino group, while G and U contain a keto group. Based on these three chemical properties, each nucleotide in an RNA sequence can be represented by a vector (X, Y, Z), where X represents cyclic structure, Y represents hydrogen bonding, and Z represents chemical functionality. The feature representation method of NCP is shown in equation (1).

(1)
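As a minimal sketch of the NCP scheme described above, the following Python snippet builds an (X, Y, Z) vector for each nucleotide from the three properties. The specific 0/1 assignment used here (1 for purines, 1 for bases that form two hydrogen bonds, 1 for bases with an amino group) is an assumption chosen for illustration and may differ from the values in equation (1); the function name `ncp_encode` is likewise hypothetical.

```python
# Property groups taken from the text.
PURINES = {"A", "G"}         # two rings (C and U have one ring)
TWO_H_BONDS = {"A", "U"}     # two hydrogen bonds (G and C form three)
AMINO = {"A", "C"}           # amino group (G and U carry a keto group)


def ncp_encode(sequence):
    """Return one (X, Y, Z) tuple per nucleotide.

    X: cyclic structure, Y: hydrogen bonding, Z: chemical functionality.
    ASSUMPTION: which group receives 1 and which receives 0 is chosen
    here for illustration only.
    """
    vectors = []
    for base in sequence.upper():
        if base not in "ACGU":
            raise ValueError(f"unexpected nucleotide: {base!r}")
        x = 1 if base in PURINES else 0
        y = 1 if base in TWO_H_BONDS else 0
        z = 1 if base in AMINO else 0
        vectors.append((x, y, z))
    return vectors


if __name__ == "__main__":
    for base, vector in zip("ACGU", ncp_encode("ACGU")):
        print(base, vector)
```

Under this assumed assignment the four nucleotides receive four distinct vectors, so the encoding loses no sequence information.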

2.4. Deep learning model framework

This study aims to construct a prediction model for pseudouridine sites in RNA based on three classical deep learning models: convolutional neural network (CNN), gated recurrent unit (GRU), and attention. Firstly, the input data is processed through the first convolution layer of the CNN, which performs cross-correlation operations on the matrix of each channel from left to right and top to bottom using convolution kernels. Then, the obtained data is regularized to prevent overfitting, as shown in equations (2) and (3).

(2)(3)

Next, the data and parameters are compressed by the CNN pooling layer, and the compressed data is fed into the second CNN layer. At the same time, the feature tensor is fused with the GRU output, and the ReLU activation function is used to accelerate the convergence of the model, as shown in equation (4). In the GRU, the update gate controls how much of the previous hidden state is retained and how much new information is admitted at each time step, while the reset gate controls how much of the hidden state from time t-1 contributes to the new candidate state [39].

(4)
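To make the roles of the update and reset gates concrete, the following is a minimal sketch of a single GRU time step, written for scalar inputs and hidden states so that it needs only the standard library. It follows one common textbook formulation of the GRU and is not claimed to match equation (4) or the authors' implementation; the parameter names (`w_z`, `u_z`, and so on) are hypothetical.

```python
import math


def sigmoid(value):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def gru_step(x_t, h_prev, p):
    """One GRU step for a scalar input x_t and scalar hidden state h_prev.

    p is a dict of hypothetical scalar weights and biases.
    """
    # Update gate: balance between the old state and the new candidate.
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])
    # Reset gate: how much of h_{t-1} feeds the candidate state.
    r = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])
    # Candidate state built from the input and the reset-scaled old state.
    h_cand = math.tanh(p["w_h"] * x_t + p["u_h"] * (r * h_prev) + p["b_h"])
    # Convention used here: z close to 1 favours the new candidate.
    return (1.0 - z) * h_prev + z * h_cand


if __name__ == "__main__":
    params = {"w_z": 0.5, "u_z": 0.1, "b_z": 0.0,
              "w_r": 0.3, "u_r": 0.2, "b_r": 0.0,
              "w_h": 1.0, "u_h": 0.7, "b_h": 0.0}
    h = 0.0
    for x in [1.0, 0.0, 1.0, 1.0]:
        h = gru_step(x, h, params)
        print(round(h, 4))
```

Some libraries swap the roles of z and 1 - z in the final interpolation; the two conventions are equivalent up to relabeling the gate.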

Then, following the same method of combining the second layer convolution of CNN with GRU, the feature tensors of the third layer convolution of CNN and GRU are fused. Finally, to focus on important information and fully absorb it, an Attention mechanism is added to the model.
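The text does not specify which attention variant is used, so the following is only a generic sketch of attention pooling: each position's feature vector is scored against a query vector, the scores are normalized with a softmax, and the weighted sum forms a single context vector. The dot-product scoring and the names `attention_pool` and `query` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Return (weights, context) for a list of equally sized vectors.

    hidden_states: one feature vector per sequence position.
    query: hypothetical learned vector of the same dimension.
    """
    # Dot-product score between each position and the query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the position vectors.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    weights, context = attention_pool(states, query=[2.0, 0.0])
    print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Positions whose features align with the query receive larger weights, which is how the mechanism emphasizes the most informative regions of the sequence.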

2.5. Performance evaluations

Model evaluation is an important step in verifying the feasibility of a model, and there are three commonly used methods: K-fold cross-validation, independent testing, and overlapping checking. To facilitate comparison with existing methods and better demonstrate the effectiveness of the proposed method, we used the first two: 10-fold cross-validation on the training datasets and independent tests on the test datasets. For a classification task, accuracy (ACC) is the most basic evaluation index and represents the percentage of correctly classified samples. However, accuracy alone often cannot fully reflect a model’s performance and may lead to poor judgments. This study therefore uses four evaluation indicators, specificity (Sp), sensitivity (Sn), accuracy (ACC), and the Matthews correlation coefficient (MCC), to evaluate the predictive model [40,41]. The calculation formulas for the four evaluation indicators are shown below.

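$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{5}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{6}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{8}$$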

In these formulas, TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively.
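As a worked illustration, the four indicators can be computed from confusion-matrix counts using their standard definitions. The function name `classification_metrics`, the zero-denominator handling, and the example counts below are illustrative assumptions and are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts.

    ASSUMPTION: a metric with a zero denominator is reported as 0.0.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0      # sensitivity
    sp = tn / (tn + fp) if (tn + fp) else 0.0      # specificity
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0      # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts for a balanced set of 200 samples.
    for name, value in classification_metrics(80, 10, 90, 20).items():
        print(f"{name}: {value:.4f}")
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, so it stays informative when one class is predicted much better than the other.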

3. Results and discussion

3.1. Distribution of nucleotide positions at pseudouridine sites

To analyze the characteristics of pseudouridine sites in RNA sequences, we used Two Sample Logo [42] to calculate the importance of nucleotides at each position. Two Sample Logo is a tool for computing differences between nucleotide samples and the significance of nucleotides at each position in a sequence. The nucleotide distributions of pseudouridine sites in the H. sapiens, S. cerevisiae, and M. musculus species are shown in Fig 2a, 2b, and 2c, respectively. The size of each letter represents the frequency of the corresponding base at that position, with larger letters indicating higher frequencies. At each position, the letters are arranged in order of dominance from top to bottom, with the most dominant base at the top. From the figures, it can be seen that in H. sapiens, uridine (U) content is highest near the central position 11, while cytidine (C) is mainly distributed at downstream positions 17 and 20. In S. cerevisiae, guanine (G) is mostly distributed in the upper-middle region, while uridine (U) is distributed at the central positions 14, 15, and 16. Adenine (A) is mainly distributed in the upper-middle region, with three consecutive A bases at positions 13, 14, and 15. In M. musculus, uridine (U) is distributed in the upper-middle region, with U bases at positions 9, 11, 12, and 13. These results indicate that there are different nucleotide distribution patterns between pseudouridine and non-pseudouridine sites in the H. sapiens, S. cerevisiae, and M. musculus species, and therefore, it is necessary to establish a universal prediction model across different species.
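The idea behind this kind of positional analysis can be sketched by comparing per-position nucleotide frequencies between positive and negative samples. The snippet below is a simplified stand-in and not the Two Sample Logo tool itself, which additionally applies statistical significance testing; the function names and the toy sequences are hypothetical.

```python
from collections import Counter

BASES = "ACGU"


def position_frequencies(sequences):
    """Per-position frequency of each base across equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        freqs.append({b: counts.get(b, 0) / len(sequences) for b in BASES})
    return freqs


def frequency_difference(positives, negatives):
    """Positive-minus-negative frequency for every base at every position."""
    pos_f = position_frequencies(positives)
    neg_f = position_frequencies(negatives)
    return [{b: p[b] - n[b] for b in BASES} for p, n in zip(pos_f, neg_f)]


if __name__ == "__main__":
    # Toy data only; real samples are fixed-length windows around a site.
    positives = ["AUUGC", "AUUGA", "CUUGC"]
    negatives = ["GCUAC", "AGUCC", "CCUGA"]
    for index, diff in enumerate(frequency_difference(positives, negatives), 1):
        enriched = max(diff, key=diff.get)
        print(index, enriched, round(diff[enriched], 3))
```

A positive difference marks a base that is enriched around modified sites at that position, and a negative difference marks one that is depleted.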

[Figure omitted. See PDF.]

3.2. Performance comparison analysis of different feature extraction methods

An efficient feature extraction method can extract more representative feature vectors and provide strong support for subsequent model construction. In this section, we compared One-hot, NCP, and their fusion by feeding each of the three feature representations into the predictor. The comparative results for the three species, H. sapiens, S. cerevisiae, and M. musculus, are shown in Tables 3–5, respectively.

[Figure omitted. See PDF.]

Tables 3–5 show that combining the two encoding methods yields better overall results on all three datasets than using either feature extraction method alone. On the H_990 dataset, the fusion of One-hot and NCP surpasses both individual methods on all four evaluation metrics, reaching an accuracy of 82.95%, which is 1.93% higher than that of One-hot alone. On the S_628 and M_944 datasets, a single feature extraction method achieves a slightly higher Sn than the fusion, but the fusion performs better on the other three evaluation metrics, with ACC values of 86.01% and 87.15%, respectively. Based on these observations, we use the fusion of One-hot and NCP to extract sequence feature information for pseudouridine sites.
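One simple way to fuse the two encodings is to concatenate them position by position, giving a seven-dimensional vector per nucleotide. The text does not state how the fusion is performed, so the concatenation below, the NCP 0/1 assignment, and the function name `fused_encode` are all assumptions made for illustration.

```python
# One-hot mapping as given in Section 2.3.1.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "U": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}

# NCP as (ring, hydrogen bonding, functionality).
# ASSUMPTION: 1 = purine, 1 = two hydrogen bonds, 1 = amino group.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 0, 1),
    "G": (1, 0, 0),
    "U": (0, 1, 0),
}


def fused_encode(sequence):
    """Concatenate One-hot (4 values) and NCP (3 values) per position.

    Raises KeyError for characters other than A, C, G and U.
    """
    return [ONE_HOT[base] + NCP[base] for base in sequence.upper()]


if __name__ == "__main__":
    matrix = fused_encode("GAUCU")
    print(len(matrix), "positions x", len(matrix[0]), "features")
    for row in matrix:
        print(row)
```

With this layout, a sequence of length L becomes an L × 7 matrix that can be passed to a one-dimensional convolution along the sequence axis.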

3.3. Performance comparison analysis of different models

The classifier is an important component of the experiment and is closely related to the final experimental results. Building a suitable predictive model can greatly improve experimental performance. Deep learning is a machine learning algorithm with feature learning ability. It can extract and learn low-level data features to obtain more abstract high-level features. In recent years, the genomics databases have grown rapidly. Only by using classifiers with stronger learning ability can we better learn and mine effective information in huge databases [43]. In this section, a predictor called Definer was constructed based on commonly used deep learning algorithms. We compared it with several traditional machine learning algorithms and commonly used deep learning algorithms, including SVM, RF, LightGBM, and CNN, based on ten-fold cross-validation on three benchmark datasets. The comparison results of the three benchmark datasets are shown in Fig 3. From the figure, it can be seen that the evaluation indicators of deep learning algorithms are much higher than those of traditional machine learning algorithms. We built a new predictor Definer based on CNN, which integrates two GRUs and introduces Attention. From the prediction results, our model is superior to other classifiers in all four evaluation indicators.
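The ten-fold cross-validation protocol used for these comparisons can be sketched as follows. The text does not state whether the folds were stratified by class, so the stratification, the fixed random seed, and the function name `stratified_k_fold` are assumptions made for illustration.

```python
import random


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples of each class are shuffled and dealt round-robin into k folds,
    so every fold keeps roughly the original class balance.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        test_idx = sorted(folds[held_out])
        train_idx = sorted(i for f, fold in enumerate(folds)
                           if f != held_out for i in fold)
        yield train_idx, test_idx


if __name__ == "__main__":
    # A balanced toy label vector; 1 = positive sample, 0 = negative sample.
    labels = [1] * 495 + [0] * 495
    for fold, (train_idx, test_idx) in enumerate(stratified_k_fold(labels), 1):
        positives = sum(labels[i] for i in test_idx)
        print(fold, len(train_idx), len(test_idx), positives)
```

Each sample is held out exactly once, and the reported metrics are typically averaged over the ten held-out folds.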

[Figure omitted. See PDF.]

3.4. Comparative analysis on independent datasets

To further verify the ability of Definer to predict pseudouridine sites, we evaluated it on two independent test sets covering H. sapiens and S. cerevisiae, and compared its predictive performance with that of several existing methods on the same test sets. The results are presented in Table 6. Definer outperforms the other methods on all four evaluation metrics on both independent test sets. For H. sapiens in particular, the ACC reaches 83.50%, which is 6% higher than that of the best existing method, Porpoise. These results support the predictive power of Definer and its potential usefulness for pseudouridine site prediction and related bioinformatics research.

[Figure omitted. See PDF.]

3.5. Performance comparison with state-of-the-art methods

In this section, we compared Definer with seven state-of-the-art methods, namely XG-PseU [19], iPseU-CNN [18], PseUI [17], iRNA-PseU [16], EnsemPseU [20], RF-PseU [21], and Porpoise [25], on the three benchmark datasets. The detailed comparison results are presented in Table 7. Under ten-fold cross-validation on the same benchmark datasets, Definer performs best overall. For S. cerevisiae and M. musculus, Definer surpasses the other seven prediction methods on all four evaluation metrics. For H. sapiens, the Sn of Porpoise is marginally higher than that of our predictor, but our predictor performs better on the remaining three evaluation metrics. This indicates that Definer achieves higher accuracy and more stable performance than existing methods. Together with the independent test results in the preceding section, these comparisons show that Definer can accurately predict pseudouridine sites in the three species, H. sapiens, S. cerevisiae, and M. musculus.

[Figure omitted. See PDF.]

3.6. Software engineering

In the field of bioinformatics, there are numerous commonly used analysis methods and online tools. For example, sequence alignment tools like GGMSA [44] are widely utilized. GGMSA allows for the comparison of nucleotide or amino acid sequences, enabling the identification of homologous sequences and providing insights into evolutionary relationships and functional similarities. It uses efficient algorithms to search large sequence databases rapidly, which is a remarkable technical achievement.

Another important tool is gene expression analysis software such as DESeq2 [45]. It can analyze differential gene expression between different samples or conditions. By applying statistical models and normalization techniques, it helps to identify genes that are significantly upregulated or downregulated, which is crucial for understanding biological processes and disease mechanisms.

Recent advancements in computational methods, such as the MRNDR [46], further enhance the ability to analyze complex biological data and uncover potential drug repurposing opportunities through sophisticated attention mechanisms and deep learning architectures.

However, existing tools have certain limitations when it comes to predicting potential pseudouridine sites from RNA sequences. While they may focus on general sequence analysis or other types of RNA modifications, they do not specifically target pseudouridine sites with high accuracy and user-friendly visualization.

Software tools and web servers have become essential in the Internet age. To address the need for identifying potential pseudouridine sites, we developed a graphical software tool based on our model, Definer, using the Python Tkinter framework. The main interface of the software is shown in Fig 4. It provides users with an intuitive and convenient way to input RNA sequences and obtain predictions of potential pseudouridine sites. The underlying Definer model was designed and trained to improve the accuracy of pseudouridine site prediction, filling a gap in the existing bioinformatics tool landscape and offering a new approach to this aspect of RNA sequence analysis.

[Figure omitted. See PDF.]

On the main interface, users can type or paste the RNA sequence they want to query into the input box in the lower right corner, or upload a txt file. Clicking the “example” button shows the required format for the txt file. After a file is successfully uploaded, its content is displayed in the lower right box. The software provides three models for users to choose from, named after the corresponding species: H. sapiens, S. cerevisiae, and M. musculus. When users click the “Confire” button above the text box, the model analyzes the sequence and returns the sequence’s name, its length, and whether it contains a pseudouridine site. After the prediction is complete, users can click the “download” button to export the prediction result file to a specified path. The prediction result interface is shown in Fig 5.

[Figure omitted. See PDF.]

4. Discussion

The accurate identification of pseudouridine sites is of great significance as it is involved in numerous crucial biological processes. In this study, we developed a novel computational method, Definer, to address the challenge of identifying RNA pseudouridine loci in H. sapiens, S. cerevisiae, and M. musculus.

With the explosive growth of genomics data, the limitations of traditional experimental methods for identifying RNA modification sites have become increasingly prominent. Computational methods have emerged as a powerful alternative. Our proposed Definer method combines two sequence coding schemes, NCP and One-hot, which allows for a more comprehensive representation of RNA sequence features. By feeding these features into a deep learning model composed of CNN, GRU, and Attention, we were able to capture both local and global sequence information, as well as the importance of different regions within the sequence.

The benchmark dataset, which includes data from three species, provided a solid foundation for evaluating the performance of Definer. The results of 10-fold cross-validation clearly demonstrated that Definer outperforms other existing methods. This superiority can be attributed to the effective combination of the sequence coding schemes and the powerful deep learning architecture. The independent testing of the data sets of H. sapiens and S. cerevisiae further validated the robustness and predictive ability of the model.

However, it is important to note that there are still some limitations and areas for improvement. For example, although Definer has shown good performance, the complexity of biological systems means that there may be other factors that affect pseudouridine site identification that have not been fully considered. Future studies could explore incorporating additional types of data, such as structural information or epigenetic marks, to further enhance the accuracy of the model.

In addition, as new experimental techniques for detecting pseudouridine sites are developed, the benchmark datasets may need to be updated and refined to ensure the continued relevance and effectiveness of computational methods like Definer. Overall, our study represents an important step forward in the accurate identification of RNA pseudouridine sites, and we believe that Definer has the potential to be a valuable tool in further understanding the functional mechanisms of this important modification site and its implications in various biological processes and diseases.

5. Conclusion

In this research, we have tackled the crucial task of identifying RNA pseudouridine sites. Pseudouridine, being widely distributed in non-coding RNAs and implicated in essential biological functions and diseases, demands accurate identification to decipher its functional mechanisms. The exponential growth of genomics data has necessitated the development of computational approaches, as traditional experimental methods have become insufficient.

Our proposed method, Definer, offers a promising solution. By integrating two sequence coding strategies, NCP and One-hot, it extracts comprehensive RNA sequence features. These features are then processed by a deep learning model composed of CNN, GRU, and Attention, which effectively captures the complex patterns and relationships within the RNA sequences.

The evaluation using a benchmark dataset covering three species, H. sapiens, S. cerevisiae, and M. musculus, and the application of 10-fold cross-validation have provided robust evidence that Definer outperforms existing methods. The independent testing of the datasets from H. sapiens and S. cerevisiae further bolsters the confidence in the model’s predictive capabilities.

Overall, Definer represents a significant advancement in the field of RNA pseudouridine site identification. It has the potential to enhance our understanding of the role of pseudouridine in gene expression, RNA structural stability, and disease mechanisms. Future research could focus on further optimizing the method, exploring its application in other species or RNA types, and investigating potential synergies with other omics data to provide a more holistic view of the complex regulatory networks involving pseudouridine. With continued development and refinement, Definer could become an invaluable tool in both basic biological research and clinical applications related to RNA modifications.

References

1. 1. Jack K, Bellodi C, Landry DM, Niederer RO, Meskauskas A, Musalgaonkar S, et al. rRNA pseudouridylation defects affect ribosomal ligand binding and translational fidelity from yeast to human cells. Mol Cell. 2011;44(4):660–6. pmid:22099312

* View Article

* PubMed/NCBI

* Google Scholar

2. 2. Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014;515(7525):143–6. pmid:25192136

* View Article

* PubMed/NCBI

* Google Scholar

3. 3. Pelchat M, Perreault J-P. Binding site of Escherichia coli RNA polymerase to an RNA promoter. Biochem Biophys Res Commun. 2004;319(2):636–42. pmid:15178453

* View Article

* PubMed/NCBI

* Google Scholar

4. 4. Maroney PA, Romfo CM, Nilsen TW. Nuclease protection of RNAs containing site-specific labels: a rapid method for mapping RNA-protein interactions. RNA. 2000;6(12):1905–9. pmid:11142388

* View Article

* PubMed/NCBI

* Google Scholar

5. 5. Basak A, Query CC. A pseudouridine residue in the spliceosome core is part of the filamentous growth program in yeast. Cell Rep. 2014;8(4):966–73. pmid:25127136

* View Article

* PubMed/NCBI

* Google Scholar

6. 6. Mei Y-P, Liao J-P, Shen J, Yu L, Liu B-L, Liu L, et al. Small nucleolar RNA 42 acts as an oncogene in lung tumorigenesis. Oncogene. 2012;31(22):2794–804. pmid:21986946

* View Article

* PubMed/NCBI

* Google Scholar

7. 7. Cohn WE. Pseudouridine, a carbon-carbon linked ribonucleoside in ribonucleic acids: isolation, structure, and chemical characteristics. J Biol Chem. 1960;235(5):1488–98.

* View Article

* Google Scholar

8. 8. Li X, Ma S, Yi C. Pseudouridine: the fifth RNA nucleotide with renewed interests. Curr Opin Chem Biol. 2016;33:108–16. pmid:27348156

* View Article

* PubMed/NCBI

* Google Scholar

9. 9. Chan CM, Huang RH. Enzymatic characterization and mutational studies of TruD--the fifth family of pseudouridine synthases. Arch Biochem Biophys. 2009;489(1–2):15–9. pmid:19664587

* View Article

* PubMed/NCBI

* Google Scholar

10. 10. Karijolich J, Yu Y-T. Converting nonsense codons into sense codons by targeted pseudouridylation. Nature. 2011;474(7351):395–8. pmid:21677757

* View Article

* PubMed/NCBI

* Google Scholar

11. 11. Zhang Y, Hamada M. DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinform. 2018;19(Suppl 19):524. pmid:30598068

* View Article

* PubMed/NCBI

* Google Scholar

12. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. pmid:20110278

13. Swami A, Jain R. Scikit-learn: machine learning in Python. J Machine Learn Res. 2013;12(10):2825–30.

14. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. pmid:16731699

15. Li Y-H, Zhang G, Cui Q. PPUS: a web server to predict PUS-specific pseudouridine sites. Bioinformatics. 2015;31(20):3362–4. pmid:26076723

16. Chen W, Tang H, Ye J, Lin H, Chou K-C. iRNA-PseU: identifying RNA pseudouridine sites. Mol Ther Nucleic Acids. 2016;5(7):e332. pmid:28427142

17. He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinform. 2018;19(1):306. pmid:30157750

18. Tahir M, Tayara H, Chong KT. iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks. Mol Ther Nucleic Acids. 2019;16:463–70. pmid:31048185

19. Liu K, Chen W, Lin H. XG-PseU: an eXtreme gradient boosting based method for identifying pseudouridine sites. Mol Genet Genomics. 2020;295(1):13–21. pmid:31392406

20. Bi Y, Jin D, Jia C. EnsemPseU: identifying pseudouridine sites with an ensemble approach. IEEE Access. 2020;8:79376–82.

21. Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: a random forest predictor for RNA pseudouridine sites. Front Bioeng Biotechnol. 2020;8:134. pmid:32175316

22. Khan SM, He F, Wang D, Chen Y, Xu D. MU-PseUDeep: a deep learning method for prediction of pseudouridine sites. Comput Struct Biotechnol J. 2020;18:1877–83. pmid:32774783

23. Song B, Tang Y, Wei Z, Liu G, Su J, Meng J, et al. PIANO: a web server for pseudouridine-site (Ψ) identification and functional annotation. Front Genet. 2020;11:88. pmid:32226440

24. Song B, Chen K, Tang Y, Ma J, Meng J, Wei Z. PSI-MOUSE: predicting mouse pseudouridine sites from sequence and genome-derived features. Evol Bioinform Online. 2020;16:1176934320925752. pmid:32565674

25. Li F, Guo X, Jin P, Chen J, Xiang D, Song J, et al. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites. Brief Bioinform. 2021;22(6):bbab245. pmid:34226915

26. Niu M, Zhang J, Li Y, Wang C, Liu Z, Ding H, et al. CirRNAPL: a web server for the identification of circRNA based on extreme learning machine. Comput Struct Biotechnol J. 2020;18:834–42. pmid:32308930

27. Lv H, Zhang Z-M, Li S-H, Tan J-X, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2020;21(3):982–95. pmid:31157855

28. Chai D, Jia C, Zheng J, Zou Q, Li F. Staem5: a novel computational approach for accurate prediction of m5C site. Mol Ther Nucleic Acids. 2021;26:1027–34. pmid:34786208

29. Bonidia RP, Machida JS, Negri TC, Alves WAL, Kashiwabara AY, Domingues DS, et al. A novel decomposing model with evolutionary algorithms for feature selection in long non-coding RNAs. IEEE Access. 2020;8:181683–97.

30. Wang H, Liu H, Huang T, Li G, Zhang L, Sun Y. EMDLP: ensemble multiscale deep learning model for RNA methylation site prediction. BMC Bioinform. 2022;23(1):221. pmid:35676633

31. Linder J, La Fleur A, Chen Z, Ljubeti A, Baker D, Kannan S, et al. Interpreting neural networks for biological sequences by learning stochastic masks. Nat Mach Intell. 2022;4(1):41–54. pmid:35966405

32. Waleed A, Hilal T, To K C. XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials. Sci Rep. 2020;10(1):20942.

33. Jia-Jia X, Wen-Ju S, Peng-Hui L, et al. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2018;46(D1):D327–34.

34. Li J, Huang Y, Yang X, Zhou Y, Zhou Y. RNAm5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on random forest. Sci Rep. 2018;8(1):17299. pmid:30470762

35. Yang Y-H, Ma C, Wang J-S, Yang H, Ding H, Han S-G, et al. Prediction of N7-methylguanosine sites in human RNA based on optimal sequence features. Genomics. 2020;112(6):4342–7. pmid:32721444

36. Klimo M, Lukáč P, Tarábek P. Deep neural networks classification via binary error-detecting output codes. Appl Sci. 2021;11(8):3563.

37. Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57. pmid:31067315

38. Zhang L, Qin X, Liu M, et al. DNN-m6A: a cross-species method for identifying RNA N6-methyladenosine sites based on deep neural network with multi-information fusion. Genes. 12(3):354.

39. Tayara H, Tahir M, Chong KT. iSS-CNN: identifying splicing sites using convolution neural network. Chemom Intell Lab Syst. 2019;188:63–9.

40. Qiu W-R, Jiang S-Y, Xu Z-C, Xiao X, Chou K-C. iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget. 2017;8(25):41178–88. pmid:28476023

41. Khan A, Rehman HU, Habib U, Ijaz U. Detecting N6-methyladenosine sites from RNA transcriptomes using random forest. J Comput Sci. 2020;47:101238.

42. Vacic V, Iakoucheva LM, Radivojac P. Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–7. pmid:16632492

43. Dou L, Li X, Ding H, et al. iRNA-m5C_NB: a novel predictor to identify RNA 5-Methylcytosine sites based on the Naive Bayes classifier. IEEE Access. 2020;99:1–1.

44. Zhou L, Feng T, Xu S, Gao F, Lam TT, Wang Q, et al. ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Brief Bioinform. 2022;23(4):bbac222. pmid:35671504

45. Dong X, Du MRM, Gouil Q, Tian L, Jabbari JS, Bowden R, et al. Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures. Nat Methods. 2023;20(11):1810–21. pmid:37783886

46. Feng X, Ma Z, Yu C, Xin R. MRNDR: multihead attention-based recommendation network for drug repurposing. J Chem Inf Model. 2024;64(7):2654–69.

Citation: Han B, Bai S, Liu Y, Wu J, Feng X, Xin R (2025) Definer: A computational method for accurate identification of RNA pseudouridine sites based on deep learning. PLoS ONE 20(4): e0320077. https://doi.org/10.1371/journal.pone.0320077

About the Authors:

Bo Han

Roles: Conceptualization, Methodology, Writing – review & editing

Affiliation: Jilin Chemical Hospital, Jilin, P.R. China

Sudan Bai

Roles: Investigation, Methodology

Affiliation: College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, P.R. China

Yang Liu

Roles: Data curation, Software, Validation

Affiliation: Jilin Chemical Hospital, Jilin, P.R. China

Jiezhang Wu

Roles: Formal analysis, Visualization, Writing – original draft

Affiliation: College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, P.R. China

Xin Feng

Roles: Writing – original draft

E-mail: [email protected] (XF); [email protected] (RX)

Affiliation: School of Science, Jilin Institute of Chemical Technology, Jilin, P.R. China

ORCID: https://orcid.org/0000-0002-0016-2507

Ruihao Xin

Roles: Validation, Writing – review & editing

E-mail: [email protected] (XF); [email protected] (RX)

Affiliation: College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin, P.R. China

References

1. Jack K, Bellodi C, Landry DM, Niederer RO, Meskauskas A, Musalgaonkar S, et al. rRNA pseudouridylation defects affect ribosomal ligand binding and translational fidelity from yeast to human cells. Mol Cell. 2011;44(4):660–6. pmid:22099312

2. Carlile TM, Rojas-Duran MF, Zinshteyn B, Shin H, Bartoli KM, Gilbert WV. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature. 2014;515(7525):143–6. pmid:25192136

3. Pelchat M, Perreault J-P. Binding site of Escherichia coli RNA polymerase to an RNA promoter. Biochem Biophys Res Commun. 2004;319(2):636–42. pmid:15178453

4. Maroney PA, Romfo CM, Nilsen TW. Nuclease protection of RNAs containing site-specific labels: a rapid method for mapping RNA-protein interactions. RNA. 2000;6(12):1905–9. pmid:11142388

5. Basak A, Query CC. A pseudouridine residue in the spliceosome core is part of the filamentous growth program in yeast. Cell Rep. 2014;8(4):966–73. pmid:25127136

6. Mei Y-P, Liao J-P, Shen J, Yu L, Liu B-L, Liu L, et al. Small nucleolar RNA 42 acts as an oncogene in lung tumorigenesis. Oncogene. 2012;31(22):2794–804. pmid:21986946

7. Cohn WE. Pseudouridine, a carbon-carbon linked ribonucleoside in ribonucleic acids: isolation, structure, and chemical characteristics. J Biol Chem. 1960;235(5):1488–98.

8. Li X, Ma S, Yi C. Pseudouridine: the fifth RNA nucleotide with renewed interests. Curr Opin Chem Biol. 2016;33:108–16. pmid:27348156

9. Chan CM, Huang RH. Enzymatic characterization and mutational studies of TruD--the fifth family of pseudouridine synthases. Arch Biochem Biophys. 2009;489(1–2):15–9. pmid:19664587

10. Karijolich J, Yu Y-T. Converting nonsense codons into sense codons by targeted pseudouridylation. Nature. 2011;474(7351):395–8. pmid:21677757

11. Zhang Y, Hamada M. DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinform. 2018;19(Suppl 19):524. pmid:30598068

12. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. pmid:20110278

13. Swami A, Jain R. Scikit-learn: machine learning in Python. J Machine Learn Res. 2013;12(10):2825–30.

14. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. pmid:16731699

15. Li Y-H, Zhang G, Cui Q. PPUS: a web server to predict PUS-specific pseudouridine sites. Bioinformatics. 2015;31(20):3362–4. pmid:26076723

16. Chen W, Tang H, Ye J, Lin H, Chou K-C. iRNA-PseU: identifying RNA pseudouridine sites. Mol Ther Nucleic Acids. 2016;5(7):e332. pmid:28427142

17. He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinform. 2018;19(1):306. pmid:30157750

18. Tahir M, Tayara H, Chong KT. iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks. Mol Ther Nucleic Acids. 2019;16:463–70. pmid:31048185

19. Liu K, Chen W, Lin H. XG-PseU: an eXtreme gradient boosting based method for identifying pseudouridine sites. Mol Genet Genomics. 2020;295(1):13–21. pmid:31392406

20. Bi Y, Jin D, Jia C. EnsemPseU: identifying pseudouridine sites with an ensemble approach. IEEE Access. 2020;8:79376–82.

21. Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: a random forest predictor for RNA pseudouridine sites. Front Bioeng Biotechnol. 2020;8:134. pmid:32175316

22. Khan SM, He F, Wang D, Chen Y, Xu D. MU-PseUDeep: a deep learning method for prediction of pseudouridine sites. Comput Struct Biotechnol J. 2020;18:1877–83. pmid:32774783

23. Song B, Tang Y, Wei Z, Liu G, Su J, Meng J, et al. PIANO: a web server for pseudouridine-site (Ψ) identification and functional annotation. Front Genet. 2020;11:88. pmid:32226440

24. Song B, Chen K, Tang Y, Ma J, Meng J, Wei Z. PSI-MOUSE: predicting mouse pseudouridine sites from sequence and genome-derived features. Evol Bioinform Online. 2020;16:1176934320925752. pmid:32565674

25. Li F, Guo X, Jin P, Chen J, Xiang D, Song J, et al. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites. Brief Bioinform. 2021;22(6):bbab245. pmid:34226915

26. Niu M, Zhang J, Li Y, Wang C, Liu Z, Ding H, et al. CirRNAPL: a web server for the identification of circRNA based on extreme learning machine. Comput Struct Biotechnol J. 2020;18:834–42. pmid:32308930

27. Lv H, Zhang Z-M, Li S-H, Tan J-X, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2020;21(3):982–95. pmid:31157855

28. Chai D, Jia C, Zheng J, Zou Q, Li F. Staem5: a novel computational approachfor accurate prediction of m5C site. Mol Ther Nucleic Acids. 2021;26:1027–34. pmid:34786208

29. Bonidia RP, Machida JS, Negri TC, Alves WAL, Kashiwabara AY, Domingues DS, et al. A novel decomposing model with evolutionary algorithms for feature selection in long non-coding RNAs. IEEE Access. 2020;8:181683–97.

30. Wang H, Liu H, Huang T, Li G, Zhang L, Sun Y. EMDLP: ensemble multiscale deep learning model for RNA methylation site prediction. BMC Bioinform. 2022;23(1):221. pmid:35676633

31. Linder J, La Fleur A, Chen Z, Ljubeti A, Baker D, Kannan S, et al. Interpreting neural networks for biological sequences by learning stochastic masks. Nat Mach Intell. 2022;4(1):41–54. pmid:35966405

32. Waleed A,Hilal T,To K C.XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials[J]. Sci Rep. 2020;10(1):20942.

33. Jia-Jia X,Wen-Ju S,Peng-Hui L, et al.RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data.[J]. Nucleic Acids Res. 2018;46(D1):D327–34.

34. Li J, Huang Y, Yang X, Zhou Y, Zhou Y. RNAm5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on random forest. Sci Rep. 2018;8(1):17299. pmid:30470762

35. Yang Y-H, Ma C, Wang J-S, Yang H, Ding H, Han S-G, et al. Prediction of N7-methylguanosine sites in human RNA based on optimal sequence features. Genomics. 2020;112(6):4342–7. pmid:32721444

36. Klimo M, Lukáč P, Tarábek P. Deep neural networks classification via binary error-detecting output codes. Appl Sci. 2021;11(8):3563.

37. Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57. pmid:31067315

38. Zhang L, Qin X, Liu M, et al. DNN-m6A: a cross-species method for identifying RNA N6-methyladenosine sites based on deep neural network with multi-information fusion. Genes. 12(3):354.

39. Tayara H, Tahir M, Chong KT. iSS-CNN: identifying splicing sites using convolution neural network. Chemom Intell Lab Syst. 2019;188:63–9.

40. Qiu W-R, Jiang S-Y, Xu Z-C, Xiao X, Chou K-C. iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget. 2017;8(25):41178–88. pmid:28476023

41. Khan A, Rehman HU, Habib U, Ijaz U. Detecting N6-methyladenosine sites from RNA transcriptomes using random forest. J Comput Sci. 2020;47:101238.

42. Vacic V, Iakoucheva LM, Radivojac P. Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–7. pmid:16632492

43. Dou L, Li X, Ding H, et al. iRNA-m5C_NB: a novel predictor to identify RNA 5-Methylcytosine sites based on the Naive Bayes classifier, IEEE Access 2020;99:1–1.

44. Zhou L, Feng T, Xu S, Gao F, Lam TT, Wang Q, et al. ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Brief Bioinform. 2022;23(4):bbac222. pmid:35671504

45. Dong X, Du MRM, Gouil Q, Tian L, Jabbari JS, Bowden R, et al. Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures. Nat Methods. 2023;20(11):1810–21. pmid:37783886

46. Feng X, Ma Z, Yu C, Xin R. MRNDR: multihead attention-based recommendation network for drug repurposing. J Chem Inf Model. 2024;64(7):2654–69.

© 2025 Han et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
