CodLncScape Provides a Self‐Enriching Framework

Introduction

Long non-coding RNAs (lncRNAs), a key class of RNA molecules widely distributed across the genome, are traditionally not considered to have protein-coding capabilities. Nonetheless, they are involved in various fundamental biological processes, such as signal transduction, post-transcriptional modification, and the formation of cellular structure, thereby playing an indispensable role in regulating gene expression and maintaining cellular functionality.^[¹^] Compared to other RNA molecules, lncRNAs are characterized by their structural and functional diversity, complex genomic locations, and generally low levels of endogenous expression. These attributes intensify the challenge of deciphering lncRNA functions.^[^1b,2^]

Recent studies have progressively revealed that certain lncRNAs contain open reading frames (ORFs) and, under specific spatiotemporal conditions, can be recognized by the cellular translation machinery, resulting in the synthesis of short peptides or proteins. These molecules play roles in a variety of physiological and pathological processes, including cellular differentiation, tissue development, and tumor formation.^[³^] Such lncRNAs with coding potential are referred to as coding lncRNAs. For instance, acting as a non-coding RNA, LINC00961 reduces β-catenin protein levels in non-small cell lung cancer (NSCLC), thereby inhibiting cellular invasion and metastasis.^[⁴^] LINC00961 also encodes the peptide SPAR, which interacts with the lysosomal v-ATPase to negatively regulate mTORC1 activation.^[^3b^] Similarly, LINC00673 enhances the interaction between PTPN11 and PRPF19, an E3 ubiquitin ligase, thereby diminishing SRC-ERK oncogenic signaling in pancreatic cancer.^[⁵^] Beyond this non-coding function, LINC00673 encodes a novel peptide, RASON, which serves as an enhancer of oncogenic RAS signaling.^[⁶^] The widespread existence of translation events in lncRNAs has been further substantiated by techniques such as ribosome profiling, high-throughput CRISPR screening, and mass spectrometry.^[⁷^] Taken together, the discovery of coding lncRNAs not only adds a new dimension to our understanding of the genomic coding landscape and its regulatory complexity but also blurs the distinction between coding and non-coding RNAs, prompting researchers to reconsider RNA functions and evolution.^[⁸^]

Currently, the exploration of translational events in lncRNAs and the identification of novel coding lncRNAs have emerged as focal areas in lncRNA research.^[^7a,9^] However, most of the individual coding lncRNA studies, while valuable, often only reveal a fraction of the real coding lncRNA universe. This limited scope has led to a gap in comprehensive and systematic research of coding lncRNAs, potentially leading to biases in our overall understanding of this field. To mitigate this issue, several studies have begun to aggregate coding lncRNAs and establish related databases, such as ncEP,^[¹⁰^] FuncPEP,^[¹¹^] EVLncRNAs 2.0,^[¹²^] cncRNAdb,^[^8a^] and LncPep.^[¹³^] Nevertheless, these databases contain a limited number of high-confidence coding lncRNA entries. Alongside these efforts, researchers also attempted to systematically identify endogenously translated peptides from coding lncRNAs, but functional investigations in this area are still limited.^[¹⁴^]

In light of the scarcity of systematic studies, we introduce codLncScape, a self-enriching framework containing an interconnected suite of resources and tools for coding lncRNA exploration. At its foundation, codLncScape incorporates codLncDB, a manually compiled knowledge base of coding lncRNAs. Leveraging codLncDB, we then developed codLncFlow, a computational workflow for exploring the implications of coding lncRNAs in pathological and physiological conditions. To further enhance its utility, we developed codLncWeb, a platform for storing, browsing, and retrieving data across various programming environments. Finally, we established codLncNLP, a knowledge-mining tool that integrates natural language processing (NLP) models, such as a prompt-learning model based on ChatGPT-4, to support the timely inclusion and update of content in codLncDB. This initiative is aimed at constructing a rich, scalable ecosystem focused on coding lncRNAs, with the intention of accelerating research in this area and reducing knowledge biases.

Results

The Overview of codLncScape

The general architecture of codLncScape encompassed 4 main components: the establishment of a coding lncRNA knowledge base (codLncDB), the exploration of coding lncRNAs in pathological (pan-cancer) and physiological (spermatogenesis) contexts (codLncFlow), the development of a platform for storing, browsing, and fetching data (codLncWeb), and the establishment of a knowledge-mining tool (codLncNLP) for timely content update (Figure 1). Fundamental to our study is the creation of a self-sustaining cycle of research that continually propels the field forward. From codLncDB to codLncFlow, through codLncWeb, to codLncNLP, and back to codLncDB, this cycle ensures ongoing contributions to lncRNA research.

[IMAGE OMITTED. SEE PDF]

Specifically, our initial step involved filtering articles from PubMed using keywords, supplemented by rigorous expert review, to create codLncDB, a manually curated knowledge-base of 353 coding lncRNA entries (supported by low-throughput experiments). Notably, 293 of these entries represent novel additions not cataloged in other databases (Figure S1, Supporting Information; Experimental Section). Our subsequent focus was to elucidate the implications of these coding lncRNAs in various contexts, including pathological and physiological conditions. We conducted a global coding lncRNA-centric clustering analysis and then examined their diagnostic potential in pan-cancer contexts. Alongside, we explored the differentiation-associated coding lncRNAs and tried to infer their functions during spermatogenesis. Then, to facilitate the widespread data analysis, we developed an online platform, codLncWeb, embedding database, R, and Python packages for storing, browsing, and fetching coding lncRNAs data. This platform facilitates research across various programming environments. Finally, we incorporated an array of machine-learning approaches, a pre-training model,^[¹⁵^] and a ChatGPT 4.0-based^[¹⁶^] prompt-learning model to establish a knowledge-mining tool and thereby ensure the continuous updating and expansion of the content in codLncDB.

Decoding the Role of Coding LncRNAs Across Diverse Cancers

Numerous studies have elucidated the significant correlation between coding lncRNAs and their translated micropeptides with the onset and progression of various cancers.^[¹⁷^] Thus, we systematically analyzed the expression of coding lncRNAs across diverse cancers and their relationship with cancer pathology. A total of 140 coding lncRNAs were detected in the TCGA pan-cancer dataset (9,784 samples from 33 cancer types). Upon performing dimension reduction analysis (t-SNE) based on the 140 coding lncRNAs, we observed an obvious cancer-type preference, as depicted in Figures 2a and S2 (Supporting Information). Consistent with previous findings,^[¹⁸^] we found that tumor samples sharing the same cellular origin also tended to cluster together (Figure 2b). Consequently, we employed an unsupervised graph-based clustering algorithm, which identified 22 distinct clusters among the tumor samples, denoted as cluster 0 to cluster 21 (Figure 2c). To elucidate the associations between these 22 clusters and the 33 cancer types, we formulated two evaluative metrics: The Tumor Clustering Score (TCS) represents whether a certain type of cancer tends to cluster together, while the Tumor Purity Score (TPS) indicates whether the samples within a cluster are likely to originate from a specific type of cancer. Our analysis revealed that 75% of cancer types achieved high TCS values (>86.03%) (Figure 2d; Figure S3a, Supporting Information). This finding underscored the potent efficacy of coding lncRNAs in tumor classification. Meanwhile, though 75% of clusters exhibited high TPS values (>70.10%), such as clusters 1, 5, 6, 20, and 21 (Figure 2e; Figure S3b, Supporting Information), others, like clusters 0, 2, and 18, displayed low TPS values (<50%). This disparity implied that certain clusters contain samples from multiple cancer types.

[IMAGE OMITTED. SEE PDF]

Subsequently, we focused on clusters exhibiting low TPS. Specifically, we scrutinized cluster 0, which comprised samples from various tumor types, such as Head and Neck Squamous Cell Carcinoma (HNSC), Lung Squamous Cell Carcinoma (LUSC), Cervical Squamous Cell Carcinoma (CESC), Bladder Urothelial Carcinoma (BLCA), and Esophageal Carcinoma (ESCA), as indicated in Figure 2f. Notably, over half of the samples in this cluster were from HNSC (33.38%) and LUSC (29.58%), both of which are squamous cell carcinomas. This led us to ask whether cluster 0 tended to aggregate various types of squamous cell carcinomas. Further analysis of the CESC and ESCA samples within this cluster supported this hypothesis, revealing that 233 out of 253 squamous cell samples from CESC and 83 out of 93 from ESCA fell into cluster 0 (Figure 2g). Furthermore, we quantified the enrichment of squamous cell samples in each cluster based on the ratio of observed to expected values (R_O/E), as illustrated in Figures 2h and S3c (Supporting Information). The results demonstrated significant enrichment of squamous cell samples from HNSC, LUSC, CESC, and ESCA in cluster 0. These findings implied that cluster 0 had a lower TPS because it predominantly aggregated squamous cell carcinoma samples from different cancers, indirectly highlighting the ability of coding lncRNAs to identify squamous cell carcinomas regardless of cancer origin.

Moreover, we extended our analysis to examine the capacity of coding lncRNAs for subtyping certain cancers. For instance, in the case of Breast Cancer (BRCA), we observed a distinct division of samples into two separate clusters: 82.79% (909 out of 1098) of the samples were categorized into cluster 0, while 17.21% (189 out of 1098) were allocated to cluster 1, as shown in Figure 2i. Clinical annotations based on the PAM50 molecular subtyping of BRCA elucidated the segregation of subtypes across these clusters: the Basal-like subtype was primarily concentrated in cluster 1, whereas the LumA, LumB, Her2, Normal, and unknown subtypes, which are clinically distinct from Basal-like, were predominantly located in cluster 0 (Figure 2j; Table S1, Supporting Information). The consistency of this clustering outcome was observed across the additional 2 molecular subtype identification methods (Figure S4, Supporting Information). Regarding gliomas, our cluster analysis postulated a possible alignment of Lower Grade Glioma (LGG) and Glioblastoma Multiforme (GBM) along a tumorigenic timeline continuum. To investigate this hypothesis, we conducted a pseudotime analysis, which revealed that LGG typically exhibited lower pseudotime values, while GBM presented higher values. This pattern potentially mirrors a disease progression trajectory in gliomas from LGG to GBM (Figure 2k,l). These findings imply that coding lncRNAs have the potential to distinguish between specific subtypes or clinical stages of certain cancers, thereby providing valuable insights into the molecular characterization and classification of various cancers.

Tumor Classification, Typing, and Grading Based on Coding LncRNAs

Given the potential of coding lncRNAs in tumor classification, typing, and grading, we designed a graph-based model for personalized cancer diagnosis. The model initially constructed a graph for every patient based on the relative expression rank between different coding lncRNAs. This graph was then employed as a feature input of each patient into an XGBoost model for disease diagnosis and subgraph analysis (Figure 3a).

[IMAGE OMITTED. SEE PDF]

In the task of predicting squamous cell carcinoma patients from ESCA and CESC, the model assigned high scores to squamous and low scores to non-squamous (Figure 3b; Table S2, Supporting Information). Furthermore, patients from the same cancer type exhibited more similar scores. The model demonstrated impressive performance (AUC: 0.9618, AUPR: 0.9764, and MCC: 0.8011, Figure 3c). Then, the SHAP value analysis identified the relationship (Edge, TINCR – LBX2-AS1) as the most crucial contributor to disease diagnosis (Figure 3d; Table S2, Supporting Information), with all contributing relationships forming a core subgraph and LINC00467 positioned at the hub center (Figure 3e). Similarly, in typing Basal and non-Basal patients from BRCA, the model exhibited robust performance (AUC: 0.9795, AUPR: 0.9646, and MCC: 0.8947, Figure 3f,g), with the relationship (Edge, LINC01116 – LINC00511) emerging as the most significant contributor to disease diagnosis, and LINC01116 at the subgraph hub (Figure 3h,i; Table S3, Supporting Information). In grading LGG and GBM patients, the model also displayed high predictive accuracy (Figure 3j,k). The relationship (Edge, RP11-295G20.2 – WAC-AS1) was identified as the most critical relationship, with LINC00467 at the center of the subgraph (Figure 3l,m; Table S4, Supporting Information). In addition, these identified coding lncRNAs exhibit distinct expression patterns across a range of cancer subtypes mentioned above, demonstrating a strong association with specific subtypes (Figure S5, Supporting Information).

Furthermore, we analyzed the relative value of the edge in the graph and the absolute value of coding lncRNA expression, according to the most crucial relationship identified in the above three tasks. The results showed that the relative value was more stable across different samples than the absolute value, which is crucial for clinical testing and wide applicability (Figure 3n).

Exploring Spermatogenesis-Related Coding LncRNAs

In addition to documenting associations with various diseases, recent research has also shed light on the role of coding lncRNA in cellular differentiation and development.^[^3a,19^] Consequently, we attempted to re-analyze scRNA-seq data of testicular cells consisting of 354 cells and 134 coding lncRNAs from early spermatogonia, including spermatogonial stem cell (SSC), differentiating spermatogonia (Diff.ing.SPG) and differentiated spermatogonia (Diff.ed.SPG). Dimensionality reduction analysis based on coding lncRNAs indicated a general trend of separation among cells at different spermatogenesis stages (Figure 4a; Figure S6, Supporting Information). To identify coding lncRNAs involved in spermatogenesis, we employed Spearman Correlation analysis and identified 13 coding lncRNAs significantly correlated with pseudotime (Figure 4b). Subsequent hierarchical clustering yielded two groups with divergent expression trends: cluster 1, where expression decreased during SSC differentiation, and cluster 2, which showed an opposite expression trend (Figure 4c). Additionally, we integrated canonical markers of sperm stemness (GFRA1, ZBTB16) and differentiation (KIT, SOHLH2) into our analysis alongside these 13 coding lncRNAs to explore similar expression pattern modules (Figure 4d). The results showed that cluster 1 aligned with genes maintaining stemness (GFRA1, ZBTB16, highlighted in red text), while cluster 2 aligned with genes promoting differentiation (KIT, SOHLH2, highlighted in red text). This correlation further underscores the role of these coding lncRNAs in the developmental process of spermatogenesis.

[IMAGE OMITTED. SEE PDF]

To further explore the functions of spermatogenesis-related coding lncRNAs, we specifically focused on TUNAR and MALAT1, from clusters 1 and 2, respectively. Based on the hypothesis that a peptide's function depends on its sequence, we analyzed peptides translated from TUNAR and MALAT1 by conducting BLAST searches in the UniProt database.^[²⁰^] This yielded 250 and 28 similar peptides for TUNAR and MALAT1, respectively (Tables S5 and S6, Supporting Information). For TUNAR, we showcased a portion of these alignment results (Figure 4e). Then, we inferred the functions of these peptides by analyzing the enriched biological process (BP) terms of their similar peptides. The results from the BP functional enrichment analysis suggested that TUNAR peptides might primarily be involved in transmembrane transport and cell surface receptor signaling pathways during spermatogenesis (Figure 4f; Figure S7a, Supporting Information). Furthermore, we analyzed the intercellular communication between SSCs and Sertoli cells (ST), finding that communication intensity progressively diminished during spermatogenesis, paralleling the expression trend of TUNAR (Figure 4h). Meanwhile, peptides similar to MALAT1 were significantly enriched in biosynthetic processes and respiratory electron transport chains, suggesting a potential involvement of MALAT1 in cell metabolism-related processes (Figure 4g; Figure S7b, Supporting Information). Further single-cell pathway analysis of 8 metabolic pathways revealed significant dynamic alterations in metabolic processes during spermatogonial development (Figure 4i).

Development of the Platform for Broader Data Analysis

To advance the widespread application and analysis of coding lncRNAs, we have established codLncWeb (), an online platform dedicated to storing, browsing, and exploring coding lncRNAs. codLncWeb incorporated the knowledge of 353 experimentally validated coding lncRNAs, complete with detailed annotations such as gene and transcript IDs, ORFs, peptide sequences, and associated reference lists (Figure 5a,b). codLncWeb delivered an exceptional user experience through its asynchronous database architecture, ensuring rapid data retrieval while concurrently upholding stringent user privacy via the HTTPS protocol (Figure 5c). Moreover, codLncWeb also embedded the packages designed for compatibility with the R and Python programming environments (Figure 5d). The core structure of packages, LncRNAData, enabled the extraction of necessary information using functions like getRNA, getPeptide, and getORF. Finally, we deployed the above tools to an easily accessible website, complete with usage instructions (Figure 5e).

[IMAGE OMITTED. SEE PDF]

Based on codLncDB, we analyzed the sequence characteristics and subcellular location of coding lncRNAs and compared them with those of non-coding lncRNAs. The sequence characteristics of the two lncRNA classes differed, although the differences were not pronounced (Figure 5f). Furthermore, we predicted the subcellular location of these lncRNAs and found that coding lncRNAs have a greater tendency than non-coding lncRNAs to localize with ribosomes (Figure 5g).

Continuously Collecting Coding LncRNA with a Knowledge-Mining Tool

The field of research on coding lncRNAs is rapidly expanding, with new coding lncRNAs being continually identified.^[^7b,21^] However, relying solely on manual curation is time-consuming and inefficient. To address this challenge and facilitate timely updates, we implemented a text-mining approach, codLncNLP, which combines expert review with state-of-the-art NLP models to streamline the data-updating process (Figure 6a). We integrated traditional machine learning methods such as logistic regression, support vector machines (SVM), and random forest, along with the pre-training model Bioformer^[¹⁵^] and a prompt-learning model based on ChatGPT 4.0.^[¹⁶^] Given the limited size of the benchmark dataset (160 sentences; see the Experimental Section for details), we adopted a few-shot evaluation strategy: between 10% and 50% of the dataset was used as the training set, and the remainder served as the test set. Overall, Bioformer consistently surpassed the other models in terms of AUC, AUPR, and MCC across different training set sizes (Figure 6b–d; Table S7, Supporting Information). In addition, logistic regression showed higher AUC and AUPR values but lagged in MCC, whereas the prompt-learning model displayed the reverse pattern, with lower AUC and AUPR but higher MCC.
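The few-shot evaluation protocol can be illustrated with a minimal sketch. The text gives only the dataset size (160 sentences) and the training range (10% to 50%), so the 10% step size, the plain random shuffling, the fixed seed, and the function name `few_shot_splits` are all assumptions made for illustration and not the authors' procedure.

```python
import random

def few_shot_splits(n_items, fractions=(0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    """Return {fraction: (train_indices, test_indices)} for each training fraction.

    Assumption: a simple random split without stratification; the source
    does not say how the sentences were sampled.
    """
    rng = random.Random(seed)
    splits = {}
    for frac in fractions:
        shuffled = list(range(n_items))
        rng.shuffle(shuffled)
        n_train = round(n_items * frac)
        # The first n_train shuffled items train the model; the rest test it.
        splits[frac] = (shuffled[:n_train], shuffled[n_train:])
    return splits

# 160 sentences, as in the benchmark dataset described above.
splits = few_shot_splits(160)
for frac, (train, test) in splits.items():
    print(f"train fraction {frac:.0%}: {len(train)} train / {len(test)} test")
```

Each model would then be fitted on the training indices and scored on the held-out indices for every fraction, giving one AUC, AUPR, and MCC value per training set size.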

[IMAGE OMITTED. SEE PDF]

Discussion

Recent studies have extensively demonstrated that under certain conditions, some lncRNAs are capable of producing peptides, playing varied biological roles in both disease and physiological processes.^[²²^] However, current research often concentrates on the identification and functional analysis of translational events in individual lncRNAs, leading to a noticeable gap in the systematic exploration and analysis of coding lncRNAs. Therefore, this study introduces codLncScape, a well-functioning and self-enriching framework designed to aid in the research of coding lncRNAs. codLncScape begins by constructing a manually curated dataset of coding lncRNAs, followed by a systematic investigation into their expression and function in both pathological and physiological contexts. This represents a preliminary compilation and summary of currently identified coding lncRNAs, offering a solid foundation for the future identification and functional analysis of new coding lncRNAs. Moreover, codLncScape developed a platform, releasing the database and packages for use in various programming environments, along with the establishment of NLP tools to assist in the timely integration and updating of knowledge. The ultimate goal of codLncScape is to build a rich and fully functional portal where researchers can freely and conveniently access, learn, and exchange the latest advancements and knowledge on coding lncRNAs, thereby further advancing the research in this field.

Our study in the pan-cancer genome revealed that coding lncRNAs not only demonstrated good classification capabilities for different tumor types and cell origins but also distinguished certain tumor subtypes and clinical stages. Furthermore, the personalized diagnosis model developed based on coding lncRNAs further confirmed the predictive effectiveness of coding lncRNAs in tumor subtyping and staging. These results suggest that coding lncRNAs are closely associated with the development of various cancers and hold the potential for clinical early warning and diagnosis. Similarly, systematic analysis of coding lncRNAs in spermatogenesis-related scRNA-seq data identified several coding lncRNAs highly related to the spermatogenic process. Further analysis exploring the functions of peptides expressed by these coding lncRNAs indicated that these translation products might be involved in intercellular communication and metabolic processes during spermatogenesis. These findings suggest that coding lncRNAs, through their translated peptides, may play a role in regulating cellular differentiation and development, meriting further investigation.

Additionally, this study systematically collated and annotated currently identified coding lncRNAs and developed an integrated data resource platform. On this platform, a series of data management and analysis toolkits were developed to bridge the gap among different programming environments and enhance interdisciplinary communication. Our findings also indicated that, compared with non-coding lncRNAs, coding lncRNAs show intrinsic sequence attributes and subcellular localizations that favor translation. Extracting features and constructing models specifically for coding lncRNAs may pose a significant challenge in future research endeavors.^[²³^] Furthermore, this study integrated machine learning, pre-training models, and prompt-learning methods to develop a knowledge-mining tool for coding lncRNAs, which enriches the content of codLncDB in a timely and efficient manner.

This study still has some limitations. First, the collected coding lncRNAs, validated by low-throughput experiments, are limited in number and inevitably carry certain research biases; for example, many documented coding lncRNAs are related to tumorigenesis. Additionally, our current research predominantly focuses on the expression of coding lncRNAs in various physiological and pathological states, yet it lacks a comprehensive analysis of the translational dynamics and specific functions of the peptides these lncRNAs encode. In the future, we plan to collect more coding lncRNA data, expand our dataset, and enhance the prediction and exploration of the functions of their translated peptides.

In summary, this study not only compiled a manually curated knowledge base of coding lncRNAs but also systematically explored their roles in disease and physiology. It developed and integrated a series of tools and workflows for data collection, management, and analysis around coding lncRNAs. Our aim is to construct a user-friendly and content-rich ecosystem focused on this research hotspot, progressively providing data foundations and technical support for the mechanistic and functional elucidation of coding lncRNAs.

Experimental Section

Establishment of codLncDB

The coding lncRNA knowledge base developed in this study, codLncDB, was curated from the literature and five distinct databases. For curation, the following keyword combinations were used to filter publications in PubMed (mainly from 2018 to 2023): (encoded [Title/Abstract]) AND (lncRNA [Title/Abstract]), (peptide [Title/Abstract]) AND (lncRNA [Title/Abstract]), (translation [Title/Abstract]) AND (lncRNA [Title/Abstract]). Then, all retrieved publications were preliminarily reviewed by expert curators to filter out false-positive papers. Only experimentally supported coding lncRNAs (those whose peptide was detected by low-throughput experiments) were collected, resulting in 293 entries. Additionally, another 60 coding lncRNA entries were integrated from five databases: ncEP,^[¹⁰^] FuncPEP,^[¹¹^] EVLncRNAs 2.0,^[¹²^] cncRNAdb,^[^8a^] and LncPep.^[¹³^] In total, 353 coding lncRNA entries (337 human, 16 mouse) were documented, involving 329 lncRNAs (Figure S1, Supporting Information).

To unify the coding lncRNAs from multiple sources against authoritative reference databases, we mapped all the lncRNAs to five databases (Ensembl,^[²⁴^] NCBI Gene,^[²⁵^] RNAcentral,^[²⁶^] NONCODE,^[²⁷^] and LNCipedia^[²⁸^]) to annotate coding lncRNAs and their corresponding transcripts. During annotation, the IDs used in the original literature were prioritized to ensure accuracy and consistency. Additionally, codLncDB was compared against existing resources^[^8a,10–13,29^] to highlight its distinctive advantages. This comparison covered three dimensions: the source of evidence, the species involved, and the data categories. The outcomes of this comparison are detailed in Figure S8 (Supporting Information).

Pan-Cancer RNA Sequencing Data Collection

The pan-cancer transcriptome data of TCGA was downloaded from the pancancer_xena platform,^[³⁰^] which involves 33 types of cancer, sourced from the IlluminaHiSeq platform. All data analyses in this study were based on TPM values. The clinical metadata, including the cell origin and cancer subtype, were also compiled for the entire cohort.

Pan-cancer RNA Sequencing Data Processing

The 329 coding lncRNAs were intersected with genes detected in the pan-cancer transcriptome. This process required the inclusion of lncRNAs detected in at least three samples for further analysis. Subsequently, a cohort of 140 coding lncRNAs exhibiting this overlap was obtained, culminating in a profile including 9,784 samples across 33 cancer types.

To further analyze the pan-cancer data, a clustering analysis was performed with the Seurat package (v4.2.0).^[³¹^] The log-transformed (log1p) TPM expression data were assigned to the data slot of the Seurat object. To extract features of the coding lncRNA profile in pan-cancer, the Seurat functions ScaleData and RunPCA were employed. Given the small number of original features, the parameter ‘features’ in RunPCA was set to the 140 coding lncRNAs. Then, the Seurat function FindNeighbors (dims = 1:20) was used to construct a shared nearest neighbor graph, and the Seurat function FindClusters (resolution = 0.5) was used to identify clusters. PCs 1–20 were selected to perform t-SNE analysis with the Seurat function RunTSNE (perplexity = 30).

Tumor Clustering and Purity

To explore the relationship between different cancer types and clusters, TCS was calculated for every cancer type and TPS for every cluster. First, for the i-th cancer type, the TCS is calculated as follows:

$$TCS_{i,j} = \frac{T_{i,j}}{T_i} \times 100\% \tag{1}$$

$$TCS_i = \max\left( TCS_{i,j} \mid j \in \{1, 2, \ldots, m\} \right) \tag{2}$$

where T_i is the total number of samples for the i-th cancer type, and T_i,j is the number of samples of the i-th cancer type within the j-th cluster. TCS_i,j represents the percentage of the i-th cancer type samples assigned to the j-th cluster, and TCS_i represents the maximum percentage of the i-th cancer type across all m clusters. TCS thus reflects the clustering ability of coding lncRNAs for a specific cancer type; a higher TCS_i value indicates a stronger clustering ability.
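As an illustration of Equations (1) and (2), the following minimal sketch computes TCS from per-sample labels. The function name `tumor_clustering_scores` and the toy labels are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter, defaultdict

def tumor_clustering_scores(cancer_types, clusters):
    """Return {cancer_type: (best_cluster, TCS_i in percent)}.

    cancer_types[s] and clusters[s] are the labels of sample s.
    """
    totals = Counter(cancer_types)            # T_i
    per_cancer = defaultdict(Counter)         # T_{i,j}
    for cancer, cluster in zip(cancer_types, clusters):
        per_cancer[cancer][cluster] += 1

    scores = {}
    for cancer, counts in per_cancer.items():
        # TCS_i is the largest TCS_{i,j} over all clusters j.
        best_cluster, best_count = max(counts.items(), key=lambda kv: kv[1])
        scores[cancer] = (best_cluster, 100.0 * best_count / totals[cancer])
    return scores

# Toy labels (hypothetical): 4 BRCA samples and 4 LUSC samples.
cancer_types = ["BRCA"] * 4 + ["LUSC"] * 4
clusters = [1, 1, 1, 0, 0, 0, 0, 0]
tcs = tumor_clustering_scores(cancer_types, clusters)
print(tcs)
```

In this toy case three of the four BRCA samples fall into cluster 1, so TCS for BRCA is 75%, while all LUSC samples share cluster 0, giving a TCS of 100%.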

Simultaneously, for the j-th cluster, the TPS is computed as follows:

$$TPS_{i,j} = \frac{T_{i,j}}{C_j} \times 100\% \tag{3}$$

$$TPS_j = \max\left( TPS_{i,j} \mid i \in \{1, 2, \ldots, n\} \right) \tag{4}$$

where C_j is the number of samples contained within the j-th cluster, and T_i,j is the number of samples of the i-th cancer type within the j-th cluster. TPS_i,j represents the percentage of the i-th cancer type samples in the j-th cluster, and TPS_j represents the maximum percentage in the j-th cluster across all n cancer types; a higher TPS_j value indicates higher tumor purity. If the TPS for any given cluster exceeds 50%, it suggests that the cluster has a dominant cancer type and can be termed a cancer-associated cluster. If the TPS is below 50%, the cluster is considered a hyper-cluster composed of a mix of various cancer types.
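Equations (3) and (4), together with the 50% rule, can be sketched as follows. The function name `tumor_purity_scores` and the toy labels are hypothetical. The text does not say how a TPS of exactly 50% is handled, so treating it as a hyper-cluster here is an assumption.

```python
from collections import Counter, defaultdict

def tumor_purity_scores(cancer_types, clusters):
    """Return {cluster: (dominant_cancer_type, TPS_j in percent, label)}."""
    cluster_sizes = Counter(clusters)         # C_j
    per_cluster = defaultdict(Counter)        # T_{i,j}, grouped by cluster
    for cancer, cluster in zip(cancer_types, clusters):
        per_cluster[cluster][cancer] += 1

    result = {}
    for cluster, counts in per_cluster.items():
        # TPS_j is the largest TPS_{i,j} over all cancer types i.
        dominant, count = max(counts.items(), key=lambda kv: kv[1])
        tps = 100.0 * count / cluster_sizes[cluster]
        # Assumption: exactly 50% is labelled a hyper-cluster.
        label = "cancer-associated" if tps > 50.0 else "hyper-cluster"
        result[cluster] = (dominant, tps, label)
    return result

# Toy labels (hypothetical): a mixed cluster 0 and a BRCA-dominated cluster 1.
cancer_types = ["HNSC", "LUSC", "CESC", "BLCA", "HNSC", "BRCA", "BRCA", "BRCA", "OV"]
clusters = [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(tumor_purity_scores(cancer_types, clusters))
```

Here cluster 0 has a TPS of 40% and is labelled a hyper-cluster, whereas cluster 1 has a TPS of 75% and is labelled cancer-associated.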

Enrichment Score of Squamous Cell Samples across Clusters

To analyze the enrichment of squamous cell samples across different clusters, we used the R_O/E (observed value divided by expected value) method to evaluate the enrichment between clusters and tumor types. The expected value, E_i,j, represents the number of the i-th cancer type squamous cell samples expected to appear in the j-th cluster under random conditions. In contrast, the observed value, O_i,j, represents the number of the i-th cancer type squamous cell samples in the j-th cluster. The $R_{O/E}^{i,j}$ is calculated as follows:

$$E_{i,j} = \left( \frac{T_j}{T} \right) \times S_i \tag{5}$$

$$R_{O/E}^{i,j} = \frac{O_{i,j}}{E_{i,j}} \tag{6}$$

where T_j represents the total number of samples in the j-th cluster; T represents the total number of samples across all clusters; S_i represents the number of squamous cell samples for the i-th cancer type; and $R_{O/E}^{i,j}$ represents the enrichment score of squamous cell samples between the i-th cancer type and the j-th cluster. An enrichment score >1 indicates an enrichment relationship between a cluster and a tumor type.
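A minimal sketch of Equations (5) and (6) is given below. The function name `observed_over_expected`, its dictionary-based inputs, and the toy counts are hypothetical. The sketch assumes every cluster is non-empty and every listed cancer type has at least one squamous sample, so that the expected value is never zero.

```python
def observed_over_expected(observed, cluster_sizes, squamous_totals):
    """Compute R_O/E for every (cancer type, cluster) pair.

    observed[(i, j)]   -> O_{i,j}, squamous samples of cancer type i in cluster j
    cluster_sizes[j]   -> T_j, total samples in cluster j
    squamous_totals[i] -> S_i, squamous samples of cancer type i
    """
    total = sum(cluster_sizes.values())       # T
    scores = {}
    for cancer, s_i in squamous_totals.items():
        for cluster, t_j in cluster_sizes.items():
            expected = t_j * s_i / total      # E_{i,j} = (T_j / T) * S_i
            scores[(cancer, cluster)] = observed.get((cancer, cluster), 0) / expected
    return scores

# Toy counts (hypothetical): 100 samples in two clusters, 10 squamous CESC samples.
cluster_sizes = {0: 20, 1: 80}
squamous_totals = {"CESC": 10}
observed = {("CESC", 0): 8, ("CESC", 1): 2}
roe = observed_over_expected(observed, cluster_sizes, squamous_totals)
print(roe)
```

In this toy example cluster 0 is expected to hold 2 squamous CESC samples but holds 8, giving a ratio of 4 (enriched), while cluster 1 yields a ratio of 0.25 (depleted).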

Molecular Subtype Analysis of Breast Cancer

TCGA's annotated clinical data on breast cancer molecular subtypes (PAM50_mRNA_nature2012 and PAM50Call_RNAseq) were used, covering 518 and 843 samples, respectively, after excluding samples with unknown subtype. Basal subtype samples were designated class 1 (positive samples), and non-basal subtype samples were designated class 0 (negative samples). The unsupervised clustering outcomes (clusters 0 and 1) served as the predicted labels, allowing the computation of accuracy, F1 score, Matthews correlation coefficient (MCC), precision, and recall under both subtype annotations. Furthermore, the basal subtype signature, M8124 in MSigDB,^[³²^] was used with single-sample enrichment analysis methods to calculate a basal score for each sample, with higher scores indicating greater similarity to the basal subtype. Correlation coefficients and their statistical significance were then computed with a two-sided Spearman test.
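The evaluation of cluster assignments against the basal versus non-basal labels can be sketched with the standard confusion-matrix definitions. The function name `binary_metrics` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Toy labels (hypothetical): 1 = basal, 0 = non-basal;
# y_pred is the cluster assignment (cluster 1 treated as "predicted basal").
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Because the clusters come from unsupervised analysis, the cluster index is used directly as the predicted class, so no threshold has to be chosen.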

Graph-Based XGBoost Diagnosis Model

To construct a cancer diagnostic model based on coding lncRNAs, the study developed a machine-learning model, codLncDisease, which used a graph composed of coding lncRNAs as the feature input for each patient into the XGBoost model. The specific workflow is as follows:

For a coding-lncRNA graph G, the study defines:

$\begin{equation}G = (V, E, W) \tag{7}\end{equation}$

$\begin{equation}V = \{ v_1, v_2, \ldots, v_n \} \tag{8}\end{equation}$

$\begin{equation}E = \{ e_{ij} \mid v_i, v_j \in V,\ i < j \} \tag{9}\end{equation}$

$\begin{equation}W: E \times I \to \{ -1, 0, 1 \} \tag{10}\end{equation}$

The weight function $W_{(e_{ij},k)}$ is set according to the following rules:

$\begin{equation}W_{(e_{ij},k)} = \begin{cases} 1, & \text{if } x_{i,k} - x_{j,k} > 1 \\ -1, & \text{if } x_{i,k} - x_{j,k} < -1 \\ 0, & \text{otherwise} \end{cases} \tag{11}\end{equation}$

where G is the graph used as input for the codLncDisease model, V represents the set of nodes encompassing all the coding lncRNAs {v₁, v₂, …, v_n}, E represents the set of edges between nodes, and e_ij indicates the edge between nodes v_i and v_j, giving $C_n^2$ edges in total. The weight function W assigns a weight to each edge in E for each patient k. x_i,k and x_j,k are the expression values of lncRNAs v_i and v_j within patient k, and the edge weight is assigned according to their expression difference. Because this strategy builds the tumor diagnostic model from the relative ranking of different coding lncRNAs rather than from absolute expression values, the impact of experimental techniques on the model's performance can be significantly reduced, enhancing the model's robustness. Ultimately, we feed the graph into the XGBoost model for prediction of tumor types:

$\begin{equation}XGBoost(G) = XGBoost(V, E, W) \tag{12}\end{equation}$
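The edge-weight rule in Equation (11) can be sketched for a single patient as follows. This is a minimal illustration rather than the codLncDisease implementation: the function name `edge_weights`, the `threshold` argument, and the toy expression values are hypothetical, and flattening the edges into one vector per patient is only one plausible way of passing the graph to a tree-based classifier.

```python
from itertools import combinations

def edge_weights(expression, threshold=1.0):
    """expression: values x_{1,k}, ..., x_{n,k} of n coding lncRNAs in patient k.
    Returns one weight in {-1, 0, 1} per edge e_ij with i < j (Eq. 11)."""
    weights = []
    for i, j in combinations(range(len(expression)), 2):
        diff = expression[i] - expression[j]
        if diff > threshold:
            weights.append(1)       # v_i clearly higher than v_j
        elif diff < -threshold:
            weights.append(-1)      # v_i clearly lower than v_j
        else:
            weights.append(0)       # difference within the tolerance band
    return weights

# Toy expression values (illustrative only) for four coding lncRNAs in one patient.
patient = [5.0, 3.5, 4.2, 0.0]
weights = edge_weights(patient)     # edges ordered (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
```

With n = 4 nodes the vector has $C_4^2$ = 6 entries. Stacking such vectors across patients would give a feature matrix in which every column corresponds to one edge of the graph.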

The codLncDisease model assigns every sample a score ranging from 0 to 1. The performance metrics for the models include AUC (Area Under Curve), AUPR (Area Under Precision-Recall Curve), and MCC (Matthews Correlation Coefficient), all computed with the sklearn.metrics module of the scikit-learn package (v1.1.1).

Identified Core Subgraph of Coding-LncRNA Graph

SHAP is a game-theoretic approach to explaining machine learning models. The SHAP module (v0.43.0) was employed to identify the core subgraph within the coding-lncRNA graph. The SHAP explainer function was used to compute the model's feature importance score (Mean|SHAP|), which indicates the average contribution of each feature to the model's prediction, with larger values representing greater importance. Then, all features with SHAP values >0 were selected to create the core subgraph.
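The selection step can be sketched in a few lines once a matrix of per-sample SHAP values is available. The sketch below does not call the SHAP package; it assumes the values have already been computed, and it reads the selection criterion as Mean|SHAP| > 0, which is an interpretation of the text. The function name `core_features` and the edge names are hypothetical.

```python
def core_features(shap_values, feature_names):
    """shap_values: list of rows (one per sample), one SHAP value per feature.
    Returns (importance dict, list of features with Mean|SHAP| > 0)."""
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[idx]) for row in shap_values) / n_samples
        for idx, name in enumerate(feature_names)
    }
    core = [name for name, score in importance.items() if score > 0]
    return importance, core

# Toy SHAP values (illustrative only): 2 samples x 3 edge features.
toy_shap = [
    [0.2, 0.0, -0.1],
    [-0.4, 0.0, 0.3],
]
edge_names = ["e_12", "e_13", "e_23"]
importance, core = core_features(toy_shap, edge_names)
```

Here the edge `e_13` never contributes to a prediction, so it is dropped, and the remaining edges form the core subgraph.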

Spermatogenesis scRNA-seq Data Collection and Processing

The scRNA-seq data of human spermatogenesis, generated with modified Smart-seq2 technology, were collected from the NCBI GEO database (GSE106487).^[³³^] The study obtained 354 cells across the SSC, Diff.ing.SPG, and Diff.ed.SPG stages, and expression levels were normalized as log2[TPM/10+1] (TPM, transcripts per million). The 329 coding lncRNAs were then intersected with the genes detected in the spermatogenesis transcriptome, requiring each lncRNA to be expressed in at least three cells for further analysis. This yielded 134 overlapping coding lncRNAs and a coding lncRNA profile covering 354 cells across three stages.
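The normalization and filtering described above can be sketched as follows. The sketch assumes that "expressed" means TPM > 0, which the text does not state explicitly; the function names and gene names are hypothetical.

```python
import math

def normalize_tpm(tpm):
    """Apply the log2[TPM/10 + 1] transformation to one TPM value."""
    return math.log2(tpm / 10 + 1)

def filter_and_normalize(profile, min_cells=3):
    """profile: {gene: [TPM per cell]}.
    Keep genes with TPM > 0 in at least min_cells cells, then normalize."""
    kept = {}
    for gene, values in profile.items():
        n_expressing = sum(1 for v in values if v > 0)
        if n_expressing >= min_cells:
            kept[gene] = [normalize_tpm(v) for v in values]
    return kept

# Toy TPM profile (illustrative only): 2 lncRNAs x 4 cells.
toy_profile = {
    "lncA": [0, 10, 30, 70],   # expressed in 3 cells, kept
    "lncB": [0, 0, 5, 0],      # expressed in 1 cell, dropped
}
kept = filter_and_normalize(toy_profile)
```

In the toy profile only `lncA` passes the three-cell requirement, and its TPM values 0, 10, 30, and 70 map to 0, 1, 2, and 3 after transformation.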

The scRNA-seq data were processed with Seurat4 (v4.2.0). The log-transformed (log1p) TPM expression data were assigned to the data slot of the Seurat object. To extract the primary features of the coding lncRNA profile, the Seurat functions ScaleData and RunPCA were employed. Because only a limited number of original features were available, the ‘features’ parameter of RunPCA was set to all 134 coding lncRNAs. PCs 1–20 were then selected for t-SNE analysis with the Seurat function RunTSNE (perplexity = 70). Cell identities were obtained from the original paper.

Correlation Analysis of Coding LncRNA

To identify the coding lncRNAs associated with pseudotime, the stats package (v4.2.1) was used to calculate Spearman correlation coefficients between gene expression and pseudotime. The study screened for pseudotime-associated coding lncRNAs using the thresholds |r| > 0.2 and p < 0.05. Next, the canonical marker genes of SSC stemness, GFRA1 and ZBTB16, were combined with the canonical marker genes of spermatogonia differentiation, KIT and SOHLH2, as well as the pseudotime-associated coding lncRNAs. Spearman correlation coefficients among these genes were calculated, and unsupervised clustering was performed, using the corrplot package (v0.92).
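The correlation screen can be sketched in plain Python as below. The study performed this step in R; this illustration computes Spearman's coefficient as the Pearson correlation of average ranks and applies only the |r| > 0.2 filter, so the p < 0.05 criterion is deliberately omitted. All function names, gene names, and values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Toy data (illustrative only): pseudotime of 5 cells and 3 expression profiles.
pseudotime = [0.1, 0.5, 0.9, 1.4, 2.0]
expression = {
    "lncUp": [1, 2, 3, 4, 5],
    "lncDown": [5, 4, 3, 2, 1],
    "lncFlat": [2, 3, 1, 3, 2],
}
rho = {gene: spearman_rho(values, pseudotime) for gene, values in expression.items()}
associated = [gene for gene, r in rho.items() if abs(r) > 0.2]
```

A full analysis would also test each coefficient for significance before accepting a gene as pseudotime-associated.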

BLAST Analysis of Peptides Encoded by Coding LncRNAs

To search for peptide sequences similar to those translated from TUNAR (ORF 446–592) and MALAT1 (ORF 3086–3229), the “blastp” tool was used, with “UniProtKB reference proteomes” and “Swiss-Prot” selected as the reference databases (covering both reference proteomes and reviewed proteins).^[³⁴^] The E-value threshold was set at 10, consistent with the default settings.

Functional Enrichment of LncRNA-coding Peptide

To elucidate the potential biological functions of the lncRNA-coding peptides, the Gene Ontology biological process terms (BP terms)^[³⁵^] of similar peptides were used as possible functions for the peptides of TUNAR and MALAT1. To streamline the functional annotation and make it easier to interpret, the rrvgo package (v1.8.0)^[³⁶^] was used to calculate the similarity between different BP terms. Subsequently, the study clustered the BP terms and nominated the most prominent BP term within each cluster as the representative (Parent term).

Analysis of Cell Communication

For the analysis of cellular communication, ST cells, which are integral to establishing the early microenvironment for SSCs in the testicular niche, were incorporated. Using the CellCall package (v1.0.0),^[³⁷^] the complete expression profile across the various stages of spermatogenesis was analyzed, including SSC, Diff.ing.SPG, Diff.ed.SPG, and ST cells. Subsequently, the TransCommuProfile and ViewInterCircos functions of the CellCall package were applied to infer potential ligand-receptor interactions among the cell types and to evaluate the overall communication intensity.

Analysis of Metabolic Pathway Activity

To determine the activation of each sub-metabolic pathway in every cell, the AUCell package (v1.18.1)^[³⁸^] was employed to calculate metabolic pathway activity, and a two-sided t-test from the stats package (v4.2.1) was used to assess the statistical significance of differences between the two groups. A total of eight metabolic pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) were collected for the activity analysis,^[³⁹^] including Metabolic pathways (hsa01100), Carbon metabolism (hsa01200), 2-Oxocarboxylic acid metabolism (hsa01210), Fatty acid metabolism (hsa01212), Biosynthesis of amino acids (hsa01230), Nucleotide metabolism (hsa01232), Biosynthesis of nucleotide sugars (hsa01250), and Biosynthesis of cofactors (hsa01240) (Table S8, Supporting Information).

Development of codLncWeb

The architecture of the platform was built upon the HTTPS protocol, ensuring secure access for users. Its search functionality was powered by the DataTables framework, leveraging JavaScript for asynchronous data querying, thus optimizing both search efficiency and user-friendliness. The platform embedded an innovative resource manager codLncPackage tailored for compatibility with the R and Python programming environments. Its core structure, LncRNAData, encapsulates comprehensive information on transcripts, peptides, and ORFs, allowing users to effortlessly extract necessary information using methods like getRNA, getPeptide, and getORF. To support and guide users, the study provided extensive documentation and tutorials, available on the official server, ensuring that the platform was not only functional but also user-centric and accessible.

Sequence Features and Subcellular Localization

For the analysis of lncRNA sequence features and subcellular localization, the data were downloaded from GENCODE V44, and the lncRNAs listed in codLncDB were removed. After excluding the 282 human lncRNAs identified in codLncDB, another 282 lncRNAs were randomly selected as non-coding samples for analysis. Forty-nine features of lncRNAs were analyzed as in previous research,^[⁴⁰^] categorized into sequence intrinsic properties, physicochemical characteristics, and secondary structures. For subcellular localization predictions, the iLoc-lncRNA tool was used,^[⁴¹^] covering seven locations: cytoplasm, cytosol, exosome, nucleolus, nucleoplasm, nucleus, and ribosome. This enabled a comparative analysis of the localization tendencies of coding and non-coding lncRNAs.

Development of codLncNLP

A corpus was constructed for the text mining of coding lncRNAs. Initially, 80 positive sentences containing confirmed information about coding lncRNAs were curated. These sentences were manually selected from 77 research papers related to coding lncRNAs. In parallel, 80 negative sentences, which do not contain information about coding lncRNAs, were compiled to serve as negative samples. These were randomly and manually selected from publications found on PubMed using the keyword “lncRNA”. Subsequently, the nltk package was employed for text preprocessing, which included removing special characters, deleting non-linguistic content, tokenization, and lemmatization. This process resulted in a balanced text corpus comprising 160 sentences (80 positives vs. 80 negatives), which is available in Table S9 (Supporting Information).

codLncNLP integrated five NLP models: three machine-learning methods (logistic regression, SVM, and random forest), a pre-trained model (Bioformer),^[¹⁵^] and a prompt-learning model (ChatGPT-4.0). The machine-learning models were implemented with the scikit-learn package (v1.1.1). For the pre-trained model, we fine-tuned Bioformer (bioformer-8L) on our corpus using the simpletransformers package (v0.64.3). Moreover, a prompt-learning model was developed based on the ChatGPT-4.0 architecture,^[¹⁶^] with detailed prompts available in Table S10 (Supporting Information). The performance of the different models was assessed using the AUC, AUPR, and MCC metrics in the sklearn.metrics module.

Statistical Analysis

All statistical analyses were conducted using R version 4.2.1. Hierarchical clustering analysis for heatmaps was performed using the “euclidean” and “complete” methods, as implemented in the R package pheatmap (v1.0.12). Pseudotime analysis was performed with the monocle package (v2.24.1).^[⁴²^] Spearman's rank correlation coefficient was employed to measure the relationship between two variables, with the corresponding significance assessed using a two-sided hypothesis test (|r| > 0.2, p < 0.05), as implemented in the R package corrplot (v0.92). The significance analysis of metabolic pathway activation scores was performed using the t-test, as implemented in the R package stats (v4.2.1). All p-values were two-sided, with p < 0.05 indicating a statistically significant difference.

Data Availability

The dataset of coding lncRNAs used in this study is freely available at . The pan-cancer data used in this study is available at the UCSC Xena Database (), and scRNA-seq data of spermatogenesis is available at the NCBI Gene Expression Omnibus (GEO) database (GSE106487).^[³³^]

Code Availability

The codLncWeb is freely accessible at . All source code used in this paper is also deposited at .

Acknowledgements

This study was supported by the National Natural Science Foundation of China (62131004 and 62202069), the Natural Science Foundation of Sichuan Province (2022NSFSC1610 and 2023NSFSC0678), the China Postdoctoral Science Foundation (2023M730507), the Sichuan Province Postdoctoral Research Project Special Support Foundation (TB2023012), the JST SPRING (JPMJSP2124), the JSPS KAKENHI (JP23H03411 and JP22K12144), and the JST (JPMJPF2017). During the preparation of this work, the authors used ChatGPT 4.0 to improve language and readability. After using this service, the authors reviewed and edited the content as needed and took full responsibility for the content of the publication.

Conflict of Interest

The authors declare no conflict of interest.

Author Contributions

Y.Z., H.L., and X.Y. conceived this project. T.L., H.Q., and Z.W. designed and performed the experiments. X.Y. helped with data interpretation and visualization. Y.Z. and T.L. wrote the manuscript. Z.W., X.P., and Y.Y. helped with manuscript reviewing. Y.Z. and T.S. supervised this project.

Data Availability Statement

The data that support the findings of this study are available in the supplementary material of this article.

References

a) T. Ali, P. Grote, Elife 2020, 9, [eLocator: e60583];

b) J. S. Mattick, P. P. Amaral, P. Carninci, S. Carpenter, H. Y. Chang, L. L. Chen, R. Chen, C. Dean, M. E. Dinger, K. A. Fitzgerald, T. R. Gingeras, M. Guttman, T. Hirose, M. Huarte, R. Johnson, C. Kanduri, P. Kapranov, J. B. Lawrence, J. T. Lee, J. T. Mendell, T. R. Mercer, K. J. Moore, S. Nakagawa, J. L. Rinn, D. L. Spector, I. Ulitsky, Y. Wan, J. E. Wilusz, M. Wu, Nat. Rev. Mol. Cell Biol. 2023, 24, 430.

a) P. E. Saw, X. Xu, J. Chen, E. W. Song, Sci China Life Sci 2021, 64, 22;

b) B. Chen, M. P. Dragomir, C. Yang, Q. Li, D. Horst, G. A. Calin, Signal Transduct Target Ther 2022, 7, 121.

a) S. Mise, A. Matsumoto, K. Shimada, T. Hosaka, M. Takahashi, K. Ichihara, H. Shimizu, C. Shiraishi, D. Saito, M. Suyama, T. Yasuda, T. Ide, Y. Izumi, T. Bamba, T. Kimura‐Someya, M. Shirouzu, H. Miyata, M. Ikawa, K. I. Nakayama, Nat. Commun. 2022, 13, 1071;

b) A. Matsumoto, A. Pasut, M. Matsumoto, R. Yamashita, J. Fung, E. Monteleone, A. Saghatelian, K. I. Nakayama, J. G. Clohessy, P. P. Pandolfi, Nature 2017, 541, 228;

c) N. Meng, M. Chen, D. Chen, X. H. Chen, J. Z. Wang, S. Zhu, Y. T. He, X. L. Zhang, R. X. Lu, G. R. Yan, Adv. Sci. 2020, 7, [eLocator: 1903233];

d) X. Wang, H. Zhang, S. Yin, Y. Yang, H. Yang, J. Yang, Z. Zhou, S. Li, G. Ying, Y. Ba, EMBO Rep. 2022, 23, [eLocator: e53140].

B. Jiang, J. Liu, Y. H. Zhang, D. Shen, S. Liu, F. Lin, J. Su, Q. F. Lin, S. Yan, Y. Li, W. D. Mao, Z. L. Liu, Biomed. Pharmacother. 2018, 97, 1311.

J. Zheng, X. Huang, W. Tan, D. Yu, Z. Du, J. Chang, L. Wei, Y. Han, C. Wang, X. Che, Y. Zhou, X. Miao, G. Jiang, X. Yu, X. Yang, G. Cao, C. Zuo, Z. Li, C. Wang, S. T. Cheung, Y. Jia, X. Zheng, H. Shen, C. Wu, D. Lin, Nat. Genet. 2016, 48, 747.

R. Cheng, F. Li, M. Zhang, X. Xia, J. Wu, X. Gao, H. Zhou, Z. Zhang, N. Huang, X. Yang, Y. Zhang, S. Shen, T. Kang, Z. Liu, F. Xiao, H. Yao, J. Xu, C. Yan, N. Zhang, Cell Res. 2023, 33, 30.

a) J. S. Kesner, Z. Chen, P. Shi, A. O. Aparicio, M. R. Murphy, Y. Guo, A. Trehan, J. E. Lipponen, Y. Recinos, N. Myeku, X. Wu, Nature 2023, 617, 395;

b) J. Chen, A. D. Brunner, J. Z. Cogan, J. K. Nunez, A. P. Fields, B. Adamson, D. N. Itzhak, J. Y. Li, M. Mann, M. D. Leonetti, J. S. Weissman, Science 2020, 367, 1140;

c) P. Patraquim, E. G. Magny, J. I. Pueyo, A. I. Platero, J. P. Couso, Nat. Commun. 2022, 13, 6515;

d) J. Bazin, K. Baerenfaller, S. J. Gosai, B. D. Gregory, M. Crespi, J. Bailey‐Serres, Proc. Natl. Acad. Sci. USA 2017, 114, [eLocator: E10018];

e) C. Chong, M. Müller, H. Pak, D. Harnett, F. Huber, D. Grun, M. Leleu, A. Auger, M. Arnaud, B. J. Stevenson, J. Michaux, I. Bilic, A. Hirsekorn, L. Calviello, L. Simó‐Riudalbas, E. Planet, J. Lubiński, M. Bryśkiewicz, M. Wiznerowicz, M. Bassani‐Sternberg, Nat. Commun. 2020, 11, 1293;

f) P. Zhang, D. He, Y. Xu, J. Hou, B. F. Pan, Y. Wang, T. Liu, C. M. Davis, E. A. Ehli, L. Tan, F. Zhou, J. Hu, Y. Yu, X. Chen, T. M. Nguyen, J. M. Rosen, D. H. Hawke, Z. Ji, Y. Chen, Nat. Commun. 2017, 8, 1749.

a) Y. Huang, J. Wang, Y. Zhao, H. Wang, T. Liu, Y. Li, T. Cui, W. Li, Y. Feng, J. Luo, J. Gong, L. Ning, Y. Zhang, D. Wang, Y. Zhang, Nucleic Acids Res. 2021, 49, D65;

b) S. W. Choi, H. W. Kim, J. W. Nam, Brief Bioinform 2019, 20, 1853;

c) C. C. R. Hartford, A. Lal, Mol. Cell. Biol. 2020, 40, [eLocator: e00528].

W. Barczak, S. M. Carr, G. Liu, S. Munro, A. Nicastri, L. N. Lee, C. Hutchings, N. Ternette, P. Klenerman, A. Kanapin, A. Samsonova, N. B. La Thangue, Nat. Commun. 2023, 14, 1078.

H. Liu, X. Zhou, M. Yuan, S. Zhou, Y. E. Huang, F. Hou, X. Song, L. Wang, W. Jiang, J. Mol. Biol. 2020, 432, 3364.

M. P. Dragomir, G. C. Manyam, L. F. Ott, L. Berland, E. Knutsen, C. Ivan, L. Lipovich, B. M. Broom, G. A. Calin, Noncoding RNAs 2020, 6, 41.

B. Zhou, B. Ji, K. Liu, G. Hu, F. Wang, Q. Chen, R. Yu, P. Huang, J. Ren, C. Guo, H. Zhao, H. Zhang, D. Zhao, Z. Li, Q. Zeng, J. Yu, Y. Bian, Z. Cao, S. Xu, Y. Yang, Y. Zhou, J. Wang, Nucleic Acids Res. 2021, 49, D86.

T. Liu, J. Wu, Y. Wu, W. Hu, Z. Fang, Z. Wang, C. Jiang, S. Li, Front Cell Dev Biol 2022, 10, [eLocator: 795084].

a) J. R. Prensner, O. M. Enache, V. Luria, K. Krug, K. R. Clauser, J. M. Dempster, A. Karger, L. Wang, K. Stumbraite, V. M. Wang, G. Botta, N. J. Lyons, A. Goodale, Z. Kalani, B. Fritchman, A. Brown, D. Alan, T. Green, X. Yang, J. D. Jaffe, J. A. Roth, F. Piccioni, M. W. Kirschner, Z. Ji, D. E. Root, T. R. Golub, Nat. Biotechnol. 2021, 39, 697;

b) Q. Zhang, E. Wu, Y. Tang, T. Cai, L. Zhang, J. Wang, Y. Hao, B. Zhang, Y. Zhou, X. Guo, J. Luo, R. Chen, F. Yang, Mol. Cell. Proteomics 2021, 20, [eLocator: 100109].

L. Fang, Q. Chen, C. H. Wei, Z. Lu, K. Wang, (Preprint) arXiv:2302.01588, v1, submitted: Feb 2023.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, Adv. Neural Inform. Proc. Syst. 2022, 35, [eLocator: 27730].

a) S. Zhu, J. Z. Wang, D. Chen, Y. T. He, N. Meng, M. Chen, R. X. Lu, X. H. Chen, X. L. Zhang, G. R. Yan, Nat. Commun. 2020, 11, 1685;

b) M. Zhang, K. Zhao, X. Xu, Y. Yang, S. Yan, P. Wei, H. Liu, J. Xu, F. Xiao, H. Zhou, X. Yang, N. Huang, J. Liu, K. He, K. Xie, G. Zhang, S. Huang, N. Zhang, Nat. Commun. 2018, 9, 4475;

c) L. Sun, W. Wang, C. Han, W. Huang, Y. Sun, K. Fang, Z. Zeng, Q. Yang, Q. Pan, T. Chen, X. Luo, Y. Chen, Mol Cell 2021, 81, 4493;

d) B. Zhou, Y. Wu, P. Cheng, C. Wu, Mol. Oncol. 2023, 17, 1419.

K. A. Hoadley, C. Yau, T. Hinoue, D. M. Wolf, A. J. Lazar, E. Drill, R. Shen, A. M. Taylor, A. D. Cherniack, V. Thorsson, R. Akbani, R. Bowlby, C. K. Wong, M. Wiznerowicz, F. Sanchez‐Vega, A. G. Robertson, B. G. Schneider, M. S. Lawrence, H. Noushmehr, T. M. Malta, The Cancer Genome Atlas Network, J. M. Stuart, C. C. Benz, P. W. Laird, Cell 2018, 173, 291.

a) H. Fu, T. Wang, X. Kong, K. Yan, Y. Yang, J. Cao, Y. Yuan, N. Wang, K. Kee, Z. J. Lu, Q. Xi, Nat. Commun. 2022, 13, 3984;

b) E. Senís, M. Esgleas, S. Najas, V. Jiménez‐Sábado, C. Bertani, M. Giménez‐Alejandre, A. Escriche, J. Ruiz‐Orera, M. Hergueta‐Redondo, M. Jiménez, A. Giralt, P. Nuciforo, M. M. Albà, H. Peinado, D. Del Toro, L. Hove‐Madsen, M. Götz, M. Abad, Front Cell Dev Biol 2021, 9, [eLocator: 747667].

The UniProt Consortium, Nucleic Acids Res. 2023, 51, D523.

B. W. Wright, Z. Yi, J. S. Weissman, J. Chen, Trends Cell Biol. 2022, 32, 243.

a) L. Ho, S. Y. Tan, S. Wee, Y. Wu, S. J. Tan, N. B. Ramakrishna, S. C. Chng, S. Nama, I. Szczerbinska, Y. S. Chan, S. Avery, N. Tsuneyoshi, H. H. Ng, J. Gunaratne, N. R. Dunn, B. Reversade, Cell Stem Cell 2015, 17, 435;

b) R. Jackson, L. Kroehling, A. Khitun, W. Bailis, A. Jarret, A. G. York, O. M. Khan, J. R. Brewer, M. H. Skadow, C. Duizer, C. C. D. Harman, L. Chang, P. Bielecki, A. G. Solis, H. R. Steach, S. Slavoff, R. A. Flavell, Nature 2018, 564, 434;

c) Q. Zhang, A. A. Vashisht, J. O'Rourke, S. Y. Corbel, R. Moran, A. Romero, L. Miraglia, J. Zhang, E. Durrant, C. Schmedt, S. C. Sampath, S. C. Sampath, Nat. Commun. 2017, 8, [eLocator: 15664].

H. G. Budayeva, D. S. Kirkpatrick, Nat. Rev. Drug Discovery 2020, 19, 414.

F. Cunningham, J. E. Allen, J. Allen, J. Alvarez‐Jarreta, M. R. Amode, I. M. Armean, O. Austine‐Orimoloye, A. G. Azov, I. Barnes, R. Bennett, A. Berry, J. Bhai, A. Bignell, K. Billis, S. Boddu, L. Brooks, M. Charkhchi, C. Cummins, L. Da Rin Fioretto, C. Davidson, K. Dodiya, S. Donaldson, B. El Houdaigui, T. El Naboulsi, R. Fatima, C. G. Giron, T. Genez, J. G. Martinez, C. Guijarro‐Clarke, A. Gymer, et al., Nucleic Acids Res. 2022, 50, D988.

G. R. Brown, V. Hem, K. S. Katz, M. Ovetsky, C. Wallin, O. Ermolaeva, I. Tolstoy, T. Tatusova, K. D. Pruitt, D. R. Maglott, T. D. Murphy, Nucleic Acids Res. 2015, 43, D36.

The RNAcentral Consortium, Nucleic Acids Res. 2019, 47, [eLocator: D1250].

L. Zhao, J. Wang, Y. Li, T. Song, Y. Wu, S. Fang, D. Bu, H. Li, L. Sun, D. Pei, Y. Zheng, J. Huang, M. Xu, R. Chen, Y. Zhao, S. He, Nucleic Acids Res. 2021, 49, D165.

P. J. Volders, J. Anckaert, K. Verheggen, J. Nuytens, L. Martens, P. Mestdagh, J. Vandesompele, Nucleic Acids Res. 2019, 47, D135.

a) X. Luo, Y. Huang, H. Li, Y. Luo, Z. Zuo, J. Ren, Y. Xie, Nucleic Acids Res. 2022, 50, [eLocator: D1373];

b) Z. Li, L. Liu, C. Feng, Y. Qin, J. Xiao, Z. Zhang, L. Ma, Nucleic Acids Res. 2023, 51, D186;

c) G. Zhang, C. Song, S. Fan, M. Yin, X. Wang, Y. Zhang, X. Huang, Y. Li, D. Shang, C. Li, Q. Wang, Nucleic Acids Res. 2024, 52, D919.

M. J. Goldman, B. Craft, M. Hastie, K. Repecka, F. McDade, A. Kamath, A. Banerjee, Y. Luo, D. Rogers, A. N. Brooks, J. Zhu, D. Haussler, Nat. Biotechnol. 2020, 38, 675.

Y. Hao, S. Hao, E. Andersen‐Nissen, W. M. Mauck, 3rd, S. Zheng, A. Butler, M. J. Lee, A. J. Wilk, C. Darby, M. Zager, P. Hoffman, M. Stoeckius, E. Papalexi, E. P. Mimitou, J. Jain, A. Srivastava, T. Stuart, L. M. Fleming, B. Yeung, A. J. Rogers, J. M. McElrath, C. A. Blish, R. Gottardo, P. Smibert, R. Satija, Cell 2021, 184, 3573.

A. Liberzon, C. Birger, H. Thorvaldsdottir, M. Ghandi, J. P. Mesirov, P. Tamayo, Cell Syst 2015, 1, 417.

M. Wang, X. Liu, G. Chang, Y. Chen, G. An, L. Yan, S. Gao, Y. Xu, Y. Cui, J. Dong, Y. Chen, X. Fan, Y. Hu, K. Song, X. Zhu, Y. Gao, Z. Yao, S. Bian, Y. Hou, J. Lu, R. Wang, Y. Fan, Y. Lian, W. Tang, Y. Wang, J. Liu, L. Zhao, L. Wang, Z. Liu, R. Yuan, et al., Cell Stem Cell 2018, 23, 599.

The UniProt Consortium, Nucleic Acids Res. 2023, 51, D523.

The Gene Ontology Consortium, Genetics 2023, 224, [eLocator: iyad031].

S. Sayols, MicroPubl Biol 2023, [DOI: https://dx.doi.org/10.17912/micropub.biology.000811].

Y. Zhang, T. Liu, X. Hu, M. Wang, J. Wang, B. Zou, P. Tan, T. Cui, Y. Dou, L. Ning, Y. Huang, S. Rao, D. Wang, X. Zhao, Nucleic Acids Res. 2021, 49, 8520.

S. Aibar, C. B. Gonzalez‐Blas, T. Moerman, V. A. Huynh‐Thu, H. Imrichova, G. Hulselmans, F. Rambow, J. C. Marine, P. Geurts, J. Aerts, J. van den Oord, Z. K. Atak, J. Wouters, S. Aerts, Nat. Methods 2017, 14, 1083.

M. Kanehisa, M. Furumichi, Y. Sato, M. Kawashima, M. Ishiguro‐Watanabe, Nucleic Acids Res. 2023, 51, D587.

T. Liu, B. Zou, M. He, Y. Hu, Y. Dou, T. Cui, P. Tan, S. Li, S. Rao, Y. Huang, S. Liu, K. Cai, D. Wang, Brief Bioinform 2023, 24, [eLocator: bbac579].

Z. D. Su, Y. Huang, Z. Y. Zhang, Y. W. Zhao, D. Wang, W. Chen, K. C. Chou, H. Lin, Bioinformatics 2018, 34, 4196.

X. Qiu, Q. Mao, Y. Tang, L. Wang, R. Chawla, H. A. Pliner, C. Trapnell, Nat. Methods 2017, 14, 979.

© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Recent studies have revealed that numerous lncRNAs can be translated into proteins under specific conditions and perform diverse biological functions; these are termed coding lncRNAs. Their comprehensive landscape, however, remains elusive due to this field's preliminary and dispersed nature. This study introduces codLncScape, a framework for coding lncRNA exploration consisting of codLncDB, codLncFlow, codLncWeb, and codLncNLP. Specifically, it contains a manually compiled knowledge base, codLncDB, encompassing 353 coding lncRNA entries validated by experiments. Building upon codLncDB, codLncFlow investigates the expression characteristics of these lncRNAs and their diagnostic potential in the pan‐cancer context, alongside their association with spermatogenesis. Furthermore, codLncWeb serves as a platform for storing, browsing, and accessing knowledge concerning coding lncRNAs within various programming environments. Finally, codLncNLP serves as a knowledge‐mining tool to support timely content inclusion and updates within codLncDB. In summary, this study offers a well‐functioning, content‐rich ecosystem for coding lncRNA research, aiming to accelerate systematic studies in this field.

Details

Title

CodLncScape Provides a Self‐Enriching Framework for the Systematic Collection and Exploration of Coding LncRNAs

Author

Liu, Tianyuan¹; Qiao, Huiyuan²; Wang, Zixu³; Yang, Xinyan⁴; Pan, Xianrun²; Yang, Yu⁵; Ye, Xiucai⁶; Sakurai, Tetsuya⁶; Lin, Hao⁷; Zhang, Yang²

¹ Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba, Japan
² Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
³ Department of Computer Science, University of Tsukuba, Tsukuba, Japan
⁴ Department of Developmental Biology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
⁵ School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
⁶ Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba, Japan; Department of Computer Science, University of Tsukuba, Tsukuba, Japan
⁷ School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China

Section

Research Articles

Publication year

2024

Publication date

Jun 1, 2024

Publisher

John Wiley & Sons, Inc.

e-ISSN

21983844

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/advs.202400009

ProQuest document ID

3066259245
