Improved circulating tumor DNA identification for

Introduction

DNA methylation alterations are one of the hallmarks of many cancers and are known to occur early during carcinogenesis, making them promising biomarkers for early-stage cancer detection^1,2. DNA carrying cancer-specific methylation aberrations from tumor cells can be detected in plasma cfDNA³. Detection of aberrantly methylated circulating tumor DNA (ctDNA) in plasma has been successful in diagnosing various cancers^4–10. Most DNA methylation detection methods are based on bisulfite treatment, and whole-genome bisulfite sequencing (WGBS) is considered the gold standard for methylation detection at single-base resolution. However, bisulfite treatment requires harsh chemical conditions, namely prolonged exposure to low pH, high temperature, and high salt concentration, which lead to DNA damage and DNA loss. This potentially reduces the accuracy and sensitivity of cancer detection, especially for early-stage cancer, in which plasma contains extremely low quantities of ctDNA. Recently, enzyme-based methylation sequencing has been reported not to damage DNA¹¹. However, few studies have used this enzyme-based method to detect methylation in cfDNA for cancer diagnosis.

Esophageal squamous cell carcinoma (ESCC) is a worldwide healthcare concern. A notable characteristic of ESCC is the absence of obvious symptoms in the early stage. Patients are therefore frequently diagnosed only after reaching an advanced stage^12,13, which significantly reduces treatment effectiveness. The 5-year overall survival rate for patients with advanced disease is approximately 25–30% after comprehensive treatment, whereas the 5-year survival rate for early-stage ESCC patients treated surgically may reach 70%¹⁴. Early diagnosis and treatment of ESCC are therefore of utmost clinical significance.

DNA methylation changes have been reported during the carcinogenesis of ESCC, and cfDNA methylation has been shown to be useful in ESCC diagnosis^15–18. However, all of those studies relied on bisulfite treatment, which potentially causes loss of important CpG sites and of ctDNA. Diagnosing ESCC at an early stage is difficult because of the relatively low ctDNA content in plasma, so preserving as much ctDNA as possible should improve the detection rate of early ESCC. In this study, we used an enzymatic method to profile whole-genome methylation in ESCC tissue and plasma.

Neural networks have been shown to identify ctDNA accurately, enabling ultrasensitive and robust cancer detection^11,19. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, are commonly used because RNNs are well suited to sequence processing and CNNs are effective at capturing local DNA features^19,20. The combination of RNN and CNN has proven effective and robust in identifying ctDNA¹⁹. However, an RNN propagates information recursively and gradually loses earlier information, which limits its ability to integrate upstream and downstream information across long stretches of DNA. To address this limitation in modeling long-distance dependencies and to improve ctDNA identification, we developed a neural network named BCNN, a hybrid of BERT (Bidirectional Encoder Representations from Transformers) and CNN, to identify ctDNA for ESCC diagnosis. BERT employs a self-attention mechanism that directly captures dependencies between any two positions within a sequence, allowing the model to capture long-range dependencies without encountering the vanishing or exploding gradient problem. The performance of BCNN in ctDNA identification and ESCC detection was compared with that of traditional models, including CNN, RNNs (GRU, LSTM), and hybrids of CNN with RNNs.

Materials and methods

Tissue and plasma sample collection

In this study, ESCC subjects were recruited from the Department of Thoracic Surgery in the First Affiliated Hospital of Soochow University. In total, 50 human ESCC tissues and para-tumor tissues and 35 ESCC plasma samples were collected. A further 43 plasma samples and the other samples used in this study were taken from a previous study¹¹. All ESCC patients included in this study had a confirmed clinical diagnosis and had not undergone chemotherapy or radiotherapy prior to surgery. Tumor and para-tumor tissues were collected during surgical resection. The para-tumor tissues were sampled at a distance of 2 cm from the tumor margin to minimize tumor cell contamination. Peripheral blood from ESCC individuals was collected before surgical resection. Approximately 10 ml of peripheral blood was collected from each participant using a Cell-Free DNA BCT tube (Streck). The clinical data and demographic characteristics of the subjects were obtained from medical records. All procedures were approved by the ethics committees of the hospitals, and written informed consent was obtained from all participants. Detailed statistics on the subjects are summarized in supplementary Table S1.

Genomic DNA (gDNA) and plasma cfDNA isolation and library preparation

Collected blood was centrifuged at 1,600 g for 10 min at 4 °C, and the resulting plasma was centrifuged at 16,000 g for 10 min at 4 °C to remove cell debris. Plasma cfDNA and tissue gDNA were extracted using a blood/tissue DNA magnetic bead extraction kit (GeneOn Biotech) according to the manufacturer’s protocol. The Qubit dsDNA HS Assay (Thermo Fisher Scientific) was used to check DNA quality and concentration. The extracted cfDNA and gDNA were stored at −80 °C until use.

Libraries were prepared using an enzyme-based method called NEEM-seq¹¹. Briefly, a Covaris instrument was used to acoustically shear gDNA to an average size of 200–280 bp (peak approximately 250 bp). Unmethylated lambda DNA and fully methylated pUC19 DNA were added to each sample as internal controls. Sheared gDNA or cfDNA was subjected to enzymatic conversion using the Enzymatic Methyl-seq Conversion Module (NEB, E7125S) according to the manufacturer’s protocol. The enzymatically converted DNA was then processed with the Accel-NGS Methyl-Seq DNA Library Kit (IDT) for single-strand DNA library preparation according to the manufacturer’s protocol.

The comparison of methylation detection between enzyme-based and bisulfite-based methods

To compare enzyme-based and bisulfite-based methylation detection methods, we analyzed publicly available sequencing data from our previous study¹¹. Briefly, we used two samples: a human reference gDNA sample (NA12878, Coriell) and a human cfDNA sample, with input amounts of 40 ng (gDNA) and 15 ng (cfDNA), respectively. The same input DNA quantities were used for enzyme-based and bisulfite-based library preparation. The same input DNA was converted using the Enzymatic Methyl-seq Conversion Module (NEB, E7125S) and the EZ DNA Methylation-Lightning Kit (Zymo), respectively. Both the enzymatically converted DNA and the bisulfite-converted DNA were then processed with the Accel-NGS Methyl-Seq DNA Library Kit (IDT) for single-strand DNA library preparation. Both libraries were sequenced to a depth of 10× coverage. To minimize batch effects, library preparation and sequencing for both methods were performed in parallel by the same technician on the same sequencing run.

Sequencing data processing and DMR identification of ESCC

Libraries were sequenced with 150 bp paired-end reads on NovaSeq 6000 sequencers (Illumina). Fastp 0.20.1²¹ was used with default parameters to filter low-quality reads from the raw data and to trim adapters. Following the manual of the Swift library construction kit, an additional 15 bases from the end of read 1 and 15 bases from the beginning of read 2 were trimmed after adapter trimming to eliminate the majority of the tail sequence. Five bases were also trimmed from the beginning of read 1 and from the end of read 2. Reads shorter than 36 bp after trimming were removed. Clean reads were mapped to the human reference genome (hg38) with Bismark 0.23.0²². Reads that mapped to two or more regions of the genome were removed, and only uniquely mapped reads were retained. Bismark was also used to identify and remove PCR duplicates and then to extract the methylation status of each site.
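The fixed-length trimming and length filter described above can be illustrated with a minimal sketch. The function name `trim_pair`, the use of plain strings for reads, and the choice to drop the whole pair when either mate is too short are assumptions made for illustration; the actual pipeline relied on fastp, and base qualities would have to be trimmed identically.

```python
MIN_LEN = 36  # reads shorter than this after trimming are discarded


def trim_pair(read1, read2, head=5, tail=15):
    """Trim a read pair following the description in the text.

    Read 1: drop `head` bases from the start and `tail` bases from the end.
    Read 2: drop `tail` bases from the start and `head` bases from the end.
    Returns None if either trimmed read is shorter than MIN_LEN
    (dropping the whole pair is an assumption of this sketch).
    """
    r1 = read1[head:max(head, len(read1) - tail)]
    r2 = read2[tail:max(tail, len(read2) - head)]
    if len(r1) < MIN_LEN or len(r2) < MIN_LEN:
        return None
    return r1, r2
```

With 150 bp reads, each mate keeps 130 bases after both trims.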

The R package methylKit 1.18²³ was used to analyze differential methylation between tumor tissues and para-tumor tissues, using 25 pairs of tumor and para-tumor tissues. Differentially methylated CpG sites (DMCs) were defined using the following criteria: (1) an FDR-adjusted p-value (q-value) < 0.01 and (2) a methylation rate difference > 25%. Differentially methylated regions (DMRs) were then defined by merging contiguous DMCs. Only DMRs satisfying the following two conditions were retained: (1) the distance between adjacent DMCs does not exceed 300 bp; (2) the region contains a minimum of five DMCs.
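The merging rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `merge_dmcs` is hypothetical, it works on the DMC coordinates of a single chromosome, and it ignores the direction of the methylation change (whether hypo- and hyper-methylated DMCs were merged separately is not stated in the text).

```python
def merge_dmcs(positions, max_gap=300, min_dmcs=5):
    """Merge DMC positions on one chromosome into candidate DMRs.

    Consecutive DMCs no more than `max_gap` bp apart are joined, and
    regions with fewer than `min_dmcs` DMCs are dropped.
    Returns a list of (start, end, n_dmcs) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the region being built.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_dmcs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_dmcs:
        regions.append((current[0], current[-1], len(current)))
    return regions
```

For example, DMCs at 100, 200, 300, 400 and 500 form one region, while two isolated DMCs at 2000 and 2100 are discarded because they fall short of five sites.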

Further screening of DMRs

The number of reads retained within each DMR in tumor tissue varied after data noise reduction. We therefore defined an indicator called the DMR Universality Score (DUS) to select DMRs that retain a high proportion of reads in a large number of individuals. The DUS of a given DMR was defined as follows²⁴:

where n represents the total number of individuals; t is the ratio of the read count after filtration to the total read count before filtration in individual i; and d represents the proportion of individuals with t > 0 among all individuals. The DUS value ranges from 0 to 1. If no reads remain after filtration in any individual, the DUS equals 0. Conversely, if no reads are filtered in any individual, the DUS equals 1. The DMRs were ranked by DUS value in descending order, and the top 10,000 hypo-methylated DMRs (hypo-DMRs) and all 1,967 hyper-methylated DMRs (hyper-DMRs) with a DUS value > 0 were selected for subsequent analysis. The optimal parameters were determined in preliminary experiments.

Gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathway enrichment analyses of DMRs

To check whether the identified DMRs are involved in the occurrence and development of ESCC, the genes associated with the top DMRs (top 10,000 hypo-methylated DMRs and 1,967 hyper-methylated DMRs) were subjected to pathway enrichment analysis. Genes for which the region 1 kb upstream and downstream of the transcription start site (TSS) overlapped a DMR were defined as DMR-related genes. Gene Ontology (GO) analysis, covering the three categories of molecular functions (MF), cellular components (CC), and biological processes (BP), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis were performed using the clusterProfiler R package, which facilitates biological term classification and gene cluster enrichment^25–29.

Data cleaning for neural network training

Besides esophageal epithelial cells, resected ESCC tumor tissue usually contains some non-tumor cells (e.g. various types of immune cells). If all reads from tumor tissues were labeled as tumor for model training, reads from non-tumor cells would be mislabeled, which could confuse the neural network and impair training. To reduce this noise, cfDNA from healthy individuals was used to filter non-tumor reads out of the ESCC tumor tissue data. The Methylation Continuity Score (MCS) of a given read was calculated by the following formula²⁴ and used as the reference for filtering:

In the above formula, ‘L’ represents the number of CpGs within the read; ‘i’ denotes a block consisting of i consecutive methylated CpGs in the read; and ‘ni’ is the number of such blocks in the read.

The MCS value ranges from 0 to 1. A higher MCS value indicates a higher methylation level of the read and a more continuous distribution of methylated CpGs, that is, fewer interruptions by unmethylated CpGs within the read.

For each DMR, we calculated the MCS value of each read from tumor tissue DNA and from the cfDNA of healthy individuals. Only reads containing three or more DMCs were used. S_max and S_min denote the maximum and minimum MCS values, respectively, observed across all reads from the cfDNA of healthy individuals. Reads were then filtered according to the methylation status of the DMR: (1) in hypo-methylated DMRs, tumor DNA reads with MCS values ≥ S_min were excluded; (2) in hyper-methylated DMRs, tumor DNA reads with MCS values ≤ S_max were removed (see Fig. S1 for a schematic representation). This filtering removed the tumor tissue reads that were indistinguishable from normal cfDNA; about 53% of reads from tumor tissues were discarded. For DMRs longer than 150 bp, sliding windows of 150 bp with a step size of 50 bp were used, and all of the above filtering steps were performed on the reads in each sliding window.
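The window layout and the threshold rule can be sketched in a few lines. This is an illustration under stated assumptions: MCS values are taken as precomputed numbers, the function names are hypothetical, the last window is simply clipped at the DMR end, and a window with no healthy reads leaves the tumor reads unfiltered. None of these details is specified in the text.

```python
def sliding_windows(start, end, width=150, step=50):
    """Return (w_start, w_end) windows tiling the DMR [start, end).

    DMRs no longer than `width` yield a single window. Clipping the last
    window at `end` is an assumption of this sketch.
    """
    windows = []
    w = start
    while True:
        windows.append((w, min(w + width, end)))
        if w + width >= end:
            break
        w += step
    return windows


def filter_tumor_mcs(tumor_mcs, healthy_mcs, dmr_type):
    """Keep tumor-read MCS values lying outside the healthy cfDNA range.

    hypo-DMR:  discard tumor reads with MCS >= S_min (keep MCS < S_min).
    hyper-DMR: discard tumor reads with MCS <= S_max (keep MCS > S_max).
    """
    if not healthy_mcs:  # assumption: nothing to compare against
        return list(tumor_mcs)
    s_min, s_max = min(healthy_mcs), max(healthy_mcs)
    if dmr_type == "hypo":
        return [s for s in tumor_mcs if s < s_min]
    if dmr_type == "hyper":
        return [s for s in tumor_mcs if s > s_max]
    raise ValueError("dmr_type must be 'hypo' or 'hyper'")
```

A 300 bp DMR starting at position 0 gives the windows (0, 150), (50, 200), (100, 250) and (150, 300), and the filter is applied to the reads of each window separately.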

Neural architecture of BCNN and model training

The architecture of BCNN comprises a pretrained BERT module and a CNN module. The BERT module, identical to BERT-base, consists of an embedding layer and 12 transformer layers. The embedding layer combines token embedding with position embedding to map each token to a fixed-length real-valued vector. These vectors dynamically capture the semantic information of the input tokens. The hidden size of the token embedding is set to 768, consistent with BERT-base. The dimension of the position embedding is (n, 768), where n represents the maximum sequence length processed by BERT. In this paper, n is set to 300. Only the encoder of the transformer is used in BERT, comprising multi-head self-attention mechanisms and feed-forward neural networks. The multi-head self-attention mechanism computes multiple attention heads concurrently and merges their outputs to obtain the final hidden state. This allows multiple attention representations to be learned in parallel, which mitigates the model’s tendency to focus excessively on individual positions and enhances its ability to capture global information at different granularities. The number of attention heads is 12, consistent with BERT-base. The CNN module is appended after the final layer of BERT and comprises a convolutional layer, a flattening layer, and a fully connected layer. Note that the CNN module participates in training only during the fine-tuning stage.

The training process of BCNN comprises two stages: pre-training and fine-tuning. Only the BERT module of BCNN was used for pre-training, and the pre-training process was similar to that of the previous study¹¹. Briefly, BERT took DNA reads as input. To capture the differences between methylated and unmethylated bases in DNA sequences, we introduced the “ML” base in addition to the standard bases “ATCG”. “ML” represents a methylated “CG” (CpG site). We tokenized DNA sequences with the k-mer approach commonly used in bioinformatics, with k set to 5. Each token of the input sequence therefore consisted of 5 consecutive bases, providing rich contextual information about the DNA sequence. Furthermore, as in BERT, a special token “[CLS]” was added at the first position of each sequence to represent the global information of the DNA sequence. The pre-training task for BCNN was a masked language model (MLM) task. Because the proportion of methylated CpG sites in the genome is low, 80% of the tokens containing “ML” were masked in the MLM task. Masking a larger share of “ML” tokens helps the model learn the context of methylated CpG sites from the surrounding DNA sequence. About 10 million NEEM-seq reads from the cfDNA samples of two individuals were used in the pre-training phase. Only reads with mapping quality greater than 30 were used. Overlapping paired-end reads were merged, and after merging only reads containing three or more DMCs were used for further training. The batch size was set to 32 (32 sequences × 200 tokens = 6,400 tokens per batch) for a total of 3,437,500 training steps. The Adam optimizer was used with a learning rate of 1e-4, β1 of 0.9, β2 of 0.999, and L2 weight decay of 0.01. The activation function was GELU, and a dropout probability of 0.1 was applied across all layers to enhance the generalization capability of BCNN. The loss function for pre-training was the mean cross-entropy loss for predicting masked tokens.
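The recoding and tokenization steps can be illustrated with a short sketch. Several details are assumptions, since the text does not specify them: the two letters “M” and “L” replace the “C” and “G” of a methylated CpG in place, k-mers overlap with a stride of one base, methylated positions are given as 0-based indices of the cytosine, and the function names are hypothetical.

```python
def recode_methylated(seq, methylated_positions):
    """Replace each methylated 'CG' dinucleotide with the symbols 'ML'.

    `methylated_positions` holds 0-based indices of the C in each
    methylated CpG (an assumption about the input format).
    """
    bases = list(seq)
    for pos in methylated_positions:
        if seq[pos:pos + 2] != "CG":
            raise ValueError(f"position {pos} is not a CpG")
        bases[pos], bases[pos + 1] = "M", "L"
    return "".join(bases)


def kmer_tokens(seq, k=5):
    """Split a sequence into overlapping k-mers, prefixed with [CLS]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return ["[CLS]"] + kmers
```

For the toy sequence `ACGTCGA` with a methylated CpG at index 1, recoding gives `AMLTCGA`, which is tokenized into `[CLS]`, `AMLTC`, `MLTCG` and `LTCGA`.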

Fine-tuning of BCNN was a binary classification task that distinguishes tumor reads from non-tumor reads. Tumor reads came from the DNA of ESCC tumor tissue after noise reduction (labeled as “1”), and non-tumor reads came from non-tumor cfDNA (labeled as “0”). In total, 0.8 million tumor reads and 1.1 million non-tumor reads were used for BCNN fine-tuning. The labeled dataset was partitioned into a training set of 1.8 million DNA reads and a testing set of 0.1 million DNA reads. Cross-entropy was used as the loss function to optimize the model parameters. All models were trained on Tesla V100 GPUs.

Model evaluation

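Accuracy, F1 score, Matthews correlation coefficient (MCC), and the precision-recall (PR) curve were used to evaluate the performance of the models in distinguishing tumor reads from non-tumor reads. Using the standard definitions, the metrics are calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{n}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$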

Here, “n” represents the total number of reads used to evaluate model performance. True positives (TP) and true negatives (TN) are the numbers of reads that the model correctly assigns to the positive class (tumor read, labeled as “1”) and the negative class (non-tumor read, labeled as “0”), respectively. False positives (FP) and false negatives (FN) are the numbers of non-tumor reads falsely predicted as positive and of tumor reads falsely predicted as negative, respectively.

Estimating the fraction of ctDNA to calculate individual’s cancer risk for ESCC diagnosis

The probability output by the neural network for a read was taken as the probability that the read originates from cancer tissue. Inspired by CancerDetector³⁰, we inferred the proportion of ctDNA (reads from cancer tissue) in plasma by maximum posterior probability estimation. For a given individual, let t be the fraction of ctDNA in plasma; we search for the value of t that maximizes the following formula:

where p is the probability predicted by the neural network that a read is derived from ctDNA; t is the fraction of ctDNA in plasma; and n is the number of reads. The search is performed over the interval from 0 to 1 with a step size of 0.001, and the maximizing t value is taken as the final risk score (RS). The RS ranges from 0 to 1; the higher the RS, the higher the individual’s cancer risk.
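The grid search can be sketched as below. The objective used here is an assumption: a two-component mixture log-likelihood in the spirit of CancerDetector, sum over reads of log(t·p + (1 − t)·(1 − p)), with a flat prior on t so that the posterior maximum coincides with the likelihood maximum. The exact formula used in this study may differ, and the function name `risk_score` is hypothetical.

```python
import math


def risk_score(probs, step=0.001, eps=1e-12):
    """Grid-search the ctDNA fraction t in [0, 1].

    Assumed objective (not confirmed by the text):
        sum_i log(t * p_i + (1 - t) * (1 - p_i))
    where p_i is the predicted tumor probability of read i.
    Returns the t on the grid with the largest objective value.
    """
    n_steps = round(1 / step)
    best_t, best_ll = 0.0, -math.inf
    for k in range(n_steps + 1):
        t = k * step
        # eps guards against log(0) when a term vanishes at t = 0 or 1.
        ll = sum(math.log(max(t * p + (1 - t) * (1 - p), eps)) for p in probs)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t
```

Under this assumed objective, a sample whose reads all score 0.9 yields a risk score of 1, one whose reads all score 0.1 yields 0, and an even mixture of the two yields 0.5.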

To evaluate the diagnostic performance of the models, the individuals in the validation cohort were randomly split into four groups of roughly equal size (four folds, Fig. S2). One fold was first chosen to calculate risk scores and to search for the best risk score threshold. The remaining three folds were then combined to serve as a final independent validation cohort for evaluating ESCC diagnostic performance in subgroups defined by TNM stage³¹, tumor size, and lymphatic metastasis. This process was repeated four times, each time with a different fold used for threshold selection. The AUC was also calculated on the three remaining folds in each repetition.
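The fold rotation can be written down compactly. The sketch below only produces the splits; the function name `four_fold_rounds`, the fixed random seed, and the unstratified shuffle are assumptions, since the text does not say whether cases and controls were balanced across folds.

```python
import random


def four_fold_rounds(sample_ids, n_folds=4, seed=0):
    """Return a list of (threshold_fold, validation_set) pairs.

    In each round one fold is used to choose the risk-score threshold
    and the remaining folds are pooled as the validation set.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    rounds = []
    for i, threshold_fold in enumerate(folds):
        validation = [s for j, fold in enumerate(folds) if j != i for s in fold]
        rounds.append((threshold_fold, validation))
    return rounds
```

With the 51 individuals of the validation cohort, each round uses 12 or 13 individuals for threshold selection and the remaining 38 or 39 for evaluation.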

Analysis of t-distributed stochastic neighbor embedding (t-SNE)

To visualize the high-dimensional vectors learned by the neural network for the different labels, we extracted them from the hidden state of the layer preceding the classification layer. We then used t-distributed stochastic neighbor embedding (t-SNE)^32,33 to reduce the high-dimensional vectors to two dimensions. t-SNE is a powerful and widely used technique for visualizing high-dimensional data. Its core idea is to convert high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities, and then to find a low-dimensional embedding in which the similarities between points are represented by a similar set of conditional probabilities.

The process of t-SNE involves two main steps. First, it calculates the pairwise similarities between data points in the high-dimensional space using conditional probabilities. This step aims to ensure that similar data points have higher probabilities of being neighbors. Second, t-SNE iteratively optimizes the positions of the points in the low-dimensional space to minimize the Kullback-Leibler divergence between the two distributions. This optimization ensures that the low-dimensional representation maintains the high-dimensional data’s local structure while spreading out the clusters to improve visualization clarity.

Visualizing attention map of the BCNN model

To open the black box of the model, an attention map of BCNN was generated to visualize the regions that contribute most to the model’s decision when distinguishing tumor reads from non-tumor reads. The attention map was visualized in a similar way to the previous study¹¹. Briefly, attention weights were extracted from the last layer of the fine-tuned model, reflecting the importance of each token for read identification. To find which bases are important for tumor read identification, the frequency of each base and of each methylation state in tumor reads and non-tumor reads was calculated and visualized for a selected DMR.

Results

Enzyme-based methyl sequencing preserves more CpG sites and longer DNA fragments than WGBS

To compare enzyme- and bisulfite-based methyl sequencing in parallel, the same samples and the same amount of input DNA were used for the two methods, and all libraries were sequenced to the same depth. As shown in Fig. 1, enzyme-based methyl sequencing yields a lower proportion of short (70–135 bp) cfDNA fragments than WGBS in both gene body and intergenic regions (Fig. 1a). We further calculated the proportion of short cfDNA fragments across the whole genome, with all chromosomes divided into 5-Mb bins. WGBS results in a higher proportion of short cfDNA fragments in all bins (Fig. 1b). Enzymatic methyl sequencing also detects more CpG sites than WGBS across all chromosomes in both cfDNA (Fig. 1c) and gDNA samples (Fig. 1d). Therefore, enzyme-based methyl sequencing preserves more CpG sites and causes less DNA damage than WGBS.
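The per-bin short-fragment proportion can be computed along the following lines. This is a minimal sketch under assumptions: fragments are given as `(chromosome, start, length)` tuples, each fragment is assigned to the bin containing its start coordinate, and the denominator is every fragment in the bin. The function name is hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 5_000_000  # 5-Mb bins


def short_fragment_ratio(fragments, short_range=(70, 135)):
    """Proportion of short fragments per (chromosome, bin index).

    `fragments` is an iterable of (chrom, start, length) tuples.
    Assigning a fragment to a bin by its start position is an
    assumption of this sketch.
    """
    lo, hi = short_range
    totals = defaultdict(int)
    shorts = defaultdict(int)
    for chrom, start, length in fragments:
        key = (chrom, start // BIN_SIZE)
        totals[key] += 1
        if lo <= length <= hi:
            shorts[key] += 1
    return {key: shorts[key] / totals[key] for key in totals}
```

Comparing the resulting dictionaries for the enzymatic and bisulfite libraries bin by bin gives the kind of genome-wide profile described for Fig. 1b.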

Fig. 1 [Images not available. See PDF.]

Comparison of bisulfite- and enzyme-based methyl sequencing. (a) The fragment size of cfDNA using bisulfite- and enzyme-based methyl sequencing. Bisulfite-based methyl sequencing results in an increase in short fragments (70–135 bp). (b) The ratio of short fragments (70–135 bp) in each 5-Mb bin across all chromosomes. (c) The number of CpG sites detected in cfDNA across all chromosomes. (d) The number of CpG sites detected in gDNA across all chromosomes.

Identification of ESCC-associated DMRs using enzyme-based methyl sequencing

Using enzyme-based methyl sequencing, DMRs in ESCC tissues were identified by comparison with para-tumor tissues. As shown in Fig. 2a, after data noise reduction, the top 10,000 hypomethylated DMRs and all 1,967 hypermethylated DMRs were selected for subsequent analysis. Hypomethylated DMRs are mostly located in repeat, intergenic, and intron regions of the genome, while hypermethylated DMRs are mostly located in CpG islands, exons, and promoter regions (Fig. 2b). GO enrichment analysis of biological processes showed that DMR-related genes are mostly involved in the adenylyl cyclase-modulating G-protein coupled receptor signaling pathway, the neuropeptide signaling pathway, signal release, digestive system development, and monoamine transport (Fig. 2c). KEGG enrichment analysis showed that DMR-related genes are mostly involved in neuroactive ligand-receptor interaction, the cAMP signaling pathway, and the calcium signaling pathway (Fig. 2d). We also performed GO and KEGG enrichment analyses separately for hypo-DMR and hyper-DMR related genes. The GO and KEGG enrichment results for hyper-DMR related genes (Fig. S3) were similar to those for all DMR-related genes (Fig. 2c, d). However, no significant enrichment was found for hypo-DMR related genes (significance level set at 0.05).

Fig. 2 [Images not available. See PDF.]

Characterization of differentially methylated regions in ESCC. (a) The number of hypo- and hyper-methylated DMRs. Hypo means that the methylation level in ESCC is lower than in para-tumor tissues, and hyper means that the methylation level in ESCC is higher than in para-tumor tissues. (b) The distribution of hypo- and hyper-methylated DMRs across the whole genome in ESCC. (c) GO enrichment analysis of DMR-related genes, including biological processes (BP), cellular components (CC), and molecular functions (MF). (d) KEGG enrichment analysis of DMR-related genes in ESCC.

Workflow chart of plasma ctDNA identification and ESCC detection by combining ESCC-associated DMR and a hybrid neural network

Neural networks have been shown to identify plasma ctDNA robustly by integrating DNA sequence and methylation information¹⁹. In this study, we developed a neural network named BCNN, a hybrid of a BERT module and a CNN module, to identify plasma ctDNA for ESCC diagnosis. The whole process of ESCC detection is shown in Fig. 3. First, 10 ml of peripheral blood is drawn from the individual to obtain plasma for cfDNA extraction. Low-depth whole-genome NEEM-seq is performed on the cfDNA sample. The DNA reads within the DMRs are selected, recoded, and tokenized. The processed reads are then fed into the fine-tuned BCNN (see Methods), which outputs the probability that each input read is ctDNA. A higher probability indicates that the read is more likely derived from an ESCC tumor, while a lower probability indicates that it is unlikely to be tumor-derived. Finally, the probabilities predicted by BCNN for all plasma cfDNA reads are integrated by maximum posterior probability estimation to obtain a risk score. An individual with a risk score above the threshold is classified as at high risk for ESCC.

Fig. 3 [Images not available. See PDF.]

The whole process of ESCC detection. First, 10 ml of peripheral blood is drawn from the individual for cfDNA extraction. Low-depth whole-genome NEEM-seq is performed on the cfDNA sample, and reads within the DMRs are extracted for prediction. BCNN is used to predict the probability that each read is ctDNA. The methylated CpGs in the DNA sequence are first recoded as ‘ML’, and the sequence is then tokenized into k-mers with k = 5. The tokenized DNA is fed into the BERT module, which encodes its semantic information. The embedding produced by the BERT module is finally passed to the CNN module. The probabilities output by the CNN module for all reads are used to calculate the risk score of each individual by the maximum posterior probability formula.

The composition of training and validation dataset for BCNN pretraining and fine-tuning

Fine-tuning of BCNN builds on the pretrained BERT module (see Methods). The composition of the cohort and the division of the dataset for BCNN pre-training, fine-tuning, and validation are shown in Fig. 4. In the pre-training stage, about 10 million reads from NEEM-seq data are used to pretrain the BERT module of BCNN, which allows BCNN to learn general knowledge of human genome methyl-seq data in advance. The learned knowledge is transferred to the specific downstream task during fine-tuning, in which BCNN distinguishes tumor reads from non-tumor reads. In the fine-tuning cohort, approximately 0.8 million tumor reads and 1.1 million non-tumor reads are used for training. Tumor reads are from 25 ESCC tumor tissues and non-tumor reads are from the cfDNA samples of 23 healthy individuals. All read data from tumor tissue gDNA used for fine-tuning is filtered against the cfDNA of healthy individuals, which removes reads from non-tumor cells in tumor tissues (see “Data cleaning for neural network training” in the Methods section for details). The sequencing depth for tissue gDNA samples and cfDNA samples is about 10×. The validation cohort includes cfDNA samples from 35 ESCC patients and 16 healthy individuals. The sequencing depth of all cfDNA samples in the validation cohort is about 1.6×. The higher depth in the training cohort provides more reads for training the neural network and is conducive to identifying DMRs more accurately.

Fig. 4 [Images not available. See PDF.]

The composition and division of the cohort and dataset. Approximately 10 million whole-genome methylated DNA reads are used to pretrain the BERT module of BCNN. About 1.1 million non-tumor reads are from 23 cfDNA samples (healthy individuals) and 0.8 million tumor reads are from 25 ESCC patients. The generation of tumor and non-tumor reads is described in detail in Methods. In total, 35 ESCC patients and 16 healthy individuals are included in the validation cohort to verify the performance of BCNN. Whole-genome NEEM-seq was performed on all samples.

BCNN better identified ESCC-derived tumor reads than other models

The deep learning models most widely applied in genomic research (LSTM, GRU, CNN + LSTM, CNN + GRU) were compared with BCNN. The performance of the models in identifying ESCC-derived reads was evaluated by several metrics, including accuracy, F1 score, MCC, and the PR curve. As expected, BCNN achieves the best performance in distinguishing tumor reads from non-tumor reads (Fig. 5a, b). To check whether read denoising affects the performance of tumor read identification, we also compared the performance of BCNN on the read datasets before and after noise reduction. BCNN trained on the read dataset after noise reduction showed higher accuracy, F1 score, and MCC (Fig. S4).

A neural network distinguishes tumor reads from non-tumor reads by identifying their different methylation profiles together with the surrounding DNA sequence. The sequence patterns of tumor and non-tumor reads are mapped by the network into high-dimensional vectors, which are then used for classification. Different neural networks differ in their ability to capture the differences in sequence pattern between tumor and non-tumor reads. To visualize the high-dimensional vectors of tumor and non-tumor reads in the different models, we extracted the vectors from the hidden layer of each model and used t-SNE to reduce them to two dimensions. The t-SNE results show that the representations extracted by BCNN for tumor and non-tumor reads form two compact, well-separated clusters (Fig. 5c), indicating that BCNN separates tumor from non-tumor reads clearly. In contrast, the other models show some overlap between tumor and non-tumor reads (Fig. 5d-h), which suggests that some tumor and non-tumor reads cannot be distinguished by those models. To investigate how BCNN distinguishes tumor from non-tumor reads, we extracted the important methylation features captured by BCNN. The attention map of BCNN was generated to visualize the important regions (blue color) that contribute to the model’s decision (Fig. S5). The most important region contained DMCs at which the CpGs were almost fully methylated in non-tumor reads (“ML” represents methylated CpGs) but not in tumor reads. These results suggest that BCNN can capture the deep semantic information (methylation differences) of DNA and the features that distinguish tumor from non-tumor reads.

Fig. 5 [Images not available. See PDF.]

BCNN outperformed other models in identifying ESCC-derived tumor reads. (a, b) Performance of different models in distinguishing ESCC-derived tumor reads from non-tumor reads, evaluated by accuracy, F1 score, MCC, and precision-recall curve. (c-h) t-SNE visualization of the high-dimensional vectors of tumor and non-tumor reads in different models. Tumor and non-tumor reads were fed into the trained models, and their high-dimensional vectors were extracted from the hidden layer of each model (BCNN, CNN_GRU, CNN_LSTM, GRU, LSTM, and CNN). The vectors were reduced to two dimensions using t-SNE to show how well each model captures the sequence-pattern differences between tumor and non-tumor reads.

BCNN achieved higher accuracy in detection of ESCC individuals than other models

When the trained models were used to predict ctDNA (predicted probability > 0.5) in plasma cfDNA, the probability distributions of the predicted ctDNA differed markedly between models in both healthy individuals and ESCC patients (Fig. 6a, b). Most ctDNA reads predicted by BCNN had high-confidence probabilities (e.g., probability > 0.9), whereas the other traditional models predicted ctDNA with much lower confidence (e.g., probability < 0.8).
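The thresholds mentioned above (a call threshold of 0.5, with 0.9 and 0.8 as illustrative confidence cut-offs) can be turned into a simple tally of per-read predictions. This is a sketch under those assumptions; the function name `bin_ctdna_calls`, the bin labels, and the example probabilities are hypothetical and not taken from the study.

```python
def bin_ctdna_calls(probs, call_threshold=0.5, low_cut=0.8, high_cut=0.9):
    """Tally per-read tumor probabilities into confidence bins.

    Reads with probability <= call_threshold are not called as ctDNA.
    Called reads are split into lower (< low_cut), intermediate
    (low_cut..high_cut) and high (> high_cut) confidence.
    """
    counts = {"not_called": 0, "lower_confidence": 0,
              "intermediate": 0, "high_confidence": 0}
    for p in probs:
        if p <= call_threshold:
            counts["not_called"] += 1
        elif p < low_cut:
            counts["lower_confidence"] += 1
        elif p <= high_cut:
            counts["intermediate"] += 1
        else:
            counts["high_confidence"] += 1
    return counts


# Hypothetical per-read probabilities from one plasma sample.
example_counts = bin_ctdna_calls([0.20, 0.55, 0.75, 0.85, 0.95, 0.99])
```

A model whose called reads fall mostly in the high-confidence bin corresponds to the behaviour reported for BCNN; a model whose called reads fall mostly in the lower bin corresponds to the behaviour reported for the other models.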

Fig. 6 [Images not available. See PDF.]

BCNN outperformed other models in detecting ESCC individuals by accurately identifying plasma ctDNA. (a, b) Probability distribution of predicted ctDNA in healthy individuals and ESCC patients in the validation cohort for different models. (c) Average ROC of the models for ESCC individual detection based on four-fold cross-validation in the validation cohort. (d, e, f) Performance of BCNN in ESCC individual detection across groups defined by TNM stage, tumor size, and lymphatic metastasis, respectively.

Accurately identifying ctDNA in cfDNA is important for evaluating the ESCC risk score. The individual risk score is estimated by the maximum posterior probability method, which has been reported to be useful for predicting cancer risk^19,30. We compared the performance of the different models in distinguishing ESCC patients from healthy individuals in the independent validation cohort, which included 51 individuals in total (35 ESCC patients and 16 healthy individuals). As shown in Fig. 6c, BCNN achieved the highest average AUC (94.6%; confidence interval [CI] 93.0–96.3%) across the four-fold cross-validation datasets. We further examined the performance of BCNN in early-stage ESCC. ESCC patients were separated into three subgroups based on the TNM staging system³¹. BCNN retained good sensitivity in detecting stage I ESCC (85.6% ± 10.3%, n = 8), and its specificity in identifying healthy individuals (n = 16) was 93.75% (± 4.2%) in the validation cohort (Fig. 6d). For stage II and III ESCC, BCNN showed higher sensitivity (Fig. 6d). BCNN consistently demonstrated good sensitivity regardless of lymphatic metastasis (Fig. 6f, Table S3). The sensitivity of BCNN for detecting ESCC patients with small (≤ 3 cm) and larger (> 3 cm) tumors was 85.2% and 100%, respectively (Fig. 6e, Table S3). In the male and female groups, the sensitivity of BCNN for ESCC detection was 87.3% and 92.6%, respectively (Fig. S6, Table S3).
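The maximum posterior probability estimate is not spelled out in this section. One formulation used by read-level methods of this kind treats the plasma ctDNA fraction θ as unknown and, under a uniform prior, picks the θ that maximizes the product over reads of θ·p_i + (1 − θ)·(1 − p_i), where p_i is the model's tumor probability for read i. The sketch below assumes that formulation; the function name `estimate_risk_score`, the grid search, and the example probabilities are illustrative choices rather than the authors' implementation.

```python
import math


def estimate_risk_score(read_probs, grid_size=1000):
    """Grid-search the ctDNA fraction theta that maximizes the log-likelihood.

    Assumed likelihood per read: theta * p + (1 - theta) * (1 - p),
    with a uniform prior on theta in [0, 1].
    """
    eps = 1e-12  # guards against log(0) at the ends of the grid
    best_theta, best_ll = 0.0, -math.inf
    for k in range(grid_size + 1):
        theta = k / grid_size
        ll = sum(math.log(max(theta * p + (1 - theta) * (1 - p), eps))
                 for p in read_probs)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Hypothetical samples: mostly non-tumor reads vs. a visible tumor-read share.
healthy_like = [0.05] * 95 + [0.60] * 5
patient_like = [0.05] * 70 + [0.95] * 30
risk_healthy = estimate_risk_score(healthy_like)
risk_patient = estimate_risk_score(patient_like)
```

Individuals are then ranked or thresholded on the estimated fraction; the threshold itself would have to be chosen on training data and is not specified here.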

To explore the influence of sequencing depth on prediction performance, we performed a down-sampling experiment on the reads from the validation cohort. The sensitivity of ESCC detection decreased slightly when the data were down-sampled to 3G (1×), while the specificity remained unchanged at 93.75% (Table S2). However, the sensitivity decreased rapidly when each sample was down-sampled to 1G or lower (Table S2). Therefore, to achieve good detection performance, we recommend a minimum coverage depth of 1× for each sample.
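The relation between data size and depth, and the down-sampling step itself, can be sketched as follows. This assumes that "3G" means roughly 3 gigabases of sequence and that the human genome is about 3.1 Gb; the names `mean_depth` and `downsample` are hypothetical, and uniform random sampling of reads is one reasonable choice, not necessarily the procedure used in the study.

```python
import random

HUMAN_GENOME_BP = 3.1e9  # approximate genome size (assumption)


def mean_depth(total_bases, genome_size=HUMAN_GENOME_BP):
    """Average coverage depth implied by a given amount of sequence."""
    return total_bases / genome_size


def downsample(reads, fraction, seed=0):
    """Draw a reproducible random subset of reads without replacement."""
    rng = random.Random(seed)
    k = round(len(reads) * fraction)
    return rng.sample(reads, k)


# About 3 gigabases corresponds to roughly 1x coverage of the human genome.
depth_3g = mean_depth(3e9)

# Hypothetical read identifiers, reduced to a quarter of the original size.
all_reads = [f"read_{i}" for i in range(1000)]
subset = downsample(all_reads, 0.25, seed=1)
```

Fixing the seed makes each down-sampled dataset reproducible, so sensitivity and specificity at different depths can be compared on the same subsets.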

Discussion

Enzyme-based methylation detection is more advantageous than the bisulfite-based method for detecting abnormally methylated ctDNA in cancer diagnosis

Detecting ctDNA in plasma enables the early detection of cancer³⁴, facilitates monitoring of treatment efficacy³⁵, and offers prognostic insights into tumor burden following treatment or surgery³⁵. The non-invasive nature of ctDNA sampling is a significant advantage over needle biopsies, permitting more frequent patient monitoring without the associated risks and discomfort. Because the proportion of ctDNA in plasma is extremely low^36,37, preserving as much ctDNA as possible is crucial for tumor detection, especially for early-stage tumors. Bisulfite treatment is considered the gold standard for methylation detection^{38, 39, 40, 41–42}. However, our study found that bisulfite treatment causes a loss of CpG sites and an increase in short DNA fragments in cfDNA across the whole genome. If tumor-specific abnormally methylated CpG sites are lost from a ctDNA fragment, that ctDNA signal is lost from the plasma and cannot be detected, which can lead to false negatives in cancer detection. Therefore, our study suggests that enzyme-based methylation detection is better suited than the bisulfite-based method for detecting abnormally methylated ctDNA.

Most of the identified DMRs for ESCC detection are reported to be associated with ESCC progression

Because enzyme-based methyl sequencing showed advantages over bisulfite-based sequencing, we performed enzyme-based methyl sequencing on all our tissue and plasma samples. The DMRs identified in ESCC tissues are considered the source of ctDNA for ESCC detection. Pathway enrichment analysis of the DMRs showed that most of the top five enriched pathways are associated with ESCC. For instance, substantial evidence has demonstrated crucial roles of G-protein coupled receptor signaling pathways in the development of gastrointestinal cancers, including ESCC⁴³. Genes associated with the neuropeptide signaling pathway, such as the LGI1 receptor ADAM23, have been reported to regulate biomarkers of ferroptosis and the progression of esophageal cancer⁴⁴. Monoamine metabolism is altered in ESCC, and the monoamine transporter SLC22A3 was reported to drive early tumor invasion and metastasis in familial esophageal cancer^45,46. Epigenetic alterations of a monoamine transporter were also reported to increase the risk of esophageal cancer⁴⁷. Multi-omics analyses in patients with esophageal squamous cell carcinoma suggested that neuroactive ligand-receptor interaction pathways are associated with a high risk of ESCC^{48, 49–50}. Pathway analysis of aberrantly methylated, differentially expressed genes in ESCC patients suggested that cAMP signaling was one of the most enriched pathways⁵¹. Calcium signaling, e.g., increased intracellular calcium concentration, was reported to inhibit cell proliferation and enhance apoptosis in human ESCC cell lines⁵². Therefore, in agreement with previous studies, most of our identified DMRs can be considered ESCC-associated regions for ESCC detection.

BCNN outperforms other models in identifying ESCC-derived ctDNA and detecting early-stage ESCC

Hybrids of RNN and CNN have been widely applied in various biological fields, including genomic sequencing^{19,20,53, 54, 55, 56–57}. A major disadvantage of RNNs is that they do not solve the problem of long-distance dependence, which makes it difficult for them to learn the deep semantics of genome sequences and to capture their internal correlations. The self-attention mechanism in the transformer directly captures dependencies between any two positions in a long sequence. This allows the transformer to integrate context information over long sequences without encountering the vanishing or exploding gradient problem. BERT, which comprises multiple transformer layers, leverages unsupervised pre-training on large-scale data to learn rich semantic information⁵⁸, improving the model’s ability to integrate information. In this study, to improve the identification of ESCC-derived ctDNA and the detection of ESCC, we first pre-trained the BERT module of our BCNN on methyl-seq data and then fine-tuned it for the specific task of identifying ESCC-derived tumor reads. As expected, the hybrid of BERT and CNN distinguished tumor from non-tumor reads better than the hybrids of RNN and CNN and the other traditional models. When the trained models were applied to predict tumor-derived ctDNA, BCNN produced many more confident predictions (e.g., probability > 0.9) and fewer unconfident predictions (e.g., 0.5 < probability < 0.7) than the traditional models. These results suggest that BCNN better understands deep DNA semantic information and grasps the key differences between tumor and non-tumor DNA more accurately than traditional models.
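The claim that self-attention links any two positions directly can be illustrated with scaled dot-product attention, the core operation of a transformer layer. The sketch below is deliberately stripped down: it omits the learned query, key, and value projections, multiple heads, and positional encodings that BERT uses, and the toy token vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    Every query position is scored against every key position in one step,
    so distant positions interact as directly as neighbouring ones.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three toy token embeddings (e.g. k-mer tokens of a read); values are made up.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = self_attention(tokens, tokens, tokens)
```

Each row of `attn_weights` sums to 1 and shows how much one position draws on every other position, which is the quantity an attention map visualizes.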

Our proposed workflow requires only low-depth methylation sequencing of the cfDNA samples (> 1×) to detect ESCC, so the detection cost is controllable. Targeted methylation sequencing is expected to further reduce the average sequencing cost per sample. The turnaround time for the entire workflow is approximately one week. This low-cost but effective method reinforces the idea of employing BCNN as a novel strategy for ESCC detection, including at an early stage, in clinical practice. The preliminary results showed that BCNN is able to detect early-stage ESCC (TNM stage I; 7 out of 8 patients were detected). However, this is a preliminary study with a small sample size, and more samples are required to evaluate the method. We will also collaborate with other centers to collect more independent samples and further validate the model’s performance. Our model detects ESCC by identifying ctDNA (aberrantly methylated DNA); thus, if ESCC patients from different centers have similar aberrantly methylated regions in their tumors, our model should in theory be able to detect them.
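The small-sample caveat can be made concrete with a short calculation. Seven of eight stage I patients corresponds to a sensitivity of 87.5%, and a specificity of 93.75% with 16 healthy individuals is consistent with 15 of 16. The Wilson score interval used below is a standard way to express uncertainty for a proportion; it is added here purely for illustration and is not an analysis reported in the study.

```python
import math


def proportion(successes, total):
    """Plain proportion, e.g. detected patients / all patients."""
    return successes / total


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


stage1_sensitivity = proportion(7, 8)   # 7 of 8 stage I patients detected
specificity = proportion(15, 16)        # assumes 15 of 16 healthy called negative
ci_low, ci_high = wilson_interval(7, 8)
```

With only eight patients the interval is wide, which supports the need for validation in a larger cohort.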

In our study, we did not test whether BCNN can distinguish ESCC from other diseases. In theory, distinguishing other diseases from ESCC requires identifying the genomic regions that are differentially methylated between those diseases and ESCC. By training the model with new data from these regions, the model could differentiate ESCC from other diseases. The key challenge in cross-cancer detection is identifying cancer-specific aberrant methylation regions unique to each cancer type. If such cancer-specific methylation markers are found, BCNN could in theory also be customized for multi-cancer detection. In future studies, we plan to collect samples from different cancer types and explore the model’s performance in multi-cancer detection.

Conclusions

The presence of tumor-specific methylated CpG sites in cfDNA is important for ctDNA identification. Our study showed that enzymatic methyl sequencing preserves more CpG sites and longer DNA fragments for ctDNA detection than the bisulfite-based method. Enzymatic methyl sequencing is therefore more advantageous for preserving trace amounts of ctDNA for cancer diagnosis. Further, we developed a hybrid neural network named BCNN, which identifies ctDNA more accurately and with higher confidence than other traditional models. By combining enzymatic methyl sequencing with our BCNN model, ESCC can be detected by accurately identifying ctDNA. However, a larger sample size is necessary to evaluate the clinical utility of the model.

Acknowledgements

We are grateful to the patients, volunteers, and their family members for their participation in this study. We also thank the anonymous reviewers, whose comments helped improve this manuscript.

Author contributions

DJ, CCX, JCX, and DYX designed the study. YLW, LQZ, HW, JXL, SJ, and XT collected samples and clinical data. JFS and LC checked clinical data. SSX and QXH downloaded and analyzed public datasets. JFS, YQR, YLW, ZYM, QXH, JCD, and YTL performed statistical analysis. ZYM and QXH performed wet-lab experiments. JFS, YQR, YLW, ZYM, and QXH wrote the draft of the manuscript. YTL, XT, LC, CCX, JCX, DYX, and DJ revised the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by National Natural Science Foundation of China (81972800, 82203276), National Natural Science Foundation horizontal project (H231174), Suzhou Health Talent Training Project (GSWS2020008), the grants of Suzhou Science and Technology Development Plan (SKY2023044), Key Project of Taizhou School of Clinical Medicine of Nanjing Medical University (TZKY20220309), College Student Innovation and Entrepreneurship Training Program Project (2023xj031, Soochow University).

Data availability

The raw sequence data reported in this paper will be deposited in the Genome Sequence Archive in the National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA-Human: HRA010450), which is publicly accessible at https://ngdc.cncb.ac.cn/gsa-human. Applicants can request the data via the GSA website by completing the required information and submitting a data download application. The Data Access Committee will review the application and, upon approval, grant the applicant data download access. The source code of BCNN is available on GitHub (https://github.com/Bamrock/BCNN).

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

The study was approved by the ethics committee of the First Affiliated Hospital of Soochow University (SZFAT-2021-002) and conformed to the principles of the Declaration of Helsinki. All participants provided written informed consent.

Abbreviations

ESCC

Esophageal squamous cell carcinoma

TNM

Tumor-node-metastasis

NGS

Next-generation sequencing

cfDNA

Cell-free DNA

ctDNA

Circulating tumor DNA

CNN

Convolutional neural network

RNN

Recurrent neural network

LSTM

Long short-term memory

GRU

Gated recurrent units

BERT

Bidirectional encoder representations from transformers

DMRs

Differentially methylated regions

DMCs

Differentially methylated CpG sites

MCS

Methylation continuity score

DUS

DMR universality score

MCC

Matthews correlation coefficient

CI

Confidence interval

AUC

Area under the curve

ROC

Receiver operating characteristic

PR

Precision recall

True positives

True negatives

False positives

False negatives

Risk score

NEEM-seq

No end-repair enzymatic methylation sequencing

MLM

Masked language model

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Ibrahim, J; Peeters, M; Van Camp, G; Op de Beeck, K. Methylation biomarkers for early cancer detection and diagnosis: current and future perspectives. Eur. J. Cancer; 2023; 178, pp. 91-113. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36427394][DOI: https://dx.doi.org/10.1016/j.ejca.2022.10.015]

2. Nishiyama, A; Nakanishi, M. Navigating the DNA methylation landscape of cancer. Trends Genet.; 2021; 37, 11 pp. 1012-1027. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34120771][DOI: https://dx.doi.org/10.1016/j.tig.2021.05.002]

3. Luo, H; Wei, W; Ye, Z; Zheng, J; Xu, RH. Liquid biopsy of methylation biomarkers in cell-free DNA. Trends Mol. Med.; 2021; 27, 5 pp. 482-500. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33500194][DOI: https://dx.doi.org/10.1016/j.molmed.2020.12.011]

4. Hlady, RA et al. Genome-wide discovery and validation of diagnostic DNA methylation-based biomarkers for hepatocellular cancer detection in circulating cell free DNA. Theranostics; 2019; 9, 24 pp. 7239-7250. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31695765][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6831291][DOI: https://dx.doi.org/10.7150/thno.35573]

5. Wu, X et al. Circulating tumor DNA as an emerging liquid biopsy biomarker for early diagnosis and therapeutic monitoring in hepatocellular carcinoma. Int. J. Biol. Sci.; 2020; 16, 9 pp. 1551-1562. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32226301][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7097921][DOI: https://dx.doi.org/10.7150/ijbs.44024]

6. Xu, RH et al. Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma. Nat. Mater.; 2017; 16, 11 pp. 1155-1161. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29035356][DOI: https://dx.doi.org/10.1038/nmat4997]

7. Zhang, C et al. Meta-analysis of DNA methylation biomarkers in hepatocellular carcinoma. Oncotarget; 2016; 7, 49 pp. 81255-81267. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27835605][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5348390][DOI: https://dx.doi.org/10.18632/oncotarget.13221]

8. Constancio, V et al. Early detection of the major male cancer types in blood-based liquid biopsies using a DNA methylation panel. Clin. Epigenetics; 2019; 11, 1 175. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31791387][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6889617][DOI: https://dx.doi.org/10.1186/s13148-019-0779-x]

9. Li, P et al. Liquid biopsies based on DNA methylation as biomarkers for the detection and prognosis of lung cancer. Clin. Epigenetics; 2022; 14, 1 118. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36153611][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9509651][DOI: https://dx.doi.org/10.1186/s13148-022-01337-0]

10. Nunes, S. P. et al. Cell-free DNA methylation of selected genes allows for early detection of the major cancers in women. Cancers (Basel) 10(10). (2018).

11. Deng, Z et al. Early detection of hepatocellular carcinoma via no end-repair enzymatic methylation sequencing of cell-free DNA and pre-trained neural network. Genome Med.; 2023; 15, 1 93. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37936230][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10631027][DOI: https://dx.doi.org/10.1186/s13073-023-01238-8]

12. Sohda, M; Kuwano, H. Current status and future prospects for esophageal cancer treatment. Ann. Thorac. Cardiovasc. Surg.; 2017; 23, 1 pp. 1-11. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28003586][DOI: https://dx.doi.org/10.5761/atcs.ra.16-00162]

13. Higuchi, K et al. Current management of esophageal squamous-cell carcinoma in Japan and other countries. Gastrointest. Cancer Res.; 2009; 3, 4 pp. 153-161. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19742141][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2739640]

14. Ma, Y et al. Methylation silencing of TGF-beta receptor type II is involved in malignant transformation of esophageal squamous cell carcinoma. Clin. Epigenetics; 2020; 12, 1 25. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32046777][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7014638][DOI: https://dx.doi.org/10.1186/s13148-020-0819-6]

15. Lin, L; Cheng, X; Yin, D. Aberrant DNA methylation in esophageal squamous cell carcinoma: biological and clinical implications. Front. Oncol.; 2020; 10, 549850. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33194605][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7645039][DOI: https://dx.doi.org/10.3389/fonc.2020.549850]

16. Xi, Y et al. Multi-omic characterization of genome-wide abnormal DNA methylation reveals diagnostic and prognostic markers for esophageal squamous-cell carcinoma. Signal. Transduct. Target. Ther.; 2022; 7, 1 53. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35210398][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8873499][DOI: https://dx.doi.org/10.1038/s41392-022-00873-8]

17. Chen, C et al. Genome-wide profiling of DNA methylation and gene expression in esophageal squamous cell carcinoma. Oncotarget; 2016; 7, 4 pp. 4507-4521. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26683359][DOI: https://dx.doi.org/10.18632/oncotarget.6607]

18. Teng, H et al. Inter- and intratumor DNA methylation heterogeneity associated with lymph node metastasis and prognosis of esophageal squamous cell carcinoma. Theranostics; 2020; 10, 7 pp. 3035-3048. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32194853][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7053185][DOI: https://dx.doi.org/10.7150/thno.42559]

19. Li, J. et al. DISMIR: deep learning-based noninvasive cancer detection by integrating DNA sequence and methylation information of individual cell-free DNA reads. Brief Bioinform 22(6). (2021).

20. Quang, D; Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res.; 2016; 44, 11 e107. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27084946][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4914104][DOI: https://dx.doi.org/10.1093/nar/gkw226]

21. Chen, S; Zhou, Y; Chen, Y; Gu, J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics; 2018; 34, 17 pp. i884-i890. [DOI: https://dx.doi.org/10.1093/bioinformatics/bty560] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30423086][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129281]

22. Krueger, F; Andrews, SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics; 2011; 27, 11 pp. 1571-1572. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21493656][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102221][DOI: https://dx.doi.org/10.1093/bioinformatics/btr167]

23. Akalin, A et al. MethylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol.; 2012; 13, 10 R87. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23034086][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491415][DOI: https://dx.doi.org/10.1186/gb-2012-13-10-r87]

24. Deng, Z. et al. Early detection of hepatocellular carcinoma via no end-repair enzymatic methylation sequencing of cell-free DNA and pre-trained neural network. Genome Med 15(1). (2023).

25. Ashburner, M et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet.; 2000; 25, 1 pp. 25-29. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/10802651][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3037419][DOI: https://dx.doi.org/10.1038/75556]

26. Gene Ontology Consortium et al. The gene ontology knowledgebase in 2023. Genetics 224(1). (2023).

27. Kanehisa, M; Furumichi, M; Sato, Y; Matsuura, Y; Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res.; 2025; 53, D1 pp. D672-D677. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39417505][DOI: https://dx.doi.org/10.1093/nar/gkae909]

28. Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci.; 2019; 28, 11 pp. 1947-1951. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31441146][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6798127][DOI: https://dx.doi.org/10.1002/pro.3715]

29. Kanehisa, M; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res.; 2000; 28, 1 pp. 27-30. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/10592173][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102409][DOI: https://dx.doi.org/10.1093/nar/28.1.27]

30. Li, W et al. CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data. Nucleic Acids Res.; 2018; 46, 15 e89. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29897492][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6125664][DOI: https://dx.doi.org/10.1093/nar/gky423]

31. Group CsocoW. Esophageal cancer diagnosis and treatment guidelines (2023 edition). (2023).

32. Van der Maaten, L; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res.; 2008; 9, Nov pp. 2579-2605.

33. Van der Maaten, L; Hinton, G. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res.; 2014; 15, Oct pp. 3221-3245.

34. Lennon, A. M. et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science 369(6499). (2020).

35. Wan, J. C. M. et al. ctDNA monitoring using patient-specific sequencing and integration of variant reads. Sci Transl Med 12(548). (2020).

36. Markus, H et al. Refined characterization of circulating tumor DNA through biological feature integration. Sci. Rep.; 2022; 12, 1 1928. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35121756][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8816939][DOI: https://dx.doi.org/10.1038/s41598-022-05606-z]

37. Wen, X., Pu, H., Liu, Q., Guo, Z. & Luo, D. Circulating Tumor DNA-A Novel Biomarker of Tumor Progression and Its Favorable Detection Techniques. Cancers (Basel) 14(24). (2022).

38. Li, Y; Tollefsbol, TO. DNA methylation detection: bisulfite genomic sequencing analysis. Methods Mol. Biol.; 2011; 791, pp. 11-21. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21913068][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3233226][DOI: https://dx.doi.org/10.1007/978-1-61779-316-5_2]

39. Mill, J et al. Whole genome amplification of sodium bisulfite-treated DNA allows the accurate estimate of methylated cytosine density in limited DNA resources. Biotechniques; 2006; 41, 5 pp. 603-607. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/17140118][DOI: https://dx.doi.org/10.2144/000112266]

40. Liu, J et al. Genome-wide cell-free DNA methylation analyses improve accuracy of non-invasive diagnostic imaging for early-stage breast cancer. Mol. Cancer; 2021; 20, 1 36. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33608029][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7893735][DOI: https://dx.doi.org/10.1186/s12943-021-01330-w]

41. Gao, Y et al. Whole-genome bisulfite sequencing analysis of circulating tumour DNA for the detection and molecular classification of cancer. Clin. Transl Med.; 2022; 12, 8 e1014. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35998020][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9398227][DOI: https://dx.doi.org/10.1002/ctm2.1014]

42. Legendre, C et al. Whole-genome bisulfite sequencing of cell-free DNA identifies signature associated with metastatic breast cancer. Clin. Epigenetics; 2015; 7, 1 100. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26380585][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4573288][DOI: https://dx.doi.org/10.1186/s13148-015-0135-8]

43. Zeng, Z. et al. Roles of G protein-coupled receptors (GPCRs) in gastrointestinal cancers: focus on sphingosine 1-phosphate receptors, angiotensin II receptors, and estrogen-related GPCRs. Cells 10(11). (2021).

44. Chen, C; Zhao, J; Liu, JN; Sun, C. Mechanism and role of the neuropeptide LGI1 receptor ADAM23 in regulating biomarkers of ferroptosis and progression of esophageal cancer. Dis. Markers; 2021; 2021, 9227897. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35003396][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739919][DOI: https://dx.doi.org/10.1155/2021/9227897]

45. Tao, Y et al. Identification of distinct gene expression profiles between esophageal squamous cell carcinoma and adjacent normal epithelial tissues. Tohoku J. Exp. Med.; 2012; 226, 4 pp. 301-311. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22499122][DOI: https://dx.doi.org/10.1620/tjem.226.301]

46. Fu, L et al. RNA editing of SLC22A3 drives early tumor invasion and metastasis in familial esophageal cancer. Proc. Natl. Acad. Sci. U S A; 2017; 114, 23 pp. E4631-E4640. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28533408][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5468658][DOI: https://dx.doi.org/10.1073/pnas.1703178114]

47. Xiong, JX et al. Epigenetic alterations of a novel antioxidant gene SLC22A3 predispose susceptible individuals to increased risk of esophageal cancer. Int. J. Biol. Sci.; 2018; 14, 12 pp. 1658-1668. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30416380][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6216027][DOI: https://dx.doi.org/10.7150/ijbs.28482]

48. Ma, F et al. Heterogeneity analysis of esophageal squamous cell carcinoma in cell lines, tumor tissues and patient-derived xenografts. J. Cancer; 2021; 12, 13 pp. 3930-3944. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34093800][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8176252][DOI: https://dx.doi.org/10.7150/jca.52286]

49. Chattopadhyay, I et al. Genome-wide analysis of chromosomal alterations in patients with esophageal squamous cell carcinoma exposed to tobacco and betel quid from high-risk area in India. Mutat. Res.; 2010; 696, 2 pp. 130-138. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20083228][DOI: https://dx.doi.org/10.1016/j.mrgentox.2010.01.001]

50. Chen, Z et al. RNA-associated co-expression network identifies novel biomarkers for digestive system cancer. Front. Genet.; 2021; 12, 659788. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33841514][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8033200][DOI: https://dx.doi.org/10.3389/fgene.2021.659788]

51. Han, BA et al. Identification of candidate aberrantly methylated and differentially expressed genes in esophageal squamous cell carcinoma. Sci. Rep.; 2020; 10, 1 9735. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32546690][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7297810][DOI: https://dx.doi.org/10.1038/s41598-020-66847-4]

52. Wang, X et al. Effect of TRPM2-mediated calcium signaling on cell proliferation and apoptosis in esophageal squamous cell carcinoma. Technol. Cancer Res. Treat.; 2021; 20, 15330338211045213. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34605693][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8642046][DOI: https://dx.doi.org/10.1177/15330338211045213]

53. Kelley, DR; Snoek, J; Rinn, JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res.; 2016; 26, 7 pp. 990-999. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27197224][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937568][DOI: https://dx.doi.org/10.1101/gr.200535.115]

54. Zhou, J; Troyanskaya, OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods; 2015; 12, 10 pp. 931-934. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26301843][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4768299][DOI: https://dx.doi.org/10.1038/nmeth.3547]

55. Li, J., Pu, Y., Tang, J., Zou, Q. & Guo, F. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief Bioinform 22(3). (2021).

56. Liu, Q; Xia, F; Yin, Q; Jiang, R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics; 2018; 34, 5 pp. 732-738. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29069282][DOI: https://dx.doi.org/10.1093/bioinformatics/btx679]

57. Min, X; Zeng, W; Chen, N; Chen, T; Jiang, R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics; 2017; 33, 14 pp. i92-i101. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28881969][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870572][DOI: https://dx.doi.org/10.1093/bioinformatics/btx234]

58. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019(1):4171–4186.



© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract


Detecting cancer at an early stage can significantly improve patients' five-year survival rate. Bisulfite-based methylation detection can damage DNA, especially in high-GC-content regions, which are associated with cancer development. Loss of aberrantly methylated CpG sites in cfDNA renders some circulating tumor DNA (ctDNA) undetectable and may consequently impair cancer detection. Our study uses an enzymatic method to detect abnormally methylated regions across the whole genome in esophageal squamous cell carcinoma (ESCC). We also provide a pretrained neural network, a hybrid of BERT and CNN (BCNN), to identify ctDNA robustly. Maximum posterior probability is used to estimate the fraction of ESCC-derived ctDNA in plasma and thereby predict the risk of ESCC. Our analysis indicated that enzyme-based whole-genome methylation sequencing retained more of the longer cfDNA fragments and detected more CpG sites than the bisulfite-based method in both gDNA and cfDNA. Enrichment analysis of differentially methylated regions (DMRs) showed that the top five pathways were associated with ESCC. Compared with traditional models, our BCNN demonstrated the best performance in identifying ctDNA (AUC = 0.970). By estimating the fraction of plasma ctDNA, our BCNN achieved high accuracy in ESCC detection even at ultralow sequencing depths (AUC = 0.946). Specifically, in the validation cohort, at a specificity of 93.75%, 7 of 8 early-stage ESCC cases (TNM Stage I) were identified as positive in our preliminary results. In conclusion, our preliminary results support employing BCNN as a novel strategy for early ESCC detection in clinical practice. However, further validation with a larger sample size is necessary before clinical application.
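
To illustrate the idea of estimating a plasma ctDNA fraction by maximizing a posterior, the following is a minimal sketch and not the authors' implementation. It assumes each cfDNA read already has a tumor probability from some classifier calibrated under a balanced training prior, treats the reads as a two-component mixture, and searches a grid for the fraction with the highest posterior. The function name `map_ctdna_fraction`, the Beta prior parameters, and the grid size are hypothetical choices made for this example.

```python
import math


def map_ctdna_fraction(read_probs, grid_size=1001, prior_a=1.0, prior_b=1.0):
    """Grid-search MAP estimate of the tumor-derived read fraction theta.

    read_probs: per-read tumor probabilities, assumed (for this sketch) to be
        calibrated under a balanced 50/50 training prior.
    prior_a, prior_b: Beta prior parameters on theta; (1, 1) is a flat prior.
    """
    eps = 1e-12  # guards against log(0)
    best_theta, best_logpost = 0.0, -math.inf
    for k in range(grid_size):
        theta = k / (grid_size - 1)
        # Beta log-prior up to an additive constant, clipped at the endpoints.
        t = min(max(theta, eps), 1.0 - eps)
        logpost = (prior_a - 1.0) * math.log(t) + (prior_b - 1.0) * math.log(1.0 - t)
        # Mixture likelihood: a read is tumor-derived with probability theta.
        for p in read_probs:
            logpost += math.log(theta * p + (1.0 - theta) * (1.0 - p) + eps)
        if logpost > best_logpost:
            best_theta, best_logpost = theta, logpost
    return best_theta


if __name__ == "__main__":
    # Toy input: 30 reads that look tumor-like and 70 that look normal.
    toy_probs = [0.9] * 30 + [0.1] * 70
    print(map_ctdna_fraction(toy_probs))
```

Under these assumptions the estimated fraction can be compared against a threshold chosen on control samples to call a sample positive. The grid search is used here only because it is simple to read; a sample with many reads would more likely use a vectorized or EM-style optimizer.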

Details

Title

Improved circulating tumor DNA identification for detection of esophageal squamous cell carcinoma by enzymatic methyl sequencing and hybrid neural network

Author

Shen, Jiangfeng¹; Ren, Yuqi²; Mao, Ziyong³; Hu, Qiuxiang³; Wan, Yilong⁴; Du, Jiangcun²; Lai, Yanting⁵; Shao, Shengxiang⁶; Zhang, Liuqing⁴; Wu, Hao⁴; Li, Jiaxi⁴; Ju, Sheng⁴; Tong, Xin⁴; Zhao, Jun⁴; Cao, Lei⁷; Xiong, Deyi²; Xu, ChengCheng⁴; Xu, Jun-Chi⁸; Jiang, Dong⁴

¹ Taizhou School of Clinical Medicine, The Affiliated Taizhou People’s Hospital of Nanjing Medical University, Nanjing Medical University, Jiangsu Province, China (ROR: https://ror.org/059gcgy73) (GRID: grid.89957.3a) (ISNI: 0000 0000 9255 8984)
² College of Intelligence and Computing, Tianjin University, Tianjin, China (ROR: https://ror.org/012tb2g32) (GRID: grid.33763.32) (ISNI: 0000 0004 1761 2484)
³ BamRock Research Department, Suzhou BamRock Biotechnology Ltd, Suzhou, Jiangsu Province, China
⁴ Department of Thoracic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu Province, China (ROR: https://ror.org/051jg5p78) (GRID: grid.429222.d) (ISNI: 0000 0004 1798 0228)
⁵ College of Agricultural and Environmental Sciences, University of California, Davis, USA (ROR: https://ror.org/05rrcem69) (GRID: grid.27860.3b) (ISNI: 0000 0004 1936 9684)
⁶ Medical College of Soochow University, Suzhou, Jiangsu Province, China (ROR: https://ror.org/05kvm7n82) (GRID: grid.445078.a) (ISNI: 0000 0001 2290 4690)
⁷ Jiangsu Institute of Clinical Immunology & Jiangsu Key Laboratory of Clinical Immunology, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu Province, China (ROR: https://ror.org/051jg5p78) (GRID: grid.429222.d) (ISNI: 0000 0004 1798 0228)
⁸ The Fifth People’s Hospital of Suzhou, Suzhou, Jiangsu Province, China (ROR: https://ror.org/05jy72h47) (GRID: grid.490559.4)

Pages

33004

Section

Article

Publication year

2025

Publication date

2025

Publisher

Nature Publishing Group

e-ISSN

20452322

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41598-025-18278-2

ProQuest document ID

3254840726

