Abbreviations

AUC
area under the curve

BPNs
benign pulmonary nodules

CEA
carcinoembryonic antigen

CT
computed tomography

FPR
false positive rate

LDCT
low-dose CT

LUAD
lung adenocarcinoma

LUSC
lung squamous cell carcinoma

MPNs
malignant pulmonary nodules

NSCLC
non–small cell lung cancer

NSE
neuron-specific enolase

PNs
pulmonary nodules

PPV
positive predictive value

ROC
receiver-operating characteristic

SCC
squamous cell carcinoma antigen

SCLC
small cell lung cancer

INTRODUCTION

Lung cancer is the leading cause of cancer-related death in men and women.¹ Because early symptoms are atypical, patients tend to be diagnosed at an advanced stage. The 5-year survival rate for distant-stage lung cancer is only 5%, whereas it can reach 57% for localized disease.^2–4 Therefore, an effective and reliable means of diagnosing lung cancer, especially early lung cancer, could improve the prognosis and survival of patients.

Due to its accuracy in lung cancer diagnosis, computed tomography (CT) is widely used in the clinic. However, CT has a certain false positive rate (FPR) in screening for lung cancer, which increases financial costs and psychological pressure for patients. Moreover, repeated CT examination increases the risk of lung cancer.^5–8

Therefore, reducing the FPR of CT and developing a method to distinguish malignant pulmonary nodules (MPNs) from benign pulmonary nodules (BPNs) could promote the application of CT in the clinic and help spare patients with BPNs from unnecessary subsequent treatment.

Carcinoembryonic antigen (CEA), a 180-kD glycoprotein, was first described in 1965 by Gold and Freedman.⁹ CEA is regarded as a human tumor biomarker, especially for gastrointestinal tumors.¹⁰ At the same time, serum CEA is widely used for progression monitoring of lung cancer.^11–13 Moreover, several research studies^14,15 have reported that serum CEA had a high specificity in detecting lung cancer. While looking for new preferable diagnostic biomarkers, traditional biomarkers in the clinic should be reasonably applied to support the diagnosis of lung cancer.

Multiple models are currently used for the differential diagnosis of pulmonary nodules (PNs), such as the Mayo model¹⁶ and the Peking University model.¹⁷ However, they have potential weaknesses, for example, instability, reliance on a single type of index, and small sample sizes. In the present study, we aimed to establish a model combining CT and CEA to exploit the complementary strengths of both. We collected the CEA concentration and 14 CT characteristics of the 363 participants in the training data set. Univariate and multivariate analyses were used to identify the variables that could distinguish MPNs from BPNs. Based on the candidate factors, we developed six different algorithmic models combining CEA and CT via SPSS Modeler in the training data set and validated them in an independent data set. Subsequently, we compared the optimal model with the Peking University model and the Mayo model in the validation data set. Finally, we evaluated the ability of the selected model to differentiate MPNs from BPNs in stratified analyses.

MATERIALS AND METHODS

Study population

A total of 492 participants who met the inclusion criteria were enrolled from the First Affiliated Hospital of Zhengzhou University. The inclusion criteria were as follows: (1) CT and serum CEA testing were performed at the First Affiliated Hospital of Zhengzhou University, and the clinical data were available; (2) PNs were newly diagnosed and confirmed by histopathology, and patients with BPNs had no history of malignancy; (3) participants were older than 18 years. The exclusion criteria were as follows: (1) incomplete clinical data or unclear pathological results; (2) PNs that were not newly diagnosed, or BPNs in patients with a history of malignancy; (3) age younger than 18 years. The workflow of the study design and the clinical information of the participants are displayed in Figure 1 and Table 1, respectively. The study was approved by the Medical Ethics Committee of the First Affiliated Hospital of Zhengzhou University (2021-KY-1057-002).

Training data set

FIGURE 1. Workflow chart of the study. A total of 363 participants were enrolled to screen potential CT characteristics and construct six models to distinguish MPNs from BPNs via SPSS Modeler. Another 129 participants were recruited in the validation data set to validate the six models. BPN, benign pulmonary nodule; CEA, carcinoembryonic antigen; MPN, malignant pulmonary nodule

TABLE 1 Characteristics of the participants

Classification	Training data set (n = 363)		Validation data set (n = 129)
	MPN, N (%)	BPN, N (%)	MPN, N (%)	BPN, N (%)
Age
Mean	60.1	54.1	60.8	55.4
SD	11.2	12.4	11.6	13.0
Sex
Male	175 (64.6)	54 (58.7)	45 (48.9)	20 (54.1)
Female	96 (35.4)	38 (41.3)	47 (51.1)	17 (45.9)
Smoker
Yes	122 (45.0)	26 (28.3)	27 (29.3)	13 (35.1)
No	149 (55.0)	66 (71.7)	65 (70.7)	24 (64.9)
Drinker
Yes	59 (21.8)	19 (20.7)	12 (13.0)	7 (18.9)
No	212 (78.2)	73 (79.3)	80 (87.0)	30 (81.1)
Family cancer history
Yes	25 (9.2)	3 (3.3)	10 (10.9)	5 (13.5)
No	246 (90.8)	89 (96.7)	82 (89.1)	32 (86.5)
CEA (ng/ml)
<5	192 (70.8)	87 (94.6)	51 (55.4)	35 (94.6)
≥5	79 (29.2)	5 (5.4)	41 (44.6)	2 (5.4)
TNM stage
I + II	85 (31.4)		22 (24.0)
III + IV	175 (64.6)		58 (63.0)
Uncertain	11 (4.0)		12 (13.0)
Pathological type
SCLC	30 (11.1)		11 (12.0)
LUSC	67 (24.7)		15 (16.3)
LUAD	156 (57.6)		61 (66.3)
Chronic inflammation		34 (37.0)		25 (67.6)
Pulmonary tuberculosis		17 (18.5)		0 (0.0)
Granuloma		5 (5.4)		6 (16.2)
Others	18 (6.6)	36 (39.1)	5 (5.4)	6 (16.2)

Abbreviations: BPN, benign pulmonary nodule; CEA, carcinoembryonic antigen; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; MPN, malignant pulmonary nodule; SCLC, small cell lung cancer.
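As an illustration of the kind of univariate comparison used in this study, the following minimal sketch applies a Pearson chi-squared test to the training-set CEA counts in Table 1 (MPN: 79 with CEA ≥5 and 192 with CEA <5; BPN: 5 and 87). The function name is hypothetical, and the absence of a continuity correction is an assumption; the article does not say which variant SPSS applied, so the statistic here is illustrative rather than a reported result.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Training-set counts from Table 1.
# Rows: MPN, BPN. Columns: CEA >= 5 ng/ml, CEA < 5 ng/ml.
chi2 = chi_squared_2x2(79, 192, 5, 87)

# Chi-squared critical value for p = 0.001 with 1 degree of freedom.
CRITICAL_P001_DF1 = 10.828

print(f"chi2 = {chi2:.2f}")
print("significant at p < 0.001:", chi2 > CRITICAL_P001_DF1)
```

The statistic far exceeds the 1-degree-of-freedom critical value, which is consistent with CEA status differing between MPNs and BPNs in the training data set.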

A total of 363 participants in the training data set were recruited from September 2016 to March 2019, whose information constituted the sample library of PNs. The training data set was used to construct the diagnostic model based on the logistic regression for the differentiation between the MPNs and BPNs.

Validation data set

The diagnostic performance of the logistic model was validated in an independent data set of 129 participants who underwent lung puncture biopsy in January 2018.

Electrochemiluminescence and CT detection

Serum CEA concentrations of all participants were measured by electrochemiluminescence on an automatic analyzer operated by professional technicians of the First Affiliated Hospital of Zhengzhou University. Because the CEA values were widely distributed (0.12-5311 ng/ml), we converted CEA to its natural logarithm (lnCEA) to narrow the data range, reduce heteroscedasticity, and stabilize the data. The transformation is monotonic, so it does not change the rank order of the values or the nature of the data.
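The effect of the logarithmic transformation on the reported CEA range can be sketched as follows. The helper name `ln_cea` is hypothetical; the three inputs are the minimum and maximum stated above and the 5 ng/ml clinical cutoff used elsewhere in the article.

```python
import math


def ln_cea(cea_ng_per_ml):
    """Natural-log transform of a serum CEA concentration given in ng/ml."""
    if cea_ng_per_ml <= 0:
        raise ValueError("CEA concentration must be positive")
    return math.log(cea_ng_per_ml)


# Reported minimum, clinical cutoff, and reported maximum (ng/ml).
for cea in (0.12, 5.0, 5311.0):
    print(f"CEA = {cea} ng/ml -> lnCEA = {ln_cea(cea):.2f}")
```

A range spanning more than four orders of magnitude is compressed to roughly −2.1 to 8.6, while the ordering of patients by CEA is unchanged.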

Spiral CT images were acquired using a 64-detector-row CT scanner. The CT data were stored in the imaging archive and communication system. Fourteen CT nodule characteristics (diameter, number, type, morphology, margin, cavity, spiculation, vascular notch sign, lobulation, spines, pleural indentation, calcification, emphysema, and mediastinal lymph node enlargement) were rated by a senior radiologist using a 2-megapixel gray-scale monitor with lung and mediastinal windows.

Statistical analysis

Data analysis was performed with SPSS (version 21.0) and GraphPad Prism (version 6.0). The chi-squared test and the Mann-Whitney U test were used to compare variables between MPNs and BPNs. Six machine learning–based models (logistic, multilayer perceptron [MLP], radial basis function network [RBF], C5.0, chi-squared automatic interaction detection [CHAID], and support vector machines [SVM]) were developed to predict MPNs in the training data set and validated in an independent data set using SPSS Modeler 18.0. The nomogram and the DeLong test were produced with R 4.0.3. Receiver-operating characteristic (ROC) curve analysis and the area under the curve (AUC) were used to evaluate the diagnostic value of each variable or model. Sensitivity and specificity were determined at the maximum Youden index. In all tests, p < 0.05 (two-sided) was considered significant.
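Choosing the operating point by the maximum Youden index (J = sensitivity + specificity − 1) can be sketched as below. This is a minimal illustration in plain Python, not the SPSS procedure used in the study; the function name, the scores, and the labels are all hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) that maximizes J = Se + Sp - 1.

    A case is called positive when its score is >= cutoff.
    Labels are 1 (malignant) or 0 (benign).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        se, sp = tp / n_pos, tn / n_neg
        j = se + sp - 1
        # Keep the first cutoff that reaches the highest J.
        if best is None or j > best[0]:
            best = (j, cutoff, se, sp)
    return best[1], best[2], best[3]


# Hypothetical model outputs and pathology labels, for illustration only.
scores = [0.10, 0.30, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

cutoff, se, sp = youden_cutoff(scores, labels)
print(f"cutoff = {cutoff}, Se = {se:.2f}, Sp = {sp:.2f}")
```

On this toy data the best cutoff is 0.45, which captures every malignant case while misclassifying one of the four benign cases.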

RESULTS

Patient characteristics

A total of 492 participants with definitive diagnosis were included in the training (n = 363) and validation (n = 129) data sets. The clinical information of the participants is summarized in Table 1.

The distribution of benign and malignant nodules in the training and validation data sets according to lesion size (small: <10 mm, intermediate: 10-30 mm, and large: ≥30 mm) is displayed in Figure S1. The small nodules were mostly BPNs, while the large ones tended to be MPNs. The proportions of BPNs and MPNs among the intermediate (10-30 mm) nodules were similar (Figure S1).

Characteristic selection

According to the univariate and ROC analyses, the potential indicators for MPNs included lnCEA, diameter, number, type, morphology, margin, cavity, spiculation, vascular notch sign, lobulation, spines, pleural indentation, calcification, emphysema, and mediastinal lymph node enlargement (Table S1). Six indicators (lnCEA, number, spiculation, vascular notch sign, lobulation, and mediastinal lymph node enlargement) were identified as independent diagnostic biomarkers for MPNs by multivariate logistic regression analysis (Table S1 and Figure 2A–F). The AUCs of the six predictors for distinguishing MPNs from BPNs ranged from 0.625 to 0.783. Among them, lobulation had the highest diagnostic value, with an AUC of 0.783.
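For a single predictor, the AUC can be read as the probability that a randomly chosen MPN scores higher than a randomly chosen BPN, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the ten cases are hypothetical and are not the study data. For a yes/no CT sign, this value equals the mean of sensitivity and specificity.

```python
def auc_pairwise(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical binary CT sign (1: present, 0: absent) in 5 malignant and 5 benign nodules.
sign = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
malignant = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

auc = auc_pairwise(sign, malignant)
sensitivity = 4 / 5  # sign present in 4 of 5 malignant nodules
specificity = 4 / 5  # sign absent in 4 of 5 benign nodules
print(f"AUC = {auc:.3f}, (Se + Sp) / 2 = {(sensitivity + specificity) / 2:.3f}")
```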

FIGURE 2. Receiver-operating characteristic (ROC) curve analysis was performed to evaluate the diagnostic value of (A) lnCEA, (B) nodule number, (C) spiculation, (D) vascular notch sign, (E) lobulation, and (F) lymph node enlargement in the training data set. AUC, area under the curve

Model construction and validation

Four machine learning algorithms (logistic regression, artificial neural networks [ANN], decision trees [DTs], and SVM) were used to construct six models (logistic, MLP, RBF, C5.0, CHAID, and SVM) via SPSS Modeler 18.0. Among them, the MLP and RBF models were based on ANN, while the CHAID and C5.0 models were DTs. The logistic, MLP, RBF, CHAID, and SVM models included six indicators (lnCEA, number, spiculation, vascular notch sign, lobulation, and mediastinal lymph node enlargement), while the C5.0 model included five indicators (all except mediastinal lymph node enlargement). Of the six models, the logistic model had the highest diagnostic efficiency (Table 2 and Figure 3A,B). The DeLong test showed no significant difference between the AUCs of the logistic model in the training and validation data sets (p = 0.371) (Table 2).

TABLE 2 The diagnostic efficiency of the six models in training and validation data sets

Model	Training data set				Validation data set				Delong test
	AUC (95% CI)	Se (%)	Sp (%)	Accuracy (%)	AUC (95% CI)	Se (%)	Sp (%)	Accuracy (%)	P value
Logistic	0.912 (0.881, 0.944)	84.1	84.8	84.3	0.882 (0.824, 0.940)	80.4	75.7	79.1	0.371
RBF	0.788 (0.741, 0.835)	68.6	82.6	72.1	0.831 (0.763, 0.899)	71.7	94.6	78.3	0.309
MLP	0.783 (0.737, 0.829)	64.2	87.0	70.0	0.825 (0.756, 0.894)	68.5	89.2	74.4	0.327
C5.0	0.751 (0.698, 0.805)	68.6	88.0	73.5	0.741 (0.652, 0.830)	59.8	89.2	68.2	0.848
CHAID	0.719 (0.664, 0.775)	51.3	95.7	62.6	0.666 (0.562, 0.770)	33.7	94.6	51.2	0.374
SVM	0.778 (0.727, 0.830)	68.6	80.4	71.6	0.759 (0.675, 0.843)	64.1	81.1	69.0	0.701

Abbreviations: AUC, area under the curve; CHAID, chi-squared automatic interaction detection; MLP, multilayer perceptron; RBF, radial basis function; Se, sensitivity; Sp, specificity; SVM, support vector machines.

FIGURE 3. Receiver-operating characteristic (ROC) curve analysis was performed to evaluate the diagnostic value of six models in the training (A) and validation data sets (B)

The mathematical model established by logistic regression was as follows: P(MPN) = e^x/(1 + e^x), where x = −2.857 + (1.116 × mediastinal lymph node enlargement) + (2.078 × lobulation) + (0.997 × vascular notch sign) + (1.316 × spiculation) + (1.493 × nodule number) + (0.796 × lnCEA); e is the base of the natural logarithm, and mediastinal lymph node enlargement (0: no, 1: yes), lobulation (0: no, 1: yes), vascular notch sign (0: no, 1: yes), spiculation (0: no, 1: yes), and nodule number (0: >1, 1: 1) are derived from the CT report.
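The equation above maps the linear predictor x to a probability of MPN. A minimal sketch of that calculation follows, using the coefficients exactly as printed; the dictionary keys and function name are hypothetical, and the second patient is invented for illustration. This is not a validated clinical calculator.

```python
import math

# Intercept and coefficients as printed in the article's logistic model.
INTERCEPT = -2.857
COEFFICIENTS = {
    "mediastinal_lymph_node_enlargement": 1.116,  # 0: no, 1: yes
    "lobulation": 2.078,                          # 0: no, 1: yes
    "vascular_notch_sign": 0.997,                 # 0: no, 1: yes
    "spiculation": 1.316,                         # 0: no, 1: yes
    "nodule_number": 1.493,                       # 0: more than one, 1: solitary
    "ln_cea": 0.796,                              # natural log of CEA in ng/ml
}


def mpn_probability(features):
    """Probability of a malignant nodule, e^x / (1 + e^x), for one patient."""
    x = INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-x))


# Inputs of the article's nomogram example: all CT indicators 0 and lnCEA = -0.2.
benign_example = dict.fromkeys(COEFFICIENTS, 0)
benign_example["ln_cea"] = -0.2

# Hypothetical patient: solitary, lobulated, spiculated nodule with CEA = 5 ng/ml.
suspicious_example = dict(benign_example, lobulation=1, spiculation=1,
                          nodule_number=1, ln_cea=math.log(5.0))

p_benign = mpn_probability(benign_example)
p_suspicious = mpn_probability(suspicious_example)
print(f"{p_benign:.3f} {p_suspicious:.3f}")
```

For the first set of inputs the formula gives a probability of about 4.7%, close to the 4.75% the article reads from its nomogram for the same values.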

Model comparison

Compared with CT or lnCEA alone, the model had a higher AUC and sensitivity than either, and a higher specificity than CT alone, making up for the deficiencies and combining the advantages of the two detection methods. Although the reduction was modest, the model lowered the FPR of CT from 25.6% to 17.8% in the combined data sets (Table 3).

TABLE 3 The diagnostic value of CT, lnCEA and their combination in the training and validation data sets

Markers	AUC (95% CI)	Se (%)	Sp (%)	FPR (%)	PPV (%)	NPV (%)	Accuracy (%)
Training data set
CT	0.887 (0.849, 0.929)	82.7	81.5	18.5	93.0	61.5	82.4
lnCEA	0.707 (0.651, 0.764)	29.2	94.6	5.4	94.1	31.2	45.7
CT + lnCEA	0.912 (0.880, 0.944)	84.1	84.8	15.2	94.2	64.5	84.3
Validation data set
CT	0.794 (0.713, 0.876)	78.3	56.8	43.2	81.8	51.2	72.1
lnCEA	0.839 (0.770, 0.907)	44.6	94.6	5.4	95.4	40.7	58.9
CT + lnCEA	0.882 (0.824, 0.940)	80.4	75.7	24.3	89.2	60.9	79.1
Combined
CT	0.883 (0.852, 0.914)	81.5	74.4	25.6	90.0	58.9	79.7
lnCEA	0.747 (0.702, 0.792)	33.1	94.6	5.4	94.5	33.4	49.2
CT + lnCEA	0.903 (0.876, 0.930)	83.2	82.2	17.8	92.9	63.5	82.9

Abbreviations: AUC, area under the curve; CEA, carcinoembryonic antigen; FPR, false positive rate; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.
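The columns of Table 3 all follow from a 2 × 2 confusion table. The sketch below recomputes the validation-set row for CT + lnCEA. The counts are not printed in the article; they are reconstructed here, as an assumption, from the reported percentages and the validation cohort of 92 MPNs and 37 BPNs (74 of 92 MPNs and 28 of 37 BPNs classified correctly). The function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics, in percent, from confusion-table counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    metrics = {
        "Se": sensitivity,
        "Sp": specificity,
        "FPR": 1 - specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
    return {name: round(100 * value, 1) for name, value in metrics.items()}


# Reconstructed counts (assumption): 92 MPNs -> 74 detected, 18 missed;
# 37 BPNs -> 28 correctly excluded, 9 false positives.
validation = diagnostic_metrics(tp=74, fn=18, tn=28, fp=9)
print(validation)
```

With these counts the function reproduces the reported row: Se 80.4, Sp 75.7, FPR 24.3, PPV 89.2, NPV 60.9, and accuracy 79.1.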

Based on the validation data set, the logistic model displayed a higher AUC (0.882) than the Peking University model (AUC = 0.712) (Figure 4 and Table 4). The formula of the Peking University model is P(MPN) = e^x/(1 + e^x), where x = −4.496 + (0.07 × age) + (0.676 × diameter) + (0.736 × spiculation) + (1.267 × family history of cancer) − (1.615 × calcification) − (1.408 × border); e is the base of the natural logarithm, and the four binary variables (spiculation, family history of cancer, calcification, and border) are coded 1 for yes and 0 for no. The logistic model also showed a higher AUC than the Mayo model (AUC = 0.745) (Figure 4 and Table 4), whose formula is P(MPN) = e^x/(1 + e^x), where x = −6.8272 + (0.0391 × age) + (0.7917 × smoking history) + (1.3388 × cancer history) + (0.1274 × diameter) + (1.0407 × spiculation) + (0.7838 × the upper lobe); the four binary variables (smoking history, cancer history, spiculation, and the upper lobe) are coded 1 for yes and 0 for no.

FIGURE 4. Based on the validation data set, the receiver-operating characteristic (ROC) curve analyses of the three models (Peking University model, Mayo model, and the logistic model) were performed and compared

TABLE 4 Comparison of diagnostic efficiency about the three models

Models	AUC (95% CI)	Se (%)	Sp (%)	FPR (%)	PPV (%)	NPV (%)	Accuracy (%)
Peking University model	0.712 (0.607, 0.817)	68.5	70.3	29.7	85.2	47.3	38.8
Mayo model	0.745 (0.653, 0.837)	62.0	75.7	24.3	86.4	44.5	37.7
Logistic model	0.882 (0.824, 0.940)	80.4	75.7	24.3	89.2	60.9	79.1

Abbreviations: AUC, area under the curve; FPR, false positive rate; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.

Model visualization

A nomogram was produced in R from the logistic regression model: each variable is assigned a score, and the total score corresponds to the probability of MPN. As shown in Figure 5, we selected a patient with a BPN as an example, with the following values of the CT variables and lnCEA: vascular notch sign = 0, mediastinal lymph node enlargement = 0, spiculation = 0, nodule number = 0, lobulation = 0, and lnCEA = −0.2. The resulting probability of MPN was 4.75%.

FIGURE 5. Nomogram for predicting the probability of MPN. To use it, add the scores for each CT characteristic and lnCEA, locate the sum on the “total points” axis, and draw a line straight down to determine the probability of MPN. **p < 0.01, ***p < 0.001. MLNE, mediastinal lymph node enlargement; MPN, malignant pulmonary nodule; NO, nodule number; Pr(MPN), probability of MPN; VNS, vascular notch sign

Stratified analyses

In the CEA-negative PNs, as shown in Table S2, the logistic model showed an AUC (95% CI) of 0.891 (0.851, 0.932) with 78.6% sensitivity, 87.4% specificity, and 93.2% positive predictive value (PPV). In the intermediate-size (10-30 mm) PNs, the logistic model indicated an AUC (95% CI) of 0.856 (0.797, 0.916) with 77.6% sensitivity, 78.9% specificity, and 88.3% PPV. For the solitary PNs, the logistic model displayed an AUC (95% CI) of 0.888 (0.842, 0.934) with 86.6% sensitivity, 77.1% specificity, and 94.3% PPV. Meanwhile, in the solid PNs (SPNs), the logistic model showed an AUC (95% CI) of 0.928 (0.898, 0.920) with 88.5% sensitivity, 83.3% specificity, and 94.3% PPV. In addition, the logistic model exhibited an AUC (95% CI) of 0.857 (0.802, 0.911) for early-stage (TNM I + II) MPNs with 70.6% sensitivity, 84.8% specificity, and 81.1% PPV; and 0.912 (0.879, 0.945) for non–small cell lung cancer (NSCLC) with 83.8% sensitivity, 84.8% specificity, and 93.5% PPV.

Intermediate-size nodule analysis

In the intermediate-size (10-30 mm) nodule category, the proportions of MPNs and BPNs were similar (Figure S1A,B). For early-stage lung cancer, the logistic model showed an AUC (95% CI) of 0.822 (0.743, 0.900) with 67.3% sensitivity, 78.9% specificity, and 77.1% PPV (Figure 6A and Table 5). CEA-negative nodules and solid nodules made up a high proportion of this category (Table 5). In the CEA-negative PNs, the logistic model displayed an AUC (95% CI) of 0.835 (0.764, 0.905) with 72.3% sensitivity, 83.3% specificity, and 88.2% PPV (Figure 6B and Table 5). For the SPNs, the logistic model exhibited an AUC (95% CI) of 0.885 (0.824, 0.946) with 85.1% sensitivity, 78.1% specificity, and 87.5% PPV (Figure 6C and Table 5).

FIGURE 6. Predictive outcome of the logistic model compared with the gold standard pathology assessment for intermediate nodules, based on the training data set. Samples were sorted according to (A) clinical stage, (B) CEA concentration, and (C) solid nodules. BPN, benign pulmonary nodule; CEA, carcinoembryonic antigen; MPN, malignant pulmonary nodule

TABLE 5 Stratified analyses of the model's diagnostic efficiency for intermediate (10-30 mm) nodules (N = 159)

Characteristics		N	AUC (95%CI)	P	Se (%)	Sp (%)	PPV (%)	NPV (%)	Accuracy (%)
TNM	I + II	55	0.822 (0.743, 0.900)	<0.001	67.3	78.9	77.1	69.5	72.9
CEA (ng/ml)	<5	131	0.835 (0.764, 0.905)	<0.001	72.3	83.3	88.2	63.5	76.3
Nodule type	Solid	115	0.885 (0.824, 0.946)	<0.001	85.1	78.1	87.5	74.4	82.6

Abbreviations: AUC, area under the curve; CEA, carcinoembryonic antigen; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.

DISCUSSION

Due to its noninvasiveness and high sensitivity, CT is widely applied in screening populations at high risk of lung cancer.⁶ In the present study, 14 CT characteristics were collected, and 5 of them (nodule number, mediastinal lymph node enlargement, vascular notch sign, spiculation, and lobulation) were selected as independent risk factors for differentiating MPNs from BPNs through univariate, ROC, and multivariable analyses.

Numerous studies have shown that lobulation can effectively distinguish BPNs from MPNs.^18–20 Similarly, in this study lobulation, as an independent CT predictor for distinguishing MPNs from BPNs, had the highest AUC (0.783), followed by spiculation (AUC = 0.750, p < 0.001) and vascular notch sign (AUC = 0.675, p < 0.001). Spiculation, which appears as numerous short, straight, radial line shadows around the nodule, is reported to be a specific sign of lung cancer.^21,22 Vascular notch sign, in which one or several vessels around the PN are enlarged and connected with the nodule, has also been reported to help distinguish MPNs from BPNs. Previous research confirmed that multiple nodules tend to be inflammatory nodules or metastatic lung cancer.²³ A solitary PN represents a common diagnostic challenge in clinical practice. In our study, nodule number showed a significant ability to distinguish MPNs from BPNs (AUC = 0.625, p < 0.001). Lymph node enlargement is usually related to inflammatory stimulation or tumor metastasis and has been associated with the metastasis of lung cancer.^24,25 In this study, mediastinal lymph node enlargement also distinguished MPNs from BPNs (AUC = 0.657, p < 0.001).

Although CT, including low-dose CT (LDCT), can detect small nodules effectively, it had a high FPR of 96.4% in a study on lung cancer screening in a high-risk population.⁶ Therefore, it is necessary to combine CT with other methods to reduce its FPR and promote its application in clinical practice. Serum biomarkers have been reported to play an important role in distinguishing MPNs from BPNs.²⁶ They also have the advantages of being noninvasive, convenient to obtain, and low in cost, which has prompted their extensive application.

Abnormal serum CEA expression assists in the diagnosis of lung cancer in the clinic.^12,27,28 For distinguishing MPNs from BPNs, serum CEA has a higher specificity than other serum biomarkers, such as neuron-specific enolase (NSE), but a low sensitivity.²⁹ In this study, we converted CEA to lnCEA by natural-logarithm transformation to narrow the data range, reduce heteroscedasticity, and stabilize the data.

Data mining is a technology that emerged with the development of databases and artificial intelligence as a further extension of statistical methods. It helps researchers discover latent information and rules in data. Currently, the commonly used methods are logistic regression, ANN,³⁰ SVM,³¹ and DT.³² Logistic regression is frequently used in data mining to support disease diagnosis, including distinguishing MPNs from BPNs.¹⁷ The ANN is a simplified model that imitates signal transmission between neurons in the human brain and can be used for classification and prediction. SVM is based on structural risk minimization, which has great advantages in building high-risk models with small sample sizes. DT is a classification method that follows a tree structure and is one of the earliest and most important methods used to solve classification problems. It can handle both linear and nonlinear relationships between variables. In the present study, six diagnostic models based on four different algorithms were constructed via SPSS Modeler. Of these, the logistic model had the best diagnostic efficiency. It displayed an AUC of 0.912 in the training data set with 84.1% sensitivity and 84.8% specificity, and it remained robust in the validation data set with an AUC of 0.882.

When the two data sets were combined, the logistic model reduced the FPR by 7.8 percentage points (from 25.6% to 17.8%) compared with CT alone (Table 3). At the same time, compared with lnCEA at the clinical cutoff value (CEA = 5 ng/ml), the logistic model improved the diagnostic sensitivity from 33.1% to 83.2%, while the specificity decreased by only 12.4 percentage points. These results show that the logistic model combines the advantages of CT and CEA to distinguish MPNs from BPNs effectively and makes up for the disadvantages of applying either alone.
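The comparison in this paragraph rests on differences between percentages, which are changes in percentage points rather than relative changes. A short sketch of the arithmetic, using only the combined-data-set values from Table 3 (the variable names are hypothetical, and the relative reduction is derived here, not reported in the article):

```python
# Combined-data-set values from Table 3, in percent.
fpr_ct, fpr_model = 25.6, 17.8
se_ln_cea, se_model = 33.1, 83.2
sp_ln_cea, sp_model = 94.6, 82.2

fpr_drop_points = round(fpr_ct - fpr_model, 1)               # absolute change
fpr_drop_relative = round((fpr_ct - fpr_model) / fpr_ct * 100, 1)  # relative to CT alone
se_gain_points = round(se_model - se_ln_cea, 1)
sp_loss_points = round(sp_ln_cea - sp_model, 1)

print(f"FPR: -{fpr_drop_points} points ({fpr_drop_relative}% relative reduction)")
print(f"Sensitivity: +{se_gain_points} points; specificity: -{sp_loss_points} points")
```

The 7.8-point drop corresponds to roughly a 30% relative reduction in false positives compared with CT alone.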

In stratified analyses, the logistic model displayed a stable and favorable ability to differentiate MPNs from BPNs. In particular, for CEA-negative patients, the logistic model showed 93.2% PPV and 81.4% accuracy, thus improving on the sensitivity of CEA. For PNs with a diameter < 10 mm, the logistic model exhibited good performance with a specificity of 100.0%. For early-stage MPNs and BPNs, the logistic model displayed a good PPV of 88.1%.

Moreover, the logistic model could accurately distinguish MPNs of different pathological types (small cell lung cancer, NSCLC, lung squamous cell carcinoma, and lung adenocarcinoma) from BPNs, with AUCs ranging from 0.801 to 0.959 at 84.8% specificity.

Guidelines for PNs propose different monitoring schemes based on diameter. Among PNs, intermediate nodules (10-30 mm) account for a large proportion and are difficult to classify. Patz reported that a CART model combining nodule size, α₁-antitrypsin (AAT), CEA, and SCC had a PPV of 88% in distinguishing MPNs from BPNs of intermediate diameter.³³ In the present study, the logistic model displayed an excellent differential diagnostic ability for MPNs and BPNs in the intermediate-diameter subgroup, with an AUC of 0.856 and a PPV of 88.3%. Within this subgroup, the model displayed AUCs of 0.835 and 0.885 for CEA-negative PNs and solid PNs, respectively.

The Peking University model and the Mayo model were earlier lung cancer prediction models for differentiating MPNs from BPNs.^16,17 Both models rely on clinical and radiological variables and do not involve serum biomarkers. Univariate and multivariate analyses were applied to screen independent risk factors, and spiculation was selected as a significant indicator for distinguishing MPNs from BPNs in all three models (Peking University model, Mayo model, and the logistic model). Calcification, diameter, and border, which were independent risk factors in the Peking University or Mayo models, showed no significant difference between MPNs and BPNs in our study, which might be due to differences in the inclusion criteria or sample heterogeneity. The logistic model possessed a higher diagnostic value (AUC = 0.882) than the Peking University model (AUC = 0.712) and the Mayo model (AUC = 0.745) based on the validation data set.

This study has limitations. Given the complexity of tumorigenesis, a single biomarker could easily introduce bias into the predictive model. Additional traditional and novel biomarkers should be incorporated, and more efficient strategies should be developed to improve the clinical management of PNs. To improve diagnostic efficiency, fusion algorithms need to be explored. In addition, this is a retrospective single-center study with a limited number of participants.

CONCLUSION

The logistic model combining five CT characteristics and lnCEA had a favorable diagnostic efficiency in distinguishing MPNs from BPNs. Moreover, the logistic model performed outstandingly in distinguishing CEA-negative and intermediate-diameter MPNs from BPNs in the early stage. In follow-up research, the sample size will be expanded and multicenter, prospective studies will be performed to verify the stability of the model and contribute to the diagnosis of lung cancer.

ACKNOWLEDGMENTS

We gratefully acknowledge Henan Key Laboratory of Pharmacology for Liver Diseases for providing platform support.

FUNDING INFORMATION

This work was supported by the Leading Talents of Science and Technology Innovation in Henan Province (Grant Number 20420051008), the National Natural Science Foundation of China (Grant Number 8167291), the Key Project of Discipline Construction of Zhengzhou University (Grant Number XKZDQY202009), and the Project of Basic Research Fund of Henan Institute of Medical and Pharmacological Sciences (Grant Number 2022BP0104).

DISCLOSURE

The authors have no conflict of interest.

ETHICAL APPROVAL

Approval of the research protocol: The research protocol was approved by the Medical Ethics Committee of the First Affiliated Hospital of Zhengzhou University (2021-KY-1057-002).

Informed Consent: N/A.

Registry and the Registration No. of the study/trial: N/A.

Animal Studies: N/A.

© 2022. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Computed tomography (CT), an efficient radiological technology, is used to detect lung cancer in the clinic. Carcinoembryonic antigen (CEA), a common tumor biomarker, is applied in the detection of various tumors. To combine the advantages of the two techniques and assist clinicians in optimizing lung cancer diagnostic schemes, we established a favorable model combining CT and CEA. In the study, univariate and multivariate analyses were performed to screen independent predictors in a training cohort of 271 patients with malignant pulmonary nodules (MPNs) and 92 with benign pulmonary nodules (BPNs). Six machine learning–based models involving five CT predictors (mediastinal lymph node enlargement, lobulation, vascular notch sign, spiculation, and nodule number) and lnCEA were constructed and validated in an independent cohort of 129 participants (92 MPNs and 37 BPNs) by SPSS Modeler. A nomogram and the DeLong test were generated by R software. The model established by logistic regression had the highest diagnostic efficiency (area under the curve [AUC] = 0.912). Moreover, the diagnostic ability of the logistic model in the validation cohort (AUC = 0.882, 80.4% sensitivity, 75.7% specificity) was higher than that of the Peking University model (AUC = 0.712, 68.5% sensitivity, 70.3% specificity) and the Mayo model (AUC = 0.745, 62.0% sensitivity, 75.7% specificity). Notably, for participants with intermediate (10‐30 mm), CEA‐negative nodules, the model reached an AUC of 0.835 (72.3% sensitivity, 83.3% specificity). The AUC for early lung cancer was 0.822, with 67.3% sensitivity and 78.9% specificity. In conclusion, this promising model presents a new diagnostic strategy for the clinic to distinguish MPNs from BPNs.

Details

Title

CT and CEA‐based machine learning model for predicting malignant pulmonary nodules

Author

Liu, Man¹; Zhou, Zhigang²; Liu, Fenghui³; Wang, Meng²; Wang, Yulin⁴; Gao, Mengyu²; Sun, Huifang²; Zhang, Xue⁴; Yang, Ting⁵; Ji, Longtao⁵; Li, Jiaqi⁴; Si, Qiufang⁵; Dai, Liping⁵; Ouyang, Songyun³

¹ Department of Respiratory and Sleep Medicine, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China; Henan Institute of Medical and Pharmaceutical Sciences & Henan Key Medical Laboratory of Tumor Molecular Biomarkers, Zhengzhou University, Zhengzhou, China
² Department of Radiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
³ Department of Respiratory and Sleep Medicine, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
⁴ Henan Institute of Medical and Pharmaceutical Sciences & Henan Key Medical Laboratory of Tumor Molecular Biomarkers, Zhengzhou University, Zhengzhou, China
⁵ Henan Institute of Medical and Pharmaceutical Sciences & Henan Key Medical Laboratory of Tumor Molecular Biomarkers, Zhengzhou University, Zhengzhou, China; BGI College, Zhengzhou University, Zhengzhou, China

Pages

4363-4373

Section

ORIGINAL ARTICLES

Publication year

2022

Publication date

Dec 2022

Publisher

John Wiley & Sons, Inc.

ISSN

13479032

e-ISSN

13497006

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1111/cas.15561

ProQuest document ID

2753540375
