1. Introduction
An important task in machine learning and artificial intelligence is to build a classifier for prediction and classification from available training samples or other prior information [1]. Statistical Learning Theory (SLT) plays an important role in prediction estimation and machine learning in that it provides a unified framework for solving finite-sample learning problems [2,3,4]. In the past few decades, the Support Vector Machine (SVM), a method based on SLT, has been widely applied to the construction of classifiers when only limited samples are available [5,6,7]. Unlike classifiers whose guarantees assume that the number of samples tends to infinity, SVM offers strong generalization and prediction capability after being trained on limited samples. SVM is one of many classification methods capable of handling large or small samples, nonlinear problems, and high-dimensional data [8]. However, a vital question in using SVM remains unresolved: how many samples are necessary to reach a given standard of prediction accuracy? To address this question, we devise a new approach that attends not only to the performance of the model on the training set but also to its prediction accuracy on unseen data. Typically, a model overfits when it fits the training data so well that it fails to generalize; accordingly, its ability to adapt to new data decreases. It is generally accepted that sample size is a primary factor in generalization performance, yet few studies have examined the quantitative relationship between sample size and generalization. In this paper, we use the medical diagnosis of lung cancer from computed tomography (CT) images, a setting in which large-scale samples are difficult to obtain in practice, as an example to study this relationship.
The construction of any classifier involves training and testing processes on the available samples. In this context, the training error reflects learning ability, whereas the testing error reflects prediction and generalization ability [9]. In SLT, the learning error is called the empirical risk (EMP) and the generalization error is called the expected risk (EXP) [10]. Minimizing the EMP is a necessary but not sufficient condition for building a classifier with good generalization ability, i.e., a low EXP [11]. In fact, a low EMP may coexist with a high EXP, since prediction and generalization ability depend on both the number of samples and the choice of classifier type. For the same EMP, a classifier with a simpler structure often generalizes better [12]. SLT provides the theoretical basis that connects EMP and EXP [13].
Recently, a number of methods have been proposed to diagnose lung cancer using CT images [14,15]. Invasive adenocarcinoma (IA) is a primary type of lung cancer and often presents as ground glass nodules (GGNs) in patients' CT images [16]. Accurately discriminating IA from non-IA based on GGNs has critical implications for providing an effective treatment plan. Existing studies on classifying IA and non-IA have mostly focused on minimizing EMP to improve classification accuracy, with very little consideration given to EXP. As a result, the generalization and prediction ability of the classifier is not guaranteed in practice. To date, little is known about the interrelation between EXP and EMP, and more specifically, it remains unclear how many samples are necessary under a given requirement of prediction accuracy. In this paper, the effect of different sample sizes on EXP is explored, and the minimum sample size needed to achieve the desired generalization requirements is estimated.
2. Materials and Methods
For this study, we retrospectively collected 436 early-stage GGN samples from Tianjin General Hospital, in which the ratio of IA to non-IA samples was 216:220. The segmented CT images of GGNs are shown in Figure 1. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and all experiments were approved by the ethics committee of General Hospital of Tianjin Medical University.
After obtaining a sample set of GGNs, we extracted features from each GGN sample for classifying IA and non-IA. Previous studies have demonstrated how the powerful modeling and feature learning capabilities of deep learning lead to remarkable results [17,18,19,20,21], especially in biomedical image processing applications based on quantitative CT images [22,23,24].
Transfer learning (TL) is a growing research area in deep learning. TL transfers the knowledge obtained by a pre-trained network to other image processing tasks [25]. TL methods, when combined with a convolutional neural network (CNN), have been shown to be effective solutions to problems with relatively small datasets [26]. Specifically, the deep feature extraction capability of TL methods can use the activations of a CNN model's layers to construct the related feature vectors [27]. In the medical field, using mainstream networks as feature extractors has become a typical and general way to deal with classification problems [28,29,30].
In this work, we use the AlexNet model as the feature extractor to construct a 4096-dimensional feature vector for each GGN sample. AlexNet was the first deep CNN model to win the ImageNet competition. It has shown efficacy in image processing [31,32,33] and is convenient to implement and to fine-tune for new tasks. AlexNet has no overly complex or deep architecture, which reduces computation time and hardware requirements. In addition, our previous research [34] has shown the effectiveness of features extracted by the fully connected (FC) layer of AlexNet for GGN classification.
To find the key features in the sample set, we first use the p-value calculated by Analysis of Variance (ANOVA) as a criterion for feature selection [35]. Typically, a lower p-value corresponds to a more significant between-class difference. The top ten features with the lowest p-values are selected to form a feature subset.
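A minimal sketch of this selection step (illustrative, not the authors' code) using scikit-learn's one-way ANOVA F-test, with random data standing in for the 4096-dimensional AlexNet features:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096))     # stand-in for AlexNet FC-layer features
y = rng.integers(0, 2, size=200)     # stand-in IA / non-IA labels

_, p_values = f_classif(X, y)        # per-feature one-way ANOVA p-values
top10 = np.argsort(p_values)[:10]    # ten features with the lowest p-values
X_subset = X[:, top10]               # 10-dimensional feature subset
print(X_subset.shape)                # (200, 10)
```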
According to the general principle of SLT, EXP and EMP obey the following relationship with probability of at least (1 − η) for a limited number of samples n [36]:
$$R_{\mathrm{exp}}(w) \le R_{\mathrm{emp}}(w) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} \quad (1)$$
where the value of EXP is denoted as Rexp(w) and the value of EMP as Remp(w). Equation (1) can be simplified as
$$R_{\mathrm{exp}}(w) \le R_{\mathrm{emp}}(w) + \Phi(n/h) \quad (2)$$
where Φ(n/h) is the confidence range; η is the confidence limit, i.e., the bound holds with probability of at least (1 − η); n is the number of training samples; and h is the VC dimension, the maximum number of samples for which the selected type of classifier can realize all 2h possible label assignments [37]. Only a few types of classifiers have known VC dimensions. For the most commonly used linear classifiers, the VC dimension is (d + 1), where d is the number of features used; the widely used linear SVM belongs to this class. When a linear classifier handles a multi-class problem in a one-vs-rest manner, the effective VC dimension is (d + 1) × c, where c is the number of classes [38]. From Equation (1), the EXP of a classifier consists of two parts: the EMP and the confidence range, which depends on the VC dimension and the number of training samples. As the number of samples n increases, the confidence range decreases monotonically, and the probability that Equation (1) holds increases monotonically.

We build four subsets of different sample sizes by random sampling. The SVM, as a practical realization of SLT, is used for classifying IA and non-IA. We quantitatively estimate the effect of sample size on generalization ability and compare the EXP computed using Equation (2) with the actual test error from the experiments. The schematic diagram of the experimental design is shown in Figure 2.
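As a numerical sketch (not the authors' code), the bound in Equation (2) can be evaluated directly; the form of Φ(n/h) below is Vapnik's standard confidence range:

```python
import math

def phi(n, h, eta):
    """Confidence range: sqrt((h(ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def exp_risk_bound(r_emp, n, h, eta):
    # Upper bound on Rexp(w), holding with probability of at least 1 - eta
    return r_emp + phi(n, h, eta)

# d = 10 selected features with a linear binary classifier: h = d + 1 = 11.
print(round(exp_risk_bound(0.0, n=100, h=11, eta=0.05), 4))   # 0.6877
```

With Remp(w) = 0 and n = 100 this reproduces the η = 0.05 theoretical value reported in Table 1.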
3. Results
First, 36 of the 436 GGN samples were randomly selected as a fixed test set. Four training subsets with increasing sample sizes were then drawn from the remaining 400 samples, keeping the proportion of positive (IA) to negative (non-IA) samples at 1:1; the four training sets thus contained 100, 200, 300, and 400 samples, respectively. We repeated the experiment for each of the four training sets with the fixed test set. For each training-set size, the 10-dimensional feature subsets obtained from AlexNet were fed into the SVM for classifying IA and non-IA, and the EXP on the test set and the EMP on the training set were computed separately. We compared the actual values from the test set with the theoretical values computed by Equation (1) at the 90% (η = 0.10) and 95% (η = 0.05) confidence levels. The experimental results are shown in Table 1.
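The experimental loop can be sketched as follows (illustrative code, not the authors'; random data stands in for the 10 selected features and the IA/non-IA labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X10 = rng.normal(size=(436, 10))   # placeholder for the 10 selected features
y = rng.integers(0, 2, size=436)   # placeholder IA / non-IA labels

# Hold out a fixed test set of 36 samples; draw training subsets from the rest.
X_pool, X_test, y_pool, y_test = train_test_split(
    X10, y, test_size=36, random_state=0)

for n in (100, 200, 300, 400):
    idx = rng.choice(len(X_pool), size=n, replace=False)
    clf = SVC(kernel="linear").fit(X_pool[idx], y_pool[idx])
    r_emp = 1 - clf.score(X_pool[idx], y_pool[idx])   # EMP on training subset
    r_test = 1 - clf.score(X_test, y_test)            # actual risk on test set
    print(n, round(r_emp, 3), round(r_test, 3))
```

A linear kernel is used here so that the (d + 1) VC-dimension formula for linear classifiers applies.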
Table 1 shows that the actual values from the test samples and the theoretical upper bounds on Rexp(w) computed from Equation (1) follow a nearly consistent trend as the sample size increases. Their difference gradually decreases as n grows, and the actual value never exceeds the theoretical upper bound on Rexp(w). For a fixed number of features, the confidence range Φ(n/h) decreases as the sample size increases. These results demonstrate that the generalization ability of the invasiveness classification model improves with sample size.
From the above, minimizing only the EMP is a necessary but not sufficient condition when the number of samples is limited. To increase the generalization and prediction capability of the model, the selected classifier type should have as small a VC dimension as possible; that is, the confidence range must be minimized as well. This process of selecting the classifier model and learning algorithm can be referred to as adjusting the confidence range. However, for lack of theoretical guidance, most existing methods carry out this process based on prior knowledge and experience, making it heavily dependent on the researcher.
In practice, another critical challenge for building an effective classifier is the question of the number of samples necessary under a certain requirement of prediction accuracy. According to Equation (1), we continue to explore the lower bound on sample size under various generalization requirements. The experimental results are shown in Table 2.
From Table 1, we see that the values of Remp(w) are all below 1% when the VC dimension and confidence limit are unchanged. We therefore suppose Remp(w) = 1%, allowing a small degree of classification error, which corresponds to a training-set accuracy of 99%. We then consider classification tasks that use only two features, i.e., d = 2. Recall that for binary and multi-class tasks with linear classifiers, the corresponding VC dimensions are (d + 1) and (d + 1) × c, respectively; for tasks with 2, 4, 6, 8, and 10 classes, the VC dimension is therefore 3, 12, 18, 24, and 30. Finally, plugging the VC dimensions, the confidence limits η (η = 0.05, 0.1), and the supposed Remp(w) into Equation (1), the achievable generalization bound under different sample sizes can be calculated, as shown in Table 2. Table 2 thus also gives the minimum sample size required to meet different generalization requirements under various classification tasks: if GGNs are divided into two, four, or six classes, at least 6000, 12,000, or 18,000 samples, respectively, are required to achieve approximately 90% generalization ability when using two features.
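The inverse computation, finding the smallest n on Table 2's 3000-sample grid whose bound from Equation (1) meets a target EXP, can be sketched as follows (illustrative code, not the authors'; Remp(w) = 1% as assumed above):

```python
import math

def phi(n, h, eta):
    # Confidence range from Equation (1): sqrt((h(ln(2n/h)+1) - ln(eta/4)) / n)
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def min_sample_size(target_exp, h, eta, r_emp=0.01, step=3000, n_max=10**6):
    """Smallest n on a grid of `step` with r_emp + phi(n, h, eta) <= target_exp."""
    n = step
    while n <= n_max:
        if r_emp + phi(n, h, eta) <= target_exp:
            return n
        n += step
    return None

# Binary task with two features: h = d + 1 = 3; target EXP of 0.10
# (i.e., 90% generalization ability) at the 95% confidence level.
print(min_sample_size(0.10, h=3, eta=0.05))   # 6000
```

On the same 3000-sample grid as Table 2, this reproduces the 6000-sample requirement for the two-class case.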
Furthermore, under a given expected prediction accuracy and η = 0.05, we compare the sample sizes actually used in published studies with the theoretical lower bounds computed from Equation (1); the actual cases are taken from Diego et al. [39], Mohammad et al. [40], and Dai [41], as shown in Table 3. All the actual sample sizes are greater than the computed lower bounds, which validates the sample-size bound estimated by Equation (1).
4. Conclusions
SLT plays an important role in building a classifier for prediction and classification in the medical diagnosis of lung cancer when only limited training samples or other prior information are available. In this paper, SLT was used to analyze two key issues in classifier design: computing the generalization ability of a classifier from the available samples, and its inverse problem, namely how many samples are necessary under a given requirement of prediction accuracy. This work offers guidance for establishing the necessary sample size under a given prediction-accuracy requirement, as in the medical diagnosis of lung cancer. Our experimental results validate the method on the task of classifying IA and non-IA. In the future, we will continue exploring the connection between classifier type and generalization ability within SLT.
Conceptualization, C.M. and S.Y.; methodology, C.M. and S.Y.; software, C.M.; validation, C.M.; writing—original draft preparation, C.M. and S.Y. All authors have read and agreed to the published version of the manuscript.
The study was conducted in accordance with the Declaration of Helsinki, and approved by the ethics committee of General Hospital of Tianjin Medical University (IRB2020-YX-145-01, 2020/12/31).
Patient consent was waived due to the secondary use of medical record specimens by the ethics committee of General Hospital of Tianjin Medical University.
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.
The authors declare no conflict of interest.
Figure 1. Examples of IA and non-IA nodules images. (a–d) Segmented IA nodules, (e–h) segmented non-IA nodules. (a,e) Larger volume; (b,f) smaller volume; (c,g) more solid components; (d,h) less solid components.
Table 1. Comparison of theoretical and actual values of Rexp(w).

| Number of Samples | Remp(w) | Φ(n/h) (η = 0.05) | Theoretical Rexp(w) (η = 0.05) | Φ(n/h) (η = 0.10) | Theoretical Rexp(w) (η = 0.10) | Rexp(w) in Test Set |
|---|---|---|---|---|---|---|
| n = 100 | 0 | 0.6877 | 0.6877 | 0.6826 | 0.6826 | 0.361 |
| n = 200 | 0.005 | 0.5240 | 0.5290 | 0.5207 | 0.5257 | 0.222 |
| n = 300 | 0.003 | 0.4449 | 0.4479 | 0.4423 | 0.4453 | 0.194 |
| n = 400 | 0.002 | 0.3954 | 0.3974 | 0.3932 | 0.3952 | 0.139 |
Table 2. The relationship between EXP and sample size (d = 2, Remp(w) = 1%).

| Confidence | Classes | n = 3000 | n = 6000 | n = 9000 | n = 12,000 | n = 15,000 | n = 18,000 | n = 21,000 | n = 24,000 |
|---|---|---|---|---|---|---|---|---|---|
| 95% (η = 0.05) | c = 2 | 0.1103 | **0.0833** | 0.0710 | 0.0635 | 0.0583 | 0.0544 | 0.0514 | 0.0490 |
| | c = 4 | 0.1841 | 0.1386 | 0.1176 | **0.1047** | 0.0957 | 0.0890 | 0.0838 | 0.0795 |
| | c = 6 | 0.2157 | 0.1624 | 0.1377 | 0.1225 | 0.1120 | **0.1040** | 0.0978 | 0.0928 |
| | c = 8 | 0.2416 | 0.1820 | 0.1542 | 0.1372 | 0.1253 | 0.1164 | 0.1094 | **0.1037** |
| | c = 10 | 0.2639 | 0.1989 | 0.1686 | 0.1499 | 0.1369 | 0.1272 | 0.1195 | 0.1132 |
| 90% (η = 0.10) | c = 2 | 0.1091 | **0.0825** | 0.0704 | 0.0630 | 0.0578 | 0.0540 | 0.0510 | 0.0486 |
| | c = 4 | 0.1835 | 0.1382 | 0.1172 | **0.1044** | 0.0955 | 0.0888 | 0.0836 | 0.0793 |
| | c = 6 | 0.2151 | 0.1621 | 0.1374 | 0.1223 | 0.1117 | **0.1038** | 0.0976 | 0.0926 |
| | c = 8 | 0.2411 | 0.1817 | 0.1540 | 0.1370 | 0.1251 | 0.1163 | 0.1093 | **0.1036** |
| | c = 10 | 0.2634 | 0.1986 | 0.1683 | 0.1497 | 0.1367 | 0.1270 | 0.1193 | 0.1131 |

Note: The values in bold mark the achievement of approximately 90% generalization ability.
Table 3. Comparison between the actual sample sizes and the theoretical lower bounds on sample size estimated from Equation (1).

| Case | Number of Classes | Number of Features | EXP | Actual Sample Size | Theoretical Lower Bound (η = 0.05) |
|---|---|---|---|---|---|
| Diego et al. [39] | c = 2 | d = 2 | 0.32 | 301 | 235 |
| Mohammad et al. [40] | c = 2 | d = 10 | 0.31 | 1000 | 776 |
| Dai [41] | c = 2 | d = 40 | 0.47 | 1000 | 958 |
References
1. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science; 2015; 349, pp. 255-260. [DOI: https://dx.doi.org/10.1126/science.aaa8415] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26185243]
2. Golino, H.F.; Epskamp, S. Exploratory graph analysis: A new approach for estimating the number of dimensions in psychological research. PLoS ONE; 2017; 12, e0174035. [DOI: https://dx.doi.org/10.1371/journal.pone.0174035] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28594839]
3. Ha, M.H.; Tian, J. The theoretical foundations of statistical learning theory based on fuzzy number samples. Inf. Sci.; 2008; 178, pp. 3240-3246. [DOI: https://dx.doi.org/10.1016/j.ins.2008.03.025]
4. Loh, P.L. On lower bounds for statistical learning theory. Entropy; 2017; 19, 617. [DOI: https://dx.doi.org/10.3390/e19110617]
5. Jain, S. Computer-aided detection system for the classification of non-small cell lung lesions using SVM. Curr. Comput. Aided Drug Des.; 2020; 16, pp. 833-840. [DOI: https://dx.doi.org/10.2174/1573409916666200102122021]
6. Wang, J.; Liu, X.; Dong, D.; Song, J.; Xu, M.; Zang, Y.; Tian, J. Prediction of malignant and benign Lung tumors using a quantitative radiomic method. Acta Autom. Sin.; 2017; 43, pp. 2109-2114.
7. Ye, Y.; Tian, M.; Liu, Q.; Tai, H.-M. Pulmonary nodule detection using v-net and high-level descriptor based SVM classifier. IEEE Access; 2020; 8, pp. 176033-176041. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3026168]
8. Ladayya, F.; Purnami, S.W.; Irhamah. Fuzzy Support Vector Machine for Microarray Imbalanced Data Classification. Proceedings of the 13th IMT-GT International Conference on Mathematics, Statistics and their Applications (ICMSA); Universiti Utara Malaysia, Kedah, Malaysia, 4–7 December 2017.
9. Ardila, D.; Kiraly, A.P.; Bharadwaj, S.; Choi, B.; Reicher, J.J.; Peng, L.; Tse, D.; Etemadi, M.; Ye, W.; Corrado, G. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med.; 2019; 25, pp. 954-961. [DOI: https://dx.doi.org/10.1038/s41591-019-0447-x]
10. Land, W.H.; Margolis, D.; Gottlieb, R.; Yang, J.Y.; Krupinski, E.A. Improving CT prediction of treatment response in patients with metastatic colorectal carcinoma using statistical learning. Int. J. Comput. Biol. Drug Des.; 2010; 3, pp. 8-15. [DOI: https://dx.doi.org/10.1186/1471-2164-11-S3-S15]
11. Muselli, M.; Ruffino, F. Consistency of Empirical Risk Minimization for Unbounded Loss Functions. Proceedings of the 15th Italian Workshop on Neural Nets; Perugia, Italy, 14–17 September 2004.
12. Zhang, X.; Ha, M.; Wu, J.; Wang, C. The Bounds on the Rate of Uniform Convergence of learning Process on Uncertainty Space. Proceedings of the 6th International Symposium on Neural Networks; Wuhan, China, 26–29 May 2009.
13. Chen, D.R.; Sun, T. Consistency of multiclass empirical risk minimization methods based on convex loss. J. Mach. Learn. Res.; 2006; 7, pp. 2435-2447.
14. Naik, A.; Edla, D.R. Lung Nodule Classification on computed tomography images using deep learning. Wirel. Pers. Commun. Vol.; 2021; 116, pp. 655-690. [DOI: https://dx.doi.org/10.1007/s11277-020-07732-1]
15. Alamgeer, M.; Mengash, H.A.; Marzouk, R.; Nour, M.K.; Hilal, A.M.; Motwakel, A.; Zamani, A.S.; Rizwanullah, M. Deep learning enables computer aided diagnosis model for lung cancer using biomedical CT images. Comput. Mater. Contin.; 2022; 73, pp. 1437-1448.
16. Nemec, U.; Heidinger, B.; Anderson, K.R.; Westmore, M.S.; VanderLaan, P.; Bankier, A.A. Software-based risk stratification of pulmonary adenocarcinomas manifesting as pure ground glass nodules on computed tomography. Eur. Radiol.; 2018; 28, pp. 235-242. [DOI: https://dx.doi.org/10.1007/s00330-017-4937-2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28710575]
17. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Toschi, N. Unsupervised Stratification in Neuroimaging through Deep Latent Embeddings. Proceedings of the 42nd Annual International Conference of the IEEE-Engineering-in-Medicine-and-Biology-Society; Montreal, Canada, 20–24 July 2020.
18. Shin, H.-C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging; 2016; 35, pp. 1285-1298. [DOI: https://dx.doi.org/10.1109/TMI.2016.2528162] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26886976]
19. Kim, D.H.; MacKinnon, T. Artificial intelligence in fracture detection: Transfer learning from deep convolutional neural networks. Clin. Radiol.; 2018; 73, pp. 439-445. [DOI: https://dx.doi.org/10.1016/j.crad.2017.11.015] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29269036]
20. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature; 2015; 521, pp. 436-444. [DOI: https://dx.doi.org/10.1038/nature14539]
21. Bengio, Y.; Lecun, Y.; Hinton, G. Deep learning for AI. Commun. ACM; 2021; 64, pp. 58-65. [DOI: https://dx.doi.org/10.1145/3448250]
22. Affonso, C.; Rossi, A.L.D.; Vieira, F.H.A.; de Leon Ferreira, A.C.P. Deep learning for biological image classification. Expert Syst. Appl.; 2017; 85, pp. 114-122. [DOI: https://dx.doi.org/10.1016/j.eswa.2017.05.039]
23. Rachapudi, V.; Devi, G.L. Improved convolutional neural network based histopathological image classification. Evol. Intell.; 2021; 14, pp. 1337-1343. [DOI: https://dx.doi.org/10.1007/s12065-020-00367-y]
24. Da Nóbrega, R.V.M.; Rebouças Filho, P.P.; Rodrigues, M.B.; da Silva, S.P.; Dourado Júnior, C.M.; de Albuquerque, V.H.C. Lung nodule malignancy classification in chest computed tomography images using transfer learning and convolutional neural networks. Neural Comput. Appl.; 2020; 32, pp. 11065-11082. [DOI: https://dx.doi.org/10.1007/s00521-018-3895-1]
25. Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Processing Syst.; 2012; 25, pp. 1097-1105. [DOI: https://dx.doi.org/10.1145/3065386]
26. Pan, S.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng.; 2010; 22, pp. 1345-1359. [DOI: https://dx.doi.org/10.1109/TKDE.2009.191]
27. Demir, F.; Bajaj, V.; Ince, M.C.; Taran, S.; Şengür, A. Surface EMG signals and deep transfer learning-based physical action classification. Neural Comput. Appl.; 2019; 31, pp. 8455-8462. [DOI: https://dx.doi.org/10.1007/s00521-019-04553-7]
28. Polat, H.; Mehr, H.D. Classification of pulmonary CT images by using hybrid 3D-deep convolutional neural network architecture. Appl. Sci.; 2019; 9, 940. [DOI: https://dx.doi.org/10.3390/app9050940]
29. Nalik, A.; Edla, D.R. Lung nodules classification using combination of CNN, second and higher order texture features. J. Intell. Fuzzy Syst.; 2021; 41, pp. 5243-5251.
30. Dutta, A.K. Detecting lung cancer using machine learning techniques. Intell. Autom. Soft Comput.; 2021; 31, pp. 1007-1023. [DOI: https://dx.doi.org/10.32604/iasc.2022.019778]
31. Shayma’a, A.H.; Sayed, M.S.; Abdalla, M.I.; Rashwan, M.A. Breast cancer masses classification using deep convolutional neural networks and transfer learning. Multimed. Tools Appl.; 2020; 79, pp. 30735-30768.
32. Cengil, E.; Nar, A. The effect of deep feature concatenation in the classification problem: An approach on COVID disease detection. Int. J. Imaging Syst. Technol.; 2021; 32, pp. 26-40. [DOI: https://dx.doi.org/10.1002/ima.22659]
33. Hou, X.X.; Xu, X.Z.; Zhu, J.; Guo, Y. Computer aided diagnosis method for breast cancer based on AlexNet and ensemble classifiers. J. Shandong Univ. Eng. Sci.; 2019; 49, pp. 74-79.
34. Ma, C.C.; Yue, S.H.; Li, Q. Deep Transfer Learning Strategy for Invasive Lung Adenocarcinoma Classification Appearing as Ground Glass Nodules. Proceedings of the 2021 IEEE International Instrumentation and Measurement Technology Conference; Glasgow, UK, 17–20 May 2021.
35. Monica, L.; Gaddis, P.D. Statistical methodology: Analysis of variance, analysis of covariance, and multivariate analysis of variance. Acad. Emerg. Med.; 1998; 5, pp. 258-265.
36. Song, X.S.; Jiang, X.Y.; Luo, J.H.; Wang, X. SVM parameter selection based on the bound of structure risk. Sci. Technol. Rev.; 2011; 29, pp. 72-74.
37. van der Vaart, A.; Wellner, J.A. A note on bounds for VC dimensions. Inst. Math. Stat. Collect.; 2009; 5, pp. 103-107. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20729995]
38. Jayadeva. Learning a hyperplane classifier by minimizing an exact bound on the VC dimension. Neurocomputing; 2015; 149, pp. 683-689. [DOI: https://dx.doi.org/10.1016/j.neucom.2014.07.062]
39. Diego, M.; Rodrigo, M. Improvement of patient classification using feature selection applied to Bi-Directional Axial Transmission. IEEE Trans. Ultrason. Ferroelectr. Freq. Control.; 2022; [DOI: https://dx.doi.org/10.1109/TUFFC.2022.3195477]
40. Mohammad, A.A.; Mwaffaq, O.; Hamza, J. Comprehensive and comparative global and local feature extraction framework for lung cancer detection using CT scan images. IEEE Access; 2021; 9, pp. 158140-158154.
41. Dai, C. SVM Visual Classification Based on Weighted Feature of Genetic Algorithm. Proceedings of the 2015 Sixth International Conference on Intelligent Systems Design and Engineering Applications; Guiyang, China, 18–19 August 2015.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Statistical Learning Theory (SLT) plays an important role in prediction estimation and machine learning when only limited samples are available. At present, how many samples are necessary for a given prediction accuracy remains an open question. In this paper, the medical diagnosis of lung cancer is taken as an example of this problem. Invasive adenocarcinoma (IA) is a main type of lung cancer, often presenting as ground glass nodules (GGNs) in patients' CT images. Accurately discriminating IA from non-IA based on GGNs has important implications for choosing the right approach to treatment. The Support Vector Machine (SVM), an application of SLT, is used to classify GGNs, whereby the relationship between generalization ability and the lower bound on the necessary sample size can be effectively recovered. To validate this relationship, 436 GGNs were collected and labeled using surgical pathology. A feature vector was then constructed for each GGN sample through the fully connected layer of AlexNet, and a 10-dimensional feature subset was selected using the p-values calculated by Analysis of Variance (ANOVA). Finally, four sets with different sample sizes were used to construct SVM classifiers. Experiments show that the theoretical estimate of minimum sample size is consistent with actual values, and that the lower bound on sample size can be solved under various generalization requirements.