Colorectal cancer is a significant global health issue, ranking as the third most common cancer and the second leading cause of cancer-related deaths worldwide. Early diagnosis of this disease is of utmost importance to increase survival rates and strengthen healthcare services. Many machine learning (ML) and deep learning (DL) methods have been proposed to facilitate automated early diagnosis of this cancer. However, label noise in medical images and dependence on a single model can lead to suboptimal performance, which could hinder the development of a sophisticated automated solution. In this paper, we address label noise in training data and propose a stacking-ensemble model for classifying colorectal cancer along with a trustworthy computer-aided diagnosis (CAD) system. First, a variety of filtering methods are extensively analyzed to determine the most suitable image representation, followed by data augmentation techniques. Second, a modified, fine-tuned VGG-16 model is proposed and utilized as a feature extractor to extract meaningful features from the training samples. Third, prediction uncertainty and the probabilistic local outlier factor (pLOF) are applied to the extracted features to address the label noise issue in the training data. Fourth, we adopt a random forest–based recursive feature elimination (RF-RFE) feature selection method with various combinations of features to recursively select the most influential ones for accurate predictions. Fifth, four base ML classifiers and a metamodel are selected to build our final stacking-ensemble model, which integrates the prediction probabilities of multiple models into a meta-feature set to ensure trustworthy predictions. Finally, we integrate these strategies and deploy them in a web application to demonstrate a CAD system. This system not only predicts the disease but also reports the prediction probabilities of each class, which enhances both clarity and diagnostic insight. Our proposed model was compared with different state-of-the-art ML classifiers on a publicly available dataset and demonstrated the highest accuracy of 92.43%.
1. Introduction
Colorectal cancer (CRC), also known as bowel cancer, typically develops as single or multiple growths in the colon or rectum [1]. The initial stage of this cancer is marked by the formation of a polyp (a small, ball-shaped growth of excess tissue) on the lining of the colon or rectum, and the majority of CRC cases are adenocarcinomas [2]. According to the World Cancer Research Fund (WCRF) and the World Health Organization (WHO), CRC is the third most common cancer in the world. Notably, it is the second most common cancer among women and in Europe [3], which underlines CRC as an alarming disease worldwide. Early tumor identification plays a crucial role in determining the survival rate of cancer patients during treatment. Encouragingly, the cure rate for CRC rises to about 90% with an early diagnosis [4]. In Europe, the 3-year survival rate is 93% when the disease is diagnosed at Dukes Stage A but drops drastically to 16% for Stage D tumors [5]. Therefore, detecting colorectal cancer at an earlier stage increases patients' survival rates and eases the burden on healthcare services. CRC is usually diagnosed through the onset of symptoms, a screening colonoscopy, or noninvasive stool-based tests, such as fecal occult blood testing, which includes fecal immunochemical tests and guaiac fecal occult blood tests. Histopathology is the process of placing tissues on a microscope slide and staining them with hematoxylin and eosin (H & E) to discern the nucleus, lumen, and other components [6]. Pathologists use gland shape, features, nucleus size, and formation as diagnostic criteria for CRC and grade malignancy based on their geometric alterations. However, these procedures are time-consuming and exhausting, leading to ocular fatigue and potentially impairing the pathologist's judgment [7–9]. In recent years, researchers have utilized a range of machine learning (ML) and deep learning (DL) [4] techniques to develop computer-aided diagnosis (CAD) systems for analyzing medical images. These advanced CAD systems are invaluable tools for radiologists and clinicians, aiding in the early detection and accurate diagnosis of various diseases, including CRC. Such a system accurately observes and characterizes histological image constituents, such as glands and goblet cells, in the epithelium to distinguish between infected and healthy structures for the diagnosis of CRC [7].
Conventional ML methods often struggle with medical image data due to complex patterns and the reliance on handcrafted features. In addition, medical image datasets are sometimes small and may contain mislabeled samples or label noise, which can negatively affect model performance and generalization. Training on such noisy or limited data may lead to poor predictive performance and reduced reliability, and relying on a single model for classification in these scenarios can result in biased or unreliable outcomes. In contrast, ensemble learning offers a more trustworthy approach to classification by leveraging the strengths of multiple models. In our study, we developed a robust framework for classifying CRC using histopathological images by integrating conventional ML methods with advanced DL techniques. We utilized a CNN to extract meaningful features from the images and then applied ML algorithms with feature selection techniques to classify these features into the different tissue classes. This hybrid approach leverages the strengths of both methods, enhancing the accuracy and reliability of our framework as an early diagnosis tool for CRC. Moreover, it reduces computational costs, making it more suitable for practical implementation in CAD systems. The overall contributions of our study are as follows:
1. We utilized various filtering methods in our preprocessing step to achieve optimal image representation and applied data augmentation techniques that effectively increased the dataset by generating different representations during training.
2. We proposed a modified VGG-16 CNN model as our feature extractor, which extracted 1024 features from input images and determined the training loss for each sample.
3. To address label noise, we estimated prediction uncertainty using cross-entropy loss and probabilistic local outlier factor (pLOF) score for each sample. We identified clean and noisy samples based on these scores and selected 70% of the clean samples as our final training data.
4. We applied feature selection using RFE, which iteratively selects the most important features from clean samples for final classification.
5. We implemented a stacking-ensemble (SEnse) learning strategy to ensure trustworthy predictions by combining the outputs of multiple individual models for the final classification.
6. Finally, we developed a publicly accessible demo web application for real-time CRC prediction using histopathological images.
The remainder of the paper is arranged as follows. A literature review is provided in Section 2. The methodology of this study, including dataset presentation and preprocessing, is discussed in Section 3; methods for uncertainty estimation with cross-entropy loss, label noise handling, feature selection, and stacking are also covered there. Experimental results are presented in Section 4. Finally, Section 5 concludes this work.
2. Literature Review
In recent years, image processing techniques and ML approaches have been proposed to identify various cancers and develop CAD systems. ML techniques empower computers to autonomously learn from image data, and numerous studies have employed these methods to advance clinical diagnosis, especially for CRC patients [10–16]. Kather et al. [17] introduced the Colorectal Histology MNIST dataset to aid clinicians in early CRC diagnosis through ML approaches. Local binary patterns (LBPs), Gabor filters, and the gray-level co-occurrence matrix (GLCM) were employed in their study to extract features from histopathological images. Four ML classifiers, including nearest neighbor (NN), linear support vector machine (SVM), radial basis function (RBF) SVM, and an ensemble of decision trees, were utilized to classify these features. The RBF-SVM demonstrated the best classification results with combined feature sets, improving the accuracy of tumor–stroma separation. However, the overall results for the eight classes did not meet expectations. Alqudah [18] introduced the 3D GLCM and applied different color spaces to the original images, including RGB, HSV, and LAB. They calculated 3D co-occurrence matrices at Distances 1 and 2 across 13 angles and combined the features extracted from these matrices. For the final classification, they employed five classifiers, including SVM, artificial neural network (ANN), K-nearest neighbor (KNN), classification decision tree (CDT), and quadratic discriminant analysis (QDA). The QDA classifier reported the highest accuracy for RGB, HSV, and LAB color space images in their study. The advancement of CNNs enables DL methods to consistently outperform conventional ML approaches, especially in achieving expert-level accuracy for image-based classification [19]. In CRC cases, CNNs have demonstrated significantly improved performance [20] by extracting prognostic indicators from images of tissue slides stained with H & E. Due to the high fatality rate of the disease, the application of DL methods in diagnosing colon cancer has received increased attention in numerous histopathological imaging studies [21–24]. However, challenges remain, including potential misclassifications between muscle and stroma classes, as well as between lymphocytes and debris/necrosis [25].
In 2017, Cassinelli et al. [26] proposed three CNN architectures to extract meaningful features from histological images, including VGG-F, VGG-S, and VGG-VD-16. They applied principal component analysis (PCA) and Gaussian random projection (GRP) techniques to reduce the dimensions of the extracted features before performing the final classification. The objective of their study was to find the trade-off between accuracy and dimensionality. To address this, a correlation-based feature selection (CBFS) method for dimensionality reduction was proposed in their study, which reported that GRP and CBFS exhibited more stable performance in terms of overall accuracy. Ciompi et al. [27] investigated the impact of stain normalization on tissue classification in H & E-stained images of CRC samples and proposed a convolutional neural network (ConvNet)–based method for CRC tissue classification. Xu et al. [25] employed transfer learning for both segmentation and multiclassification in CRC, focusing specifically on the impact of activation functions in the hidden layers. Bayramoglu and Heikkilä [28] employed different pretrained models and investigated the effectiveness of transfer learning and fine-tuning. They categorized CRC cell nuclei into three groups for the final classification: epithelial, fibroblast, and inflammatory nuclei. Ohata et al. [29] employed various CNN models and ML classifiers to classify CRC tissues, using CNN models for deep feature extraction through transfer learning and ML classifiers for the subsequent classification of the extracted features. The study aimed to identify the optimal feature extractor and classifier combination and found DenseNet-169 with SVM (RBF) to be the most effective. Five pretrained CNN models, including AlexNet, GoogleNet, SqueezeNet, VGGNet, and ResNet-50, were employed in Tsai and Tao's [30] study for classifying CRC tissues; they utilized transfer learning to enhance the classification performance. Damkliang, Wongsirichot, and Thongsuksai [31] proposed a pixel-level categorization of CRC using the VGG-16 network with the Adam optimizer. They incorporated rescaling and data augmentation as preprocessing steps; however, their model exhibited long training times and expensive computational costs. Manivannan et al. [32] employed histopathology images (HIs) of colon regions as a starting point for segmenting glandular structures. A structured learning framework was employed in their study to model how class labels are arranged spatially and to capture structural details that are frequently lost by sliding-window approaches. Chen et al. [33] proposed a weakly supervised CRC classification approach utilizing a multichannel attention mechanism. Their framework contains two stages: automatic learning (AL) and interactive learning (IL). In the AL stage, multichannel features are extracted through three attention-mechanism channels and a CNN. The IL stage employs an interactive process, continually incorporating misclassified images into the training set to enhance the model's classification ability. Wang et al. [34] proposed an innovative patch aggregation method for classifying CRC using patches of poorly labeled diseased slide images. These patches were identified based on the CNN performance, and their method was trained and validated on a large number of such patches.
Ho et al. [35] proposed a faster region–based CNN (Faster R-CNN) based on Qritive's unique composite algorithm with a ResNet-101 backbone. Their study provides both instance segmentation of glandular structures and classification. Riasatian et al. [36] introduced KimiaNet for classifying histopathological images with various configurations; their proposed model follows the DenseNet architecture and comprises four dense blocks. Yildirim and Cinar [37] proposed MA_ColonNET, a 45-layer CNN model, for classifying colon cancer. Kumar, Vishwakarma, and Bajaj [38] proposed a lightweight framework, namely, CRCCN-Net, to ensure lower computational costs than pretrained models. They compared their model with various pretrained models, including DenseNet-121, InceptionResNetV2, Xception, and VGG-16, and the experimental results demonstrated that their model requires lower computational cost for training than the employed pretrained models. Jara and Bowen [39] utilized the VGG-16 CNN model and explored various optimizers to determine the best one for classifying CRC. They employed stochastic gradient descent (SGD) + momentum, SGD + Nesterov, adaptive moment estimation (Adam), and adaptive gradient algorithm (AdaGrad) optimizers in their work. The Adam optimizer demonstrated the best performance, and they selected it for their final classification model. Zeid, El-Bahnasy, and Abo-Youssef [40] emphasized the potential applications of transformers in HIs to enhance classification accuracy. They employed a vision transformer and a compact convolutional transformer and demonstrated improved accuracy. Reis and Turk [41] employed the DenseNet-169 model to classify CRC, exploring two approaches: initially training the model with random weights and subsequently using pretrained weights. The findings indicated that using pretrained weights outperformed the random-weights approach. Khazaee Fadafen and Rezaee [42] employed a dilated ResNet (dResNet) structure and an attention module for deep feature extraction. They applied neighborhood component analysis (NCA) to reduce the computational complexity of the extracted features and, for the final classification, utilized an ensemble learning–based classifier, namely, DeepSVM.
3. Methodology
This section describes all of our employed methodologies, including the dataset, preprocessing, feature extraction, label noise handling, feature selection, and the SEnse model. Figure 1 presents a schematic diagram of our proposed methodology.
[figure(s) omitted; refer to PDF]
We have designed a system algorithm (Algorithm 1) that represents the whole procedure for classifying CRC with our framework, given as follows.
Algorithm 1: Proposed Algorithm.
Input: Histopathological image dataset with labels, split into training and test sets
Output: The assessment metrics on the test dataset.
Dataset preprocessing:
1. Apply the bilateral filter to every image
2. Apply data augmentation to the training images
3. Rescale pixel intensities to the range (0, 1)
4. Initialize the modified VGG-16 model with transfer learning
Feature extraction:
5. for each local epoch do
6. for each mini-batch of training samples do
7. Optimize model parameters by minimizing the cross-entropy loss
8. end for
9. end for
10. Extract the 1 × 1024 feature vector and the prediction loss for every training sample
Addressing label noise:
11. for each sample do
12. Compute the outlier probability with pLOF on the extracted features
13. end for
14. Sort samples by prediction loss and pLOF score in ascending order; retain the top 70% as the clean training set
Feature selection:
15. Select 40% of the features from the clean set with RF-RFE
CRC classification with the SEnse model:
16. for each base model in {SVM, KNN, RF, MLP} do
17. TrainedModel ⟵ Train(base model, selected features)
18. end for
19. for each trained base model do
20. Probabilities ⟵ PredictProbabilities(trained base model)
21. end for
22. MetaFeatures ⟵ Concatenate(probabilities of all base models)
23. TrainedMetamodel ⟵ Train(LR, MetaFeatures)
24. Pred ⟵ TrainedMetamodel(MetaFeatures of the test set)
25. Evaluation metrics ⟵ ComputeMetrics(Pred, test labels)
3.1. Dataset
In this study, we have employed the Colorectal Histology MNIST (Modified National Institute of Standards and Technology) dataset to conduct experiments with our proposed framework. This dataset consists of a collection of textures in colorectal cancer histology [43]. Figure 2 shows some sample images from this dataset. It contains 5000 histological images of human colorectal cancer covering eight tissue categories or classes: (1) tumor, (2) stroma, (3) complex, (4) lympho, (5) debris, (6) mucosa, (7) adipose tissue, and (8) empty tissue. Each class contains 625 images in RGB format with a size of 150 × 150 pixels, corresponding to about 74 × 74 μm. The images have a resolution of 0.495 μm per pixel and were digitized using an Aperio ScanScope (Aperio/Leica Biosystems) at 20x magnification. We split this dataset into training and test sets with an 80:20 ratio, resulting in a training set of 4000 images and a test set of 1000 images.
[figure(s) omitted; refer to PDF]
3.2. Data Preprocessing
The preprocessing of data involves specific steps taken before conducting experiments to enhance data quality for better outcomes. In our study, we considered various filtering methods, augmentation, and image scaling as preprocessing steps. First, we applied seven filtering methods (bilateral filter, histogram equalization, edge detection and sharpening, gamma contrast, Gaussian blur, unsharp mask, and median blur) to the original images. Subsequently, we generated distinct training and test sets for each filtering method to determine the most effective one for our dataset. These filters were applied individually, and in some cases, two of them were combined, such as histogram equalization + bilateral filter and histogram equalization + gamma contrast. After that, we applied different augmentation techniques (width shifting, height shifting, zooming, vertical flip, horizontal flip, fill mode, and rotation) to the selected filtered dataset. Finally, we scaled all training images, transforming pixel intensities into the range (0, 1) by dividing each pixel value by 255; this facilitates faster convergence during training at a lower computational cost. Some filtered images are shown in Figure 3.
[figure(s) omitted; refer to PDF]
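To make this step concrete, the following is a minimal sketch of the bilateral filtering, scaling, and augmentation pipeline described above, assuming OpenCV and Keras; the filter parameters and augmentation ranges are illustrative placeholders, not values reported in this study.

```python
# Minimal preprocessing sketch; parameter values are illustrative assumptions.
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def apply_bilateral_and_scale(image: np.ndarray) -> np.ndarray:
    """Edge-preserving bilateral filter, then scale pixel intensities to (0, 1)."""
    filtered = cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
    return filtered.astype(np.float32) / 255.0

# Augmentation with the transformations listed above (ranges are assumptions).
augmenter = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=20,
    fill_mode="nearest",
)
```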
3.3. Feature Extraction Using Modified VGG-16
The feature extraction process allows us to extract meaningful features from input images and plays a crucial role in any classification task. We have developed a modified VGG-16 [44] model that integrates feature extraction and prediction loss calculation to quantify the prediction uncertainty of a sample in a single forward pass. A transfer learning approach has been utilized in our modified model, where we have used pretrained weights of VGG-16 for convolutional layers that were trained on ImageNet [45].
Typically, VGG-16 has 13 convolutional layers, five pooling layers, and three fully connected layers. We have restructured its fully connected layers for more efficient feature extraction. The original model takes an input image with a size of 224 × 224 × 3 and generates 512 × 7 × 7 feature maps using its convolutional and pooling layers during the feature extraction process. These feature maps are then flattened into a 1D vector with a shape of 1 × 25,088 and fed to the fully connected layers for classification. Typically, the first two fully connected layers contain 1 × 4096 vectors, and the last layer contains 1000 nodes for classification. In our case, we used a 1 × 4096 vector in the first layer, a 1 × 1024 vector in the second layer, and 8 nodes in the last layer for classification (as per our number of classes). We have added a forward hook [46] at the second fully connected layer that allows us to capture the feature vector from this layer directly during the forward pass. These features have a reduced dimensionality of 1 × 1024 and are more informative as they directly contribute to the final classification. This modification enables our model to extract 1024 features and calculate prediction loss in a single forward pass by simultaneously obtaining both feature representations and softmax probabilities for each input sample. The prediction loss provides an effective measure for quantifying uncertainty, while the extracted features are used to identify outliers through pLOF, which helps determine whether a sample is clean or noisy.
We used ReLU as the activation function and Adam as the optimizer, with a learning rate of 0.001, β1 of 0.9, and ε of 1 × 10−7. This modified model was trained for 150 epochs to enhance its generalization capability. Afterward, we extracted the 1024 features and the corresponding prediction loss for each sample from the trained model for the next step.
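A minimal PyTorch sketch of this feature extractor is given below; the layer sizes follow the description above, while the pretrained-weight string (recent torchvision API) and the dropout placement are assumptions.

```python
# Sketch of the modified VGG-16 with a forward hook on the second FC layer.
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")    # ImageNet-pretrained conv layers
model.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 8),                           # 8 tissue classes
)

captured = {}
def hook(module, inputs, output):
    captured["features"] = output.detach()        # 1 x 1024 feature vector

model.classifier[3].register_forward_hook(hook)   # second fully connected layer

x = torch.randn(1, 3, 224, 224)                   # one input image
logits = model(x)                                 # single forward pass yields both
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))  # per-sample loss
```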
3.4. Prediction Uncertainty With Cross-Entropy Loss
The experimental dataset may contain label noise due to the variability and intrinsic complexity of patterns in medical images. Noisy labeled samples therefore have higher uncertainty, which hinders the model's generalization capability, as DL models prioritize learning regular patterns first. Several uncertainty methods [47, 48] have previously been employed to estimate the probability that a sample is either clean or noisy. Among them, the small-loss approach [49] is widely utilized in the literature; it holds that a sample with a larger loss has higher uncertainty and one with a smaller loss has lower uncertainty. We implemented the small-loss approach using cross-entropy loss to determine the prediction uncertainty of a sample. The cross-entropy loss utilizes prediction probabilities to calculate the loss value, measuring how closely the predicted class probabilities align with the true class labels. Noisy samples deviate from the typical patterns of their assigned classes, which leads to uncertain predictions with lower confidence scores, increasing the cross-entropy loss and indicating higher prediction uncertainty for those samples. A categorical cross-entropy loss function is utilized in our work, as our dataset contains 8 classes. A smaller cross-entropy loss indicates that the prediction is more accurate and the model is more confident. The cross-entropy loss for a sample can be defined as

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log\left(\hat{p}_c\right),$$

where $C = 8$ is the number of classes, $y_c$ equals 1 if $c$ is the true class of the sample and 0 otherwise, and $\hat{p}_c$ is the predicted softmax probability for class $c$.
3.5. Addressing Label Noise With Uncertainty and pLOF
In the context of medical images, label noise refers to annotation errors or misclassified samples, which arise from imaging equipment, image acquisition parameters, and the complex patterns of the images. Since our study is based on a dataset of medical images, it may contain label noise. The pLOF [50] is an unsupervised algorithm that provides an outlier probability for a sample, enabling us to distinguish noisy and clean samples based on this probability [51]. Basically, the pLOF value is determined by considering local neighbors centered around the sample and measuring the distance between each neighbor and the sample point. Following [50], for an object $o$ with a neighborhood (context set) $S(o)$, a probabilistic set distance is defined as

$$\mathrm{pdist}(\lambda, o, S(o)) = \lambda \sqrt{\frac{\sum_{s \in S(o)} d(o, s)^2}{|S(o)|}},$$

where $d(o, s)$ is the distance between $o$ and its neighbor $s$ and $\lambda$ is a normalization factor. These distances act like a standard deviation of the neighborhood around $o$. The pLOF value of an object $o$ is then the ratio of its probabilistic set distance to the expected value over its neighbors,

$$\mathrm{PLOF}_{\lambda, S}(o) = \frac{\mathrm{pdist}(\lambda, o, S(o))}{\mathbb{E}_{s \in S(o)}\left[\mathrm{pdist}(\lambda, s, S(s))\right]} - 1,$$

and this ratio is normalized with the Gaussian error function to yield an outlier probability in $[0, 1]$.
After applying pLOF, we calculated the pLOF score or outlier probability for all samples. To address the issue of label noise, we consider both the prediction uncertainty and the pLOF score of individual samples to determine whether a sample is noisy or clean. A clean sample is characterized by lower values of both uncertainty and pLOF score, whereas a noisy sample has higher values. Based on these criteria, we sorted the dataset in ascending order, ensuring that samples with lower outlier probabilities and loss appear at the top. From this sorted dataset, we retained the top 70% as the final training data, as they were identified as clean samples, while the remaining 30% were discarded as noisy samples due to their higher uncertainty and outlier probability. Finally, we trained our proposed SEnse model with the selected training data to ensure a more reliable and accurate model performance.
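The selection step can be sketched as follows; `losses` and `plof` are assumed to be per-sample arrays produced by the feature extractor and the pLOF estimator, and the rank-sum combination is one plausible reading of sorting by both criteria.

```python
# Sketch of clean-sample selection from per-sample loss and pLOF scores.
import numpy as np

def select_clean(features, labels, losses, plof, keep=0.70):
    # Rank each sample under both criteria (lower is cleaner), then combine.
    loss_rank = np.argsort(np.argsort(losses))
    plof_rank = np.argsort(np.argsort(plof))
    order = np.argsort(loss_rank + plof_rank)   # ascending combined rank
    n_keep = int(keep * len(order))
    clean_idx = order[:n_keep]                  # top 70% -> clean training set
    return features[clean_idx], labels[clean_idx]
```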
3.6. Feature Selection Using RF-RFE
We extracted 1024 features from each sample during the feature extraction process; a set this large may contain many irrelevant features and incur a higher computational cost. Therefore, we applied the RFE [52] method to perform feature selection on the extracted features. RFE works in a recursive manner, using a greedy algorithm to determine the most influential features that contribute to the final prediction. It involves training the estimator iteratively, selecting important features based on their coefficients or importance scores, and discarding redundant features in the next iteration [53]. This process continues recursively until the desired number of features is obtained. Mathematically, we can express RFE in the following way.
Let us assume $F = \{f_1, f_2, \ldots, f_n\}$ is the initial set of $n$ extracted features and $E$ is an estimator that, after training on $F$, assigns an importance score $w_i$ to each feature $f_i$. Furthermore, let $k$ be the desired number of features. At each iteration, the least important feature is removed, $F \leftarrow F \setminus \{f_j\}$ with $j = \arg\min_i w_i$, and $E$ is retrained on the reduced set until $|F| = k$.
The feature that has the least impact on the performance metric is removed after each iteration, and the estimator is retrained with the remaining features. The procedure iterates in this way until only the desired number of most relevant features remains. We employed an RF-RFE with the Gini criterion (a function to measure the quality of a split) in our proposed feature selection method. The RF estimator contains 1200 trees with a maximum depth of 20, a minimum of 2 samples required for a split, and 1 sample required at a leaf node. We explored various feature combinations, including 20%, 40%, 50%, 60%, and 80% of the 1024 features, to find the optimal number of features that provided the best classification result.
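A scikit-learn sketch of this step with the reported hyperparameters is shown below; the elimination step size and the variable names `X_clean` and `y_clean` (the clean training data from the previous step) are assumptions, as they are not stated in the paper.

```python
# RF-RFE sketch with the reported random forest hyperparameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rf = RandomForestClassifier(
    n_estimators=1200, criterion="gini", max_depth=20,
    min_samples_split=2, min_samples_leaf=1,
)
selector = RFE(estimator=rf, n_features_to_select=int(0.40 * 1024), step=1)
X_selected = selector.fit_transform(X_clean, y_clean)  # keeps ~410 features
```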
3.7. SEnse Technique
Stacking is an ensemble strategy in ML that merges various models to boost prediction accuracy [54]. Unlike majority voting or simple averaging, stacking [55, 56] aims to leverage the strengths of various models by training a metamodel on the outputs (predictions) of the base models. Majority voting and simple averaging treat predictions equally without fully utilizing the probabilistic outputs of individual classifiers. In stacking, the metamodel learns to interpret these prediction probabilities, effectively capturing more nuanced patterns and dependencies among features, resulting in improved predictive performance. A single model may make errors in predicting certain samples or be biased towards specific classes. In contrast, the SEnse approach utilizes the diverse learning capabilities of multiple base classifiers by combining their prediction probabilities into a new feature set for training a metamodel. This additional layer of learning helps find the optimal combination of predictions, which ensures a more precise and reliable final output than any single base model alone.
In general, SEnse learning comprises two phases: (i) base model training and (ii) metamodel training. The choice of base models and metamodel is crucial in SEnse learning. We chose four well-known ML classifiers, including SVM [57], KNN [58], RF [59], and multilayer perceptron (MLP) [60], as our base models and logistic regression (LR) [61] as our metamodel. We can represent the base models as $B = \{b_1, b_2, b_3, b_4\}$, where each $b_j$ outputs an 8-dimensional class-probability vector for an input sample; these vectors are concatenated to form the meta-feature set on which the LR metamodel is trained.
In our metamodel, we applied L2 regularization, which can reduce overfitting, with the tolerance for convergence set to 0.9. The LR estimator was trained with a fixed value of 0.7 for the inverse regularization strength parameter (C). Our LR model relied on the Newton–Cholesky solver, a specialized optimization algorithm, and we constrained the model by setting the maximum number of iterations to 100. In some circumstances, the input to the meta-learner may include not just the raw prediction probabilities but also the original input features. This approach ensures more reliable predictions compared to standalone models by mitigating the biases and errors inherent in individual classifiers. By training a metamodel on the probabilistic outputs of various base classifiers, stacking can capture complex patterns and dependencies that standalone models might miss. The metamodel effectively learns to weight and combine these probabilities, enhancing the overall robustness and generalization capability of the ensemble. We chose LR as the metamodel based on its performance compared to other estimators.
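The SEnse model can be sketched with scikit-learn's StackingClassifier as below; the LR metamodel uses the settings reported above, while the base-model hyperparameters are left at library defaults, and `X_selected`, `y_clean`, and `X_test_selected` are assumed outputs of the earlier pipeline steps.

```python
# Stacking-ensemble (SEnse) sketch: four base models + LR metamodel.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

base_models = [
    ("svm", SVC(probability=True)),   # probability=True exposes predict_proba
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier()),
    ("mlp", MLPClassifier()),
]
meta = LogisticRegression(
    penalty="l2", tol=0.9, C=0.7, solver="newton-cholesky", max_iter=100,
)
sense = StackingClassifier(
    estimators=base_models, final_estimator=meta, stack_method="predict_proba",
)
sense.fit(X_selected, y_clean)
class_probs = sense.predict_proba(X_test_selected)  # per-class probabilities
```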
4. Experimental Results
In this section, we present the experimental results of our study for classifying CRC. All of our employed approaches, including feature extraction, label noise handling, feature selection, and classification, were implemented in Python (version 3.10) with packages such as NumPy, Pandas, Scikit-learn, TensorFlow, and PyTorch. All experiments were conducted within the Google Colab environment, which provides an NVIDIA Tesla T4 GPU with 15 GB of memory and 131 GB of disk space.
4.1. Evaluation
We employed a variety of performance metrics to assess our proposed model. Accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), and AUC were utilized for evaluation. The first four are defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

while AUC is computed as the area under the receiver operating characteristic (ROC) curve.
Here, true positive (TP) denotes the total number of correctly predicted samples as belonging to their respective class and true negative (TN) denotes the total number of instances correctly predicted as not belonging to their respective class. Similarly, false positive (FP) is the count of instances incorrectly predicted as belonging to a class when they do not, and false negative (FN) is the count of instances incorrectly predicted as not belonging to a class when they actually do.
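These metrics can be computed directly with scikit-learn, as in the sketch below; `y_test`, `y_pred`, and `y_prob` are assumed to come from the SEnse model, and the one-vs-rest averaging for the multiclass AUC is an assumption, since the averaging scheme is not stated in the paper.

```python
# Metric computation sketch for the 8-class problem.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

acc = accuracy_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob, multi_class="ovr")  # y_prob: n x 8 matrix

# Macro-averaged sensitivity/specificity from the per-class confusion counts.
cm = confusion_matrix(y_test, y_pred)
tp = np.diag(cm)
fn = cm.sum(axis=1) - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - (tp + fp + fn)
sensitivity = (tp / (tp + fn)).mean()
specificity = (tn / (tn + fp)).mean()
```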
4.2. Efficiency of Bilateral Filter and Data Augmentation
We employed four distinct individual filters and three composite filter combinations on the original dataset to determine the best filtering technique for our proposed model. A series of comparative experiments was conducted on these filtering methods: the bilateral, Gaussian, median, and unsharp mask filters; the combination of histogram equalization and bilateral filtering; sharpening with edge detection; and histogram equalization with gamma contrast. In addition, we evaluated the data augmentation method on the test dataset alongside the filtering methods using our modified VGG-16. Table 1 reports the experimental results of these filtering methods with and without data augmentation.
Table 1
Performance comparison of various filtering techniques with and without data augmentation on test data using our modified VGG-16 model.
| Augmentation | Filter name | Acc | AUC | MCC | SP | SN |
| --- | --- | --- | --- | --- | --- | --- |
| False | Normal | 85.09 | 98.09 | 82.96 | 97.86 | 85.19 |
| False | Bilateral | 86.98 | 98.45 | 85.12 | 98.14 | 86.88 |
| False | Gaussian blur | 83.62 | 97.79 | 81.34 | 97.66 | 83.60 |
| False | Median blur | 86.19 | 98.14 | 84.14 | 98.02 | 86.33 |
| False | Unsharp mask | 84.08 | 98.12 | 81.95 | 97.72 | 84.43 |
| False | Histogram equalization + bilateral | 84.83 | 98.03 | 82.65 | 97.83 | 84.51 |
| False | Sharpen + edge detection | 84.94 | 98.03 | 82.78 | 97.85 | 84.65 |
| False | Histogram equalization + gamma contrast | 86.37 | 98.00 | 84.42 | 98.05 | 86.14 |
| True | Normal | 90.24 | 99.26 | 88.90 | 98.60 | 90.31 |
| True | Bilateral | 91.00 | 99.56 | 91.60 | 98.95 | 92.42 |
| True | Gaussian blur | 89.27 | 99.22 | 87.87 | 98.47 | 89.36 |
| True | Median blur | 88.86 | 99.22 | 87.30 | 98.40 | 87.30 |
| True | Unsharp mask | 90.35 | 99.35 | 89.07 | 98.62 | 89.08 |
| True | Histogram equalization + bilateral | 89.63 | 99.00 | 88.18 | 98.52 | 89.43 |
| True | Sharpen + edge detection | 90.93 | 99.39 | 89.66 | 98.70 | 90.77 |
| True | Histogram equalization + gamma contrast | 86.81 | 98.94 | 85.12 | 98.12 | 86.83 |
Note: The results demonstrate that data augmentation improved performance across all filters, with the bilateral filter outperforming the other filtering techniques.
Amongst all of our employed filtering methods, the bilateral filter demonstrated consistently higher performance both with and without data augmentation. Without data augmentation, this method achieved the highest accuracy, reaching 86.98%. It also reported the highest scores for the other evaluation metrics: 98.45%, 85.12%, 98.14%, and 86.88% for AUC, MCC, specificity, and sensitivity, respectively. It obtained 1.89% and 2.16% higher accuracy and MCC than the original dataset. The second-highest accuracy of 86.37% was reported by the histogram equalization + gamma contrast method, which is 0.61% lower than the bilateral filter. The Gaussian blur method reported the lowest accuracy of 83.62% among all methods, which is 3.36% lower than the bilateral method.
Table 1 also shows that data augmentation significantly improved the performance of all filtering methods. The bilateral filter obtained the highest accuracy of 91.00% among all of the employed methods, which is 4.02% higher than without augmentation, demonstrating consistent performance in both cases (with and without data augmentation).
The other techniques, including the original images, Gaussian blur, median blur, unsharp mask, histogram equalization + bilateral, sharpen + edge detection, and histogram equalization + gamma contrast, also reported improved accuracies of 90.24%, 89.27%, 88.86%, 90.35%, 89.63%, 90.93%, and 86.81%, respectively. Data augmentation allows CNN models to learn from different image representations and provide better classification results. Even the accuracy of the median blur technique with augmentation (88.86%) is 3.77% and 1.88% higher than the unaugmented original and bilateral datasets, respectively. After applying data augmentation, VGG-16 also reported improved scores for the other evaluation metrics. The best-performing bilateral filter reported scores of 99.56%, 91.60%, 98.95%, and 92.42% for AUC, MCC, specificity, and sensitivity, respectively, which are 1.11%, 6.48%, 0.81%, and 5.54% higher than without data augmentation.
We also present a line plot in Figure 4 comparing results across the different filtering methods with data augmentation. From the graph, it is evident that the line for the bilateral filter lies above those of the other employed methods. The efficiency of the bilateral filter is attributed to its dual consideration of spatial distance and intensity similarity during filtering, which helps preserve sharp edges and important structures while effectively reducing noise. This method is particularly effective at reducing Gaussian noise, a common problem in image processing, while preserving edges and important details. This not only improves classification results by providing cleaner and more accurate image data but also enhances the overall robustness of the model. As a result, we selected the bilateral filter as the final filtering technique for our proposed methods. In addition, we illustrate a Grad-CAM [62] visualization using our feature extractor (modified VGG-16) to provide a visual explanation. Grad-CAM images give a clear indication of the regions the CNN considers most important; a heatmap is generated from the feature maps of the last convolutional layer, highlighting the most relevant parts of an image. Figure 5 presents the Grad-CAM images along with the original images for each class.
[figure(s) omitted; refer to PDF]
4.3. Efficiency of Addressing Label Noise
We conducted two distinct experiments to determine the effectiveness of our proposed methods in addressing label noise. Initially, we introduced 5% synthetic label noise into the selected training data to create a noisy dataset. To assess the impact of our proposed methods, we trained and validated four well-known ML models (KNN, SVM, RF, and MLP) along with our proposed SEnse model using 5-fold cross-validation. The experimental results on both the noisy dataset (with label noise) and the clean dataset (original training data) are presented in Tables 2 and 3. Furthermore, we compared our proposed method for addressing label noise with outlier-based and loss-based approaches, as shown in Table 3. The main goal of this experiment is to determine how label noise negatively affects model performance and to assess the efficiency of our proposed method in comparison to existing techniques.
Table 2
Performance of different classifiers on the test and validation data after introducing 5% synthetic label noise into the clean dataset.
| Classifier | Test Acc | Val Acc | MCC | AUC | SP | SN |
| --- | --- | --- | --- | --- | --- | --- |
| KNN | 86.4 | 88.8 | 84.47 | 95.83 | 97.22 | 86.65 |
| SVM | 86.6 | 89.3 | 84.69 | 97.96 | 97.64 | 86.86 |
| RF | 86.7 | 89.2 | 84.80 | 98.01 | 97.75 | 86.97 |
| MLP | 84.3 | 88.5 | 82.09 | 97.60 | 95.05 | 84.68 |
| SEnse | 87.0 | — | 85.14 | 97.61 | 96.86 | 87.24 |
Table 3
Experimental results on the clean dataset (70% selected samples) after addressing label noise using different techniques.
| Technique | Classifier | Test Acc | Val Acc | MCC | AUC | SP | SN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Loss approach | KNN | 90.28 | 92.35 | 88.83 | 96.96 | 97.83 | 91.26 |
| Loss approach | SVM | 89.85 | 92.53 | 88.42 | 99.31 | 97.51 | 90.70 |
| Loss approach | RF | 90.14 | 93.10 | 88.69 | 99.20 | 97.76 | 91.12 |
| Loss approach | MLP | 88.28 | 92.21 | 86.94 | 98.00 | 97.33 | 89.33 |
| Loss approach | SEnse | 90.71 | — | 89.33 | 99.22 | 97.72 | 91.59 |
| pLOF | KNN | 88.57 | 90.61 | 86.86 | 96.47 | 97.15 | 89.73 |
| pLOF | SVM | 89.86 | 92.53 | 88.42 | 99.29 | 97.51 | 90.70 |
| pLOF | RF | 90.14 | 93.07 | 88.69 | 99.19 | 97.77 | 91.11 |
| pLOF | MLP | 89.29 | 92.93 | 87.73 | 98.92 | 96.43 | 90.29 |
| pLOF | SEnse | 91.14 | — | 89.82 | 99.03 | 97.37 | 91.94 |
| DBSCAN | KNN | 78.86 | 83.24 | 75.87 | 92.09 | 91.10 | 79.07 |
| DBSCAN | SVM | 84.14 | 84.14 | 82.01 | 97.66 | 96.79 | 84.38 |
| DBSCAN | RF | 83.86 | 88.54 | 81.60 | 97.26 | 97.31 | 84.08 |
| DBSCAN | MLP | 79.00 | 82.29 | 76.14 | 95.57 | 98.09 | 79.21 |
| DBSCAN | SEnse | 85.14 | — | 83.03 | 96.30 | 96.13 | 85.14 |
| Loss + pLOF | KNN | 91.43 | 93.03 | 90.15 | 98.36 | 97.48 | 92.22 |
| Loss + pLOF | SVM | 91.28 | 93.64 | 89.99 | 99.28 | 97.22 | 92.05 |
| Loss + pLOF | RF | 91.14 | 93.39 | 89.80 | 99.34 | 97.48 | 91.95 |
| Loss + pLOF | MLP | 89.71 | 93.14 | 88.24 | 99.19 | 96.38 | 90.69 |
| Loss + pLOF | SEnse | 91.57 | — | 90.31 | 99.18 | 97.35 | 92.31 |
Note: Various classifiers were evaluated on the test dataset using 5-fold cross-validation, and their performances were compared across the loss approach, pLOF, DBSCAN, and loss + pLOF.
As shown in Table 2, the SEnse model obtained the highest accuracy of 87.00% on the noisy dataset among all classifiers. The other classifiers, KNN, SVM, RF, and MLP, reported accuracies of 86.4%, 86.6%, 86.7%, and 84.3%, respectively, which are 0.59%, 0.40%, 0.29%, and 2.70% lower than the proposed SEnse model. Although the SEnse model exhibited superior performance to the other classifiers on most evaluation metrics, it reported a slightly lower specificity score than all classifiers except MLP.
Table 3 presents the experimental results for addressing label noise using different methods: the loss-based approach, the outlier-based approaches (pLOF and DBSCAN), and a combination of loss and pLOF. The loss-based approach identifies noisy labeled samples based on uncertainty, while outlier-based approaches detect noisy samples based on their neighborhood representations. The results show that combining the uncertainty- and outlier-based methods improves model performance compared with using either alone. The average test accuracy of all classifiers for the loss approach, pLOF, DBSCAN, and the combined loss + pLOF approach is 89.85%, 89.80%, 82.20%, and 91.03%, respectively. The DBSCAN-based approach performed worse than the other methods when dealing with the 5% synthetic label noise, whereas the other approaches significantly improved model performance, with our combined approach achieving the highest classification performance across all evaluation metrics. These results demonstrate the effectiveness of our proposed method in addressing label noise.
All employed classifiers, along with our proposed SEnse model, demonstrated superior performance in terms of test accuracy, validation accuracy, MCC, AUC, specificity, and sensitivity on the clean dataset obtained with our proposed method compared with the noisy dataset. KNN, SVM, RF, and MLP obtained accuracies of 91.43%, 91.28%, 91.14%, and 89.71%, respectively, which are 5.03%, 4.68%, 4.44%, and 5.41% higher than on the noisy dataset. The lowest performance was reported by the MLP classifier on both the noisy and clean datasets. The average MCC, AUC, specificity, and sensitivity of all classifiers on the noisy dataset are 84.23%, 97.40%, 96.90%, and 86.48%, respectively; the average MCC, AUC, and sensitivity remain below the corresponding scores of even the lowest-performing classifier on the clean dataset. This clearly demonstrates that label noise strongly degrades the model during training. We also include a t-SNE plot in Figure 6 to visually demonstrate the positive impact of addressing label noise, plotting the noisy and clean datasets after dimensionality reduction. Several outliers are present in the noisy dataset, while the clean dataset contains far fewer, indicating that our method successfully addressed the label noise issue in the training data.
[figure(s) omitted; refer to PDF]
The SEnse model demonstrated consistently superior performance on both the noisy and clean datasets. It reported the highest accuracy of 91.57% among all classifiers on the clean dataset, which is 0.14%, 0.29%, 0.43%, and 1.86% higher than KNN, SVM, RF, and MLP, respectively, and it also reported the highest scores on the other evaluation metrics. Notably, the performance of the SEnse model increased significantly after addressing label noise: 4.57%, 5.17%, 1.57%, 0.49%, and 5.07% higher than on the noisy dataset in terms of accuracy, MCC, AUC, specificity, and sensitivity, respectively. The SEnse model combines the prediction probabilities into a meta-feature set and classifies this meta-feature set with a metamodel; this design allows it to maintain robust performance. In addition, we present a violin plot in Figure 7 illustrating the performance of the SEnse model on both the clean and noisy datasets. The plots for the clean dataset are wider than those for the noisy dataset, confirming that addressing label noise enhances the performance of our proposed model across all evaluation metrics. From the experimental results and our investigation, it is clear that the SEnse model is the best performer and that addressing label noise with our proposed method significantly benefits the classifiers.
[figure(s) omitted; refer to PDF]
4.4. Efficiency of Feature Selection
In this section, we have analyzed the experimental results of three feature selection techniques (LIME, MDA, and RFE) on the selected clean dataset after addressing label noise. The purpose of these experiments was to investigate the influence and efficiency of each technique with different portions of features (20%, 40%, 50%, 60%, and 80%). The performance of each feature selection technique allows us to determine the most effective technique with the optimal portion of features. Table 4 presents the experimental results.
Table 4
Experimental results for feature selection techniques with various combinations of features on the clean dataset.
| Technique | No. features (%) | Test Acc | MCC | AUC | SP | SN |
| --- | --- | --- | --- | --- | --- | --- |
| LIME | 20 | 91.29 | 89.99 | 99.27 | 97.22 | 92.06 |
| LIME | 40 | 91.57 | 90.32 | 99.02 | 97.07 | 92.31 |
| LIME | 50 | 91.86 | 90.63 | 99.13 | 97.48 | 92.57 |
| LIME | 60 | 91.14 | 89.82 | 98.76 | 97.48 | 91.94 |
| LIME | 80 | 91.28 | 89.98 | 89.99 | 97.22 | 92.07 |
| MDA | 20 | 91.43 | 90.14 | 99.29 | 97.36 | 92.21 |
| MDA | 40 | 91.85 | 90.67 | 99.04 | 97.34 | 92.60 |
| MDA | 50 | 90.71 | 89.34 | 99.03 | 96.94 | 91.55 |
| MDA | 60 | 90.14 | 88.68 | 98.98 | 96.92 | 91.04 |
| MDA | 80 | 90.43 | 89.01 | 98.60 | 96.92 | 91.32 |
| RFE | 20 | 92.00 | 90.80 | 99.18 | 97.63 | 92.71 |
| RFE | 40 | 92.43 | 91.31 | 99.35 | 97.78 | 93.07 |
| RFE | 50 | 92.29 | 91.13 | 99.31 | 97.65 | 92.93 |
| RFE | 60 | 92.14 | 90.96 | 99.16 | 97.78 | 92.79 |
| RFE | 80 | 92.14 | 90.96 | 99.04 | 97.78 | 92.79 |
The three techniques produced distinct experimental results on the same dataset owing to their different underlying algorithms. Table 4 shows that the average test accuracies of LIME, MDA, and RFE are 91.42%, 90.91%, and 92.19%, respectively. Among them, the RFE technique reported the highest average test accuracy, which is 0.77% and 1.28% higher than LIME and MDA, respectively. RFE also reported the highest scores on the other evaluation metrics, including MCC, AUC, specificity, and sensitivity. A notably higher score across all evaluation metrics is achieved when 40% of the features are selected by RFE, compared with all other techniques and their respective feature proportions.
LIME and MDA reported their highest accuracies of 91.86% and 91.85% for 50% and 40% of the selected features, respectively. For 20% of the selected features, all techniques showed lower scores across the evaluation metrics on our clean dataset. Scores then increased with up to 40%–50% of the features selected, followed by a subsequent decline at 60% and 80%. This pattern suggests that approximately half of the features have a more significant influence on the final classification than either smaller or larger feature subsets. RFE with 40% of the features obtained the highest accuracy of 92.43%, which is 1.01%, 1.52%, and 0.24% higher than the average accuracy of LIME, MDA, and RFE, respectively. It also obtained the highest scores of 91.31%, 99.35%, 97.78%, and 93.07% for MCC, AUC, specificity, and sensitivity. The second-highest scores among all three techniques were also reported by RFE, with 50% of the selected features. The highest accuracies of LIME and MDA are still 0.57% and 0.58% lower than the highest score of RFE, and RFE with 40% of the features exhibited 0.68% and 0.64% higher MCC than the best scores obtained by LIME and MDA, respectively. Overall, RFE with 40% of the features demonstrated superior performance across all evaluation metrics compared to the other techniques. We also present a bar plot for all feature selection techniques in Figure 8, where RFE shows the highest classification results.
[figure(s) omitted; refer to PDF]
RFE employs a recursive elimination strategy, enabling it to efficiently discard redundant features and retain only those with a genuine impact on the final classification. Based on our experimental results and investigation, we selected RFE as the preferred feature selector and 40% of the features as the optimal proportion for our proposed model.
4.5. Web Application for Real-Time Prediction
We have deployed our proposed model as a web application that provides real-time classification of colorectal cancer from histological images. The application was built with Gradio [63], an open-source Python package widely used to wrap ML models in web interfaces and APIs. This package allows the creation of a customized user interface (UI) and serves the model as a web application for real-time classification.
Initially, the user uploads an image via the UI, which serves as the input for classification. The input image then undergoes resizing and bilateral filtering. Subsequently, it is passed through the pretrained CNN to extract a feature vector comprising 1024 distinctive features. These extracted features are further processed by the RFE feature selector, which retains the 40% most influential features. The selected features are then passed to the loaded SEnse model for prediction. The web application outputs the predicted class and the predicted probabilities for each class alongside the input image. After developing the application, we deployed it on Hugging Face [64], an open platform that provides a publicly accessible URL, enabling users to easily access hosted web applications. The application can be accessed at the following link: https://huggingface.co/spaces/NazmulPanto/CRC.
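A minimal Gradio sketch of this flow is shown below; `preprocess`, `extract_features`, `selector`, and `sense` are hypothetical stand-ins for the components described in the previous sections, not names from the deployed application.

```python
# Gradio sketch of the real-time CRC prediction flow (helper names are
# hypothetical stand-ins for the components built earlier).
import gradio as gr

CLASSES = ["tumor", "stroma", "complex", "lympho",
           "debris", "mucosa", "adipose", "empty"]

def classify(image):
    x = preprocess(image)              # resize + bilateral filter + scaling
    f = extract_features(x)            # 1 x 1024 vector from modified VGG-16
    f = selector.transform(f)          # keep the 40% RFE-selected features
    probs = sense.predict_proba(f)[0]  # stacking-ensemble class probabilities
    return dict(zip(CLASSES, probs.tolist()))

demo = gr.Interface(fn=classify, inputs=gr.Image(), outputs=gr.Label())
demo.launch()
```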
4.6. Execution Time Analysis
In this section, we analyzed the execution time of state-of-the-art ML classifiers, including our proposed SEnse model, as well as the performance of different CNN models with our proposed framework for CRC classification. Our framework combines a modified VGG-16 for feature extraction and the SEnse model for classification, while traditional CNN architectures use fully connected layers for classification.
The experimental results in Table 5 show that the SEnse model has a higher execution time than the other classifiers (KNN, SVM, RF, and MLP). However, the SEnse model consists of five components, four base learners and a meta-learner, so its execution time reflects the cumulative time taken by these five models. We conducted the analysis both before and after applying the feature selection method. Before feature selection, the SEnse model's execution time was 0.0459 s, which is 27.83% faster than the total execution time of the four other classifiers combined (0.0636 s). Applying feature selection effectively reduced the execution time across all classifiers, highlighting the computational efficiency of the technique: KNN, SVM, RF, MLP, and SEnse achieved 77.27%, 43.14%, 63.82%, 52.68%, and 72.11% faster execution, respectively. These results demonstrate a significant improvement in computational efficiency after feature selection.
Table 5
Execution time analysis for both ML classifiers and CNN models with our proposed model (measured in seconds).
| ML classifier | Time (1024 feat) (s) | Time (40% feat) (s) | CNN model | Time (s) |
| --- | --- | --- | --- | --- |
| KNN | 0.0176 | 0.0040 | Inception-V3 | 0.4851 |
| SVM | 0.0102 | 0.0058 | ResNet-50 | 0.3864 |
| RF | 0.0246 | 0.0089 | DenseNet-121 | 0.6556 |
| MLP | 0.0112 | 0.0053 | Xception | 0.3259 |
| SEnse | 0.0459 | 0.0128 | VGG-16 + SEnse | 0.1203 |
We also compared our overall framework with several state-of-the-art CNN models that perform both feature extraction and classification. While traditional CNN models use convolutional layers for feature extraction followed by fully connected layers for classification, our proposed framework utilizes a modified VGG-16 model for feature extraction and employs the SEnse model for classification. The experimental results indicate that our framework achieves the lowest execution time among all tested models. Specifically, our proposed model provided 75.20%, 68.87%, 81.65%, and 63.11% faster execution times compared to Inception-V3, ResNet-50, DenseNet-121, and Xception, respectively. These findings confirm that our framework is more computationally efficient than the other CNN models while maintaining robust performance.
4.7. Statistical Analysis
In this section, we conducted a statistical analysis using a paired t-test to assess the significance of our methodology on model performance. First, we applied the test across all evaluation metrics to compare model performance with and without data augmentation, using results obtained from multiple filtering techniques. This comparison evaluates the impact of data augmentation on model performance. After that, we assessed the effect of label noise by comparing each classifier’s performance on noisy and clean datasets using the same metrics. Finally, we compared our proposed method for addressing label noise with three alternative techniques across all evaluation metrics to assess its significance. A p value less than 0.05 indicates a statistically significant difference, while a p value greater than 0.05 suggests that the difference is not statistically significant.
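The paired t-test itself is a one-liner with SciPy, as in the sketch below; the two arrays are assumed to hold the same metric (e.g., accuracy) for the same filtering techniques measured with and without augmentation.

```python
# Paired t-test sketch: one metric measured under two paired conditions.
from scipy.stats import ttest_rel

t_stat, p_value = ttest_rel(metric_with_aug, metric_without_aug)
significant = p_value < 0.05   # reject the null hypothesis at the 5% level
```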
Table 6 shows that the statistical test on the data augmentation technique demonstrated a significant improvement in model performance across all evaluation metrics. The p values for each metric are far below 0.05, providing strong evidence to reject the null hypothesis and indicating that the observed differences are unlikely to be due to random chance. Notably, the highly significant p value for AUC reflects a reliable increase in AUC scores following the application of data augmentation. Addressing label noise also led to considerable improvements in both individual and ensemble classifiers. The KNN, RF, MLP, and SEnse models reported p values of 0.0226, 0.0459, 0.0195, and 0.0260, respectively, all below the 0.05 threshold, confirming a significant difference when trained with clean data. However, the SVM model yielded a p value of 0.0511, which is slightly above the threshold and suggests that its difference is not statistically significant.
Table 6
Statistical analysis of model performance using paired t-tests across evaluation metrics under data augmentation, label noise handling, and employed methods for addressing label noise (with p values).
| Metric | Augmentation p value | Model | Label noise p value | Metric | vs. loss p value | vs. pLOF p value | vs. DBSCAN p value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ACC | 0.0004 | KNN | 0.0226 | ACC | 0.0005 | 0.0526 | 0.0018 |
| AUC | ∼0.001 | SVM | 0.0511 | AUC | 0.1664 | 0.2372 | 0.0158 |
| MCC | 0.0002 | RF | 0.0459 | MCC | 0.0002 | 0.0537 | 0.0019 |
| SP | 0.0003 | MLP | 0.0195 | SP | 0.0241 | 0.6042 | 0.3934 |
| SN | 0.0008 | SEnse | 0.0260 | SN | 0.0014 | 0.0500 | 0.0014 |
Our proposed method for addressing label noise also showed significant performance improvements when compared to the three alternative techniques: a loss-based approach, the pLOF method, and DBSCAN. In these comparisons, we assessed each technique’s performance across all classifiers and evaluation metrics. Compared to the loss-based approach, our method reported significantly better results, with p values below 0.05 for all metrics except AUC. While the pLOF method performed well on AUC and SP, our approach demonstrated higher significance on ACC, MCC, and SN. In comparison to DBSCAN, our method produced lower p values across all metrics, except for SP. In conclusion, the statistical analysis confirms that our methodologies positively and significantly influence model performance. These evaluations were not limited to our proposed model but were also validated against state-of-the-art models, further demonstrating improved performance through statistical significance.
4.8. Comparison With Existing Methods
In this section, we conducted a comparative analysis between our proposed model and existing models designed for classifying colorectal cancer on the Colorectal Histology MNIST dataset, using accuracy as the evaluation criterion. We assessed the performance of our proposed model against the models developed by Kather et al. [17], Rizalputri et al. [65], Murashova and Colbry [66], Kumar et al. [67], Tripathi et al. [68], and Ohata et al. [69]. The comparative results are presented in Table 7.
Table 7
Comparison of our proposed model with existing models trained and evaluated on the Colorectal Histology MNIST dataset.
| Author | Techniques | Accuracy (in %) |
| Murashova and Colbry [66] | Augmented subsample labeling + transfer learning | 70.10 |
| Kumar et al. [67] | Few-shot learning image generation + transformer-based controllable fusion block (CBF) with cross-attention | 72.50 |
| Rizalputri et al. [65] | Feature extraction using CNN + classification using ML | 82.20 |
| Kather et al. [17] | Texture analysis for feature extraction + classification with ML classifiers | 87.40 |
| Tripathi et al. [68] | Classification using ML classifiers | 90.67 |
| Ohata et al. [69] | Feature extraction using CNN + classification using ML classifiers | 91.99 |
| Proposed | Feature extraction using VGG-16 + addressed label noise + reliable classification with stacking-ensemble | 92.43 |
As illustrated in Table 7, Murashova and Colbry [66], Kumar et al. [67], and Rizalputri et al. [65] reported relatively low accuracy scores in their respective studies. Kather et al. [17] achieved an accuracy of 87.40% with their best ML-based approach, which is 5.03% lower than our proposed model. Rizalputri et al. [65] obtained their best accuracy of 82.20% using a CNN, 10.23% lower than ours, while Murashova and Colbry [66] reported an accuracy 22.33% lower than our model. Tripathi et al. [68] achieved 90.67%, which is still 1.76% below ours. Ohata et al. [69] reported an accuracy only slightly lower than our model, but they relied on transfer learning for feature extraction followed by a single ML classifier. In comparison, our methodology not only ensures robust performance but also provides more reliable predictions through the SEnse technique. Our proposed hybrid model achieved the highest accuracy among all compared models, at 92.43%. This superior performance is attributable to the series of techniques we developed and implemented: preprocessing so that the model learns from the best possible image representation during training; addressing label noise to remove mislabeled samples and enhance robustness; and exploring three feature selection techniques in various combinations to determine the most effective way to select the features extracted by VGG-16, which improves performance at a lower computational cost. Finally, the stacking of multiple ML models in an ensemble makes our model more robust and reliable. This comprehensive strategy enables our proposed model to outperform all existing models in the accurate classification of colorectal cancer on the Colorectal Histology MNIST dataset.
4.9. Limitations and Future Works
Our proposed methodology demonstrated robust performance compared with several state-of-the-art classifiers and existing studies in the literature for addressing label noise and classification. However, this study was specifically designed for classifying CRC from histopathological images, with training and evaluation performed on the Colorectal Histology MNIST dataset [43]. Owing to limited computational resources, we could not evaluate the methodology on larger-scale medical image datasets. In addition, while we developed a demo web application to showcase its practical utility as a CAD system, we were unable to gather feedback from clinical experts, which could have provided valuable insights for refining the system.
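As a sketch of what such a demo looks like, the snippet below wires a placeholder predictor into a Gradio interface that returns per-class probabilities. The class names follow the eight tissue types of the Colorectal Histology MNIST dataset, and predict is a stand-in for our actual preprocessing and stacking-ensemble pipeline, not the deployed implementation.

```python
# Hedged sketch of the demo web app; `predict` is a placeholder for the
# real bilateral-filtering + VGG-16 + stacking-ensemble pipeline.
import gradio as gr
import numpy as np

CLASS_NAMES = ["tumor", "stroma", "complex", "lympho",
               "debris", "mucosa", "adipose", "empty"]

def predict(image):
    # Placeholder probabilities; the deployed app uses the trained model.
    probs = np.random.dirichlet(np.ones(len(CLASS_NAMES)))
    return {name: float(p) for name, p in zip(CLASS_NAMES, probs)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),  # displays prediction probabilities
    title="CRC Histology CAD Demo",
)

if __name__ == "__main__":
    demo.launch()
```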
In the future, we aim to enhance our approach by exploring advanced uncertainty quantification techniques to improve the robustness of our model against label noise. Our current methodology combines a loss-based uncertainty approach with outlier detection; however, we recognize the potential for further development in this area. We also intend to investigate ensemble learning strategies to ensure trustworthy predictions and enhance model reliability. By evaluating our methodology on a wider variety of medical image datasets, we aspire to enable robust training and reliable predictions, ultimately facilitating practical implementation in clinical settings and supporting clinical decision-making through visual explanations.
5. Conclusions
The diagnosis of CRC from histological images is a challenging and time-consuming task, primarily due to the scarcity of expert pathologists and the labor-intensive nature of the process. In this paper, we proposed a robust framework that combines ML and DL approaches with a SEnse technique to classify CRC from histological slides. After analyzing different filtering methods, we selected bilateral filtering for input image conversion, as it yielded the highest classification accuracy among the methods tested. We also applied data augmentation and rescaling during preprocessing so that the model learns from varied representations and converges faster. A modified VGG-16 was proposed for feature extraction, in which transfer learning was utilized and the model was trained with parameter tuning. We addressed the label noise issue and selected clean samples for training our final model, preventing the model from absorbing irrelevant bias during learning; trained on clean samples, the model yields a more trustworthy diagnosis. RF-RFE retaining 40% of the features was selected as the feature selection technique based on its superior performance on the test data. Our SEnse technique employs four ML classifiers as base learners and combines their prediction probabilities into a meta-feature set. This distinctive step within our proposed framework ensures the reliability of the model for diagnosing cancer. Finally, we developed a web application that demonstrates real-time prediction from an input histological CRC slide image. It provides accurate and trustworthy classification, enabling faster cancer diagnosis and advancing healthcare services in rural areas.
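For concreteness, the following scikit-learn sketch mirrors the pipeline summarized above under stated assumptions: RFE with a random forest retaining 40% of the features, the four base learners named in our experiments, and a logistic-regression metamodel. The hyperparameters shown are illustrative defaults, not our tuned values.

```python
# Hedged sketch of the RF-RFE + stacking-ensemble (SEnse) pipeline.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# RF-RFE: recursively drop the least important features until 40% remain.
rf_rfe = RFE(estimator=RandomForestClassifier(n_estimators=100),
             n_features_to_select=0.4)

base_learners = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),   # probabilities feed the meta-features
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("mlp", MLPClassifier(max_iter=500)),
]

# The base learners' class probabilities form the meta-feature set
# consumed by the metamodel (assumed here to be logistic regression).
sense = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba",
                           cv=5)

model = make_pipeline(rf_rfe, sense)
# Usage: model.fit(X_train, y_train); model.predict_proba(X_test)
```

With stack_method="predict_proba", scikit-learn trains the metamodel on out-of-fold class probabilities from the base learners, which matches the meta-feature construction described in our framework.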
Author Contributions
S.M.H.M. and M.T.A. conceptualized the research study. S.M.H.M., M.T.A., and I.Z.T. developed the methodology. M.T.A. and M.N.I. developed the software. S.M.H.M., K.O.M.G., and D.N. assisted with validation. I.Z.T., M.T.A., and M.N.I. wrote the manuscript. S.M.H.M. and K.O.M. reviewed and edited the draft manuscript. S.M.H.M. supervised and provided advice on the research study. All authors have read and agreed to the final version of the manuscript. I.Z.T. and M.N.I. contributed equally.
Funding
This work was supported by the Multimedia University (MMU) IR Fund (Project ID: MMUI/220041).
[1] M. S. Hossain, H. Karuniawati, A. A. Jairoun, "Colorectal Cancer: A Review of Carcinogenesis, Global Epidemiology, Current Challenges, Risk Factors, Preventive and Treatment Strategies," Cancers, vol. 14 no. 7, DOI: 10.3390/cancers14071732, 2022.
[2] I. Mármol, C. Sánchez-de-Diego, A. Pradilla Dieste, E. Cerrada, M. J. Rodriguez Yoldi, "Colorectal Carcinoma: A General Overview and Future Perspectives in Colorectal Cancer," International Journal of Molecular Sciences, vol. 18 no. 1, DOI: 10.3390/ijms18010197, 2017.
[3] J. Ferlay, M. Colombet, I. Soerjomataram, "Cancer Incidence and Mortality Patterns in Europe: Estimates for 40 Countries and 25 Major Cancers in 2018," European Journal of Cancer, vol. 103, pp. 356-387, DOI: 10.1016/j.ejca.2018.07.005, 2018.
[4] S. Ghosh, A. Bandyopadhyay, S. Sahay, R. Ghosh, I. Kundu, K. C. Santosh, "Colorectal Histology Tumor Detection Using Ensemble Deep Neural Network," Engineering Applications of Artificial Intelligence, vol. 100, DOI: 10.1016/j.engappai.2021.104202, 2021.
[5] P. Vega, F. Valentin, J. Cubiella, "Colorectal Cancer Diagnosis: Pitfalls and Opportunities," World Journal of Gastrointestinal Oncology, vol. 7 no. 12, DOI: 10.4251/wjgo.v7.i12.422, 2015.
[6] A. Banwari, N. Sengar, M. K. Dutta, "Image Processing Based Colorectal Cancer Detection in Histopathological Images," International Journal of E-Health and Medical Communications, vol. 9 no. 2, DOI: 10.4018/ijehmc.2018040101, 2018.
[7] M. Y. Ahmad, A. Mohamed, Y. A. Yusof, S. A. Ali, "Colorectal Cancer Image Classification Using Image Pre-Processing and Multilayer Perceptron," 2012 International Conference on Computer & Information Science (ICCIS), vol. 1, pp. 275-280, 2012.
[8] J. G. Elmore, G. M. Longton, P. A. Carney, "Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens," JAMA, vol. 313 no. 11, pp. 1122-1132, DOI: 10.1001/jama.2015.1405, 2015.
[9] K. Bera, K. A. Schalper, D. L. Rimm, V. Velcheti, A. Madabhushi, "Artificial Intelligence in Digital Pathology—New Tools for Diagnosis and Precision Oncology," Nature Reviews Clinical Oncology, vol. 16 no. 11, pp. 703-715, DOI: 10.1038/s41571-019-0252-y, 2019.
[10] L. D. Tamang, B. W. Kim, "Deep Learning Approaches to Colorectal Cancer Diagnosis: A Review," Applied Sciences, vol. 11 no. 22, DOI: 10.3390/app112210982, 2021.
[11] T. Tamaki, J. Yoshimuta, M. Kawakami, "Computer-Aided Colorectal Tumor Classification in NBI Endoscopy Using Local Features," Medical Image Analysis, vol. 17 no. 1, pp. 78-100, DOI: 10.1016/j.media.2012.08.003, 2013.
[12] Y. Kominami, S. Yoshida, S. Tanaka, "Computer-Aided Diagnosis of Colorectal Polyp Histology by Using a Real-Time Image Recognition System and Narrow-Band Imaging Magnifying Colonoscopy," Gastrointestinal Endoscopy, vol. 83 no. 3, pp. 643-649, DOI: 10.1016/j.gie.2015.08.004, 2016.
[13] A. F. Swager, F. van der Sommen, S. R. Klomp, "Computer-Aided Detection of Early Barrett’s Neoplasia Using Volumetric Laser Endomicroscopy," Gastrointestinal Endoscopy, vol. 86 no. 5, pp. 839-846, DOI: 10.1016/j.gie.2017.03.011, 2017.
[14] M. Min, S. Su, W. He, Y. Bi, Z. Ma, Y. Liu, "Computer-Aided Diagnosis of Colorectal Polyps Using Linked Color Imaging Colonoscopy to Predict Histology," Scientific Reports, vol. 9 no. 1, DOI: 10.1038/s41598-019-39416-7, 2019.
[15] S. V. Ambedkar, A Machine Learning Approach to Colorectal Cancer Screening, Doctoral Dissertation.
[16] M. C. Hornbrook, R. Goshen, E. Choman, "Early Colorectal Cancer Detected by Machine Learning Model Using Gender, Age, and Complete Blood Count Data," Digestive Diseases and Sciences, vol. 62 no. 10, pp. 2719-2727, DOI: 10.1007/s10620-017-4722-8, 2017.
[17] J. N. Kather, C. A. Weis, F. Bianconi, "Multi-Class Texture Analysis in Colorectal Cancer Histology," Scientific Reports, vol. 6 no. 1, DOI: 10.1038/srep27988, 2016.
[18] A. M. Alqudah, A. Alqudah, "Improving Machine Learning Recognition of Colorectal Cancer Using 3D GLCM Applied to Different Color Spaces," Multimedia Tools and Applications, vol. 81 no. 8, pp. 10839-10860, DOI: 10.1007/s11042-022-11946-9, 2022.
[19] N. K. Chauhan, K. Singh, "A Review on Conventional Machine Learning vs Deep Learning," 2018 International Conference on Computing, Power and Communication Technologies (GUCON), pp. 347-352, 2018.
[20] D. Bychkov, N. Linder, R. Turkki, "Deep Learning Based Tissue Analysis Predicts Outcome in Colorectal Cancer," Scientific Reports, vol. 8 no. 1,DOI: 10.1038/s41598-018-21758-3, 2018.
[21] A. S. Sakr, N. F. Soliman, M. S. Al-Gaashani, P. Pławiak, A. A. Ateya, M. Hammad, "An Efficient Deep Learning Approach for Colon Cancer Detection," Applied Sciences, vol. 12 no. 17, DOI: 10.3390/app12178450, 2022.
[22] F. Wilm, M. Benz, V. Bruns, "Fast Whole-Slide Cartography in Colon Cancer Histology Using Superpixels and CNN Classification," Journal of Medical Imaging, vol. 9 no. 2, DOI: 10.1117/1.jmi.9.2.027501, 2022.
[23] A. Moyes, R. Gault, K. Zhang, J. Ming, D. Crookes, J. Wang, "Multi-Channel Auto-Encoders for Learning Domain Invariant Representations Enabling Superior Classification of Histopathology Images," Medical Image Analysis, vol. 83, DOI: 10.1016/j.media.2022.102640, 2023.
[24] A. B. Gavade, R. B. Nerli, S. Ghagane, P. A. Gavade, V. S. Bhagavatula, "Cancer Cell Detection and Classification From Digital Whole Slide Image," Smart Technologies in Data Science and Communication: Proceedings of SMART-DSC, pp. 289-299, 2023.
[25] Y. Xu, Z. Jia, L. B. Wang, "Large Scale Tissue Histopathology Image Classification, Segmentation, and Visualization via Deep Convolutional Activation Features," BMC Bioinformatics, vol. 18, pp. 281-287, DOI: 10.1186/s12859-017-1685-x, 2017.
[26] S. Cascianelli, R. Bello-Cerezo, F. Bianconi, "Dimensionality Reduction Strategies for CNN-Based Classification of Histopathological Images," Intelligent Interactive Multimedia Systems and Services, pp. 21-30, 2017.
[27] F. Ciompi, O. Geessink, B. E. Bejnordi, "The Importance of Stain Normalization in Colorectal Tissue Classification With Convolutional Networks," 2017 IEEE 14th International Symposium on Biomedical Imaging, pp. 160-163, 2017.
[28] N. Bayramoglu, J. Heikkilä, "Transfer Learning for Cell Nuclei Classification in Histopathology Images," Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, pp. 532-539, 2016.
[29] E. F. Ohata, J. V. Chagas, G. M. Bezerra, M. M. Hassan, V. H. de Albuquerque, P. P. Filho, "A Novel Transfer Learning Approach for the Classification of Histological Images of Colorectal Cancer," The Journal of Supercomputing, vol. 77 no. 9, pp. 9494-9519, DOI: 10.1007/s11227-020-03575-6, 2021.
[30] M. J. Tsai, Y. H. Tao, "Deep Learning Techniques for the Classification of Colorectal Cancer Tissue," Electronics, vol. 10 no. 14, DOI: 10.3390/electronics10141662, 2021.
[31] K. Damkliang, T. Wongsirichot, P. Thongsuksai, "Tissue Classification for Colorectal Cancer Utilizing Techniques of Deep Learning and Machine Learning," Biomedical Engineering: Applications, Basis and Communications, vol. 33 no. 03, DOI: 10.4015/s1016237221500228, 2021.
[32] S. Manivannan, W. Li, J. Zhang, E. Trucco, S. J. McKenna, "Structure Prediction for Gland Segmentation With Hand-Crafted and Deep Convolutional Features," IEEE Transactions on Medical Imaging, vol. 37 no. 1, pp. 210-221, DOI: 10.1109/tmi.2017.2750210, 2018.
[33] H. Chen, C. Li, X. Li, "IL-MCAM: An Interactive Learning and Multi-Channel Attention Mechanism-Based Weakly Supervised Colorectal Histopathology Image Classification Approach," Computers in Biology and Medicine, vol. 143, DOI: 10.1016/j.compbiomed.2022.105265, 2022.
[34] K. S. Wang, G. Yu, C. Xu, "Accurate Diagnosis of Colorectal Cancer Based on Histopathology Images Using Artificial Intelligence," BMC Medicine, vol. 19, pp. 76-82, DOI: 10.1186/s12916-021-01942-5, 2021.
[35] C. Ho, Z. Zhao, X. F. Chen, "A Promising Deep Learning-Assistive Algorithm for Histopathological Screening of Colorectal Cancer," Scientific Reports, vol. 12 no. 1, DOI: 10.1038/s41598-022-06264-x, 2022.
[36] A. Riasatian, M. Babaie, D. Maleki, "Fine-Tuning and Training of DenseNet for Histopathology Image Representation Using TCGA Diagnostic Slides," Medical Image Analysis, vol. 70, DOI: 10.1016/j.media.2021.102032, 2021.
[37] M. Yildirim, A. Cinar, "Classification With Respect to Colon Adenocarcinoma and Colon Benign Tissue of Colon Histopathological Images With a New CNN Model: MA_ColonNET," International Journal of Imaging Systems and Technology, vol. 32 no. 1, pp. 155-162, DOI: 10.1002/ima.22623, 2022.
[38] A. Kumar, A. Vishwakarma, V. Bajaj, "CRCCN-Net: Automated Framework for Classification of Colorectal Tissue Using Histopathological Images," Biomedical Signal Processing and Control, vol. 79, DOI: 10.1016/j.bspc.2022.104172, 2023.
[39] J. D. Jara, S. Bowen, "Learning Curve Analysis on Adam, SGD, and AdaGrad Optimizers on a Convolutional Neural Network Model for Cancer Cells Recognition," ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, vol. 11 no. 3, pp. 263-283, 2022.
[40] M. A. Zeid, K. El-Bahnasy, S. E. Abo-Youssef, "Multiclass Colorectal Cancer Histology Images Classification Using Vision Transformers," 2021 Tenth International Conference on Intelligent Computing and Information Systems (ICICIS), vol. 5, pp. 224-230, DOI: 10.1109/icicis52592.2021.9694125, 2021.
[41] H. C. Reis, V. Turk, "Transfer Learning Approach and Nucleus Segmentation With MedCLNet Colon Cancer Database," Journal of Digital Imaging, vol. 36 no. 1, pp. 306-325, DOI: 10.1007/s10278-022-00701-z, 2022.
[42] M. Khazaee Fadafen, K. Rezaee, "Ensemble-Based Multi-Tissue Classification Approach of Colorectal Cancer Histology Images Using a Novel Hybrid Deep Learning Framework," Scientific Reports, vol. 13 no. 1, DOI: 10.1038/s41598-023-35431-x, 2023.
[43] J. N. Kather, F. G. Zöllner, F. Bianconi, "Collection of Textures in Colorectal Cancer Histology," 2016.
[44] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[45] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
[46] PyTorch, "torch.Tensor.register_hook," https://pytorch.org/docs/stable/generated/torch.Tensor.register_hook.html
[47] L. Jiang, Z. Zhou, T. Leung, L. J. Li, L. Fei-Fei, "MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels," In International Conference on Machine Learning, pp. 2304-2313, 2018.
[48] B. Han, G. Niu, X. Yu, "SIGUA: Forgetting May Make Learning With Noisy Labels More Robust," In International Conference on Machine Learning, vol. 21, pp. 4006-4016, 2020.
[49] X. Xia, T. Liu, B. Han, "Sample Selection With Uncertainty of Losses for Learning With Noisy Labels," arXiv preprint arXiv:2106.00445, 2021.
[50] H. P. Kriegel, P. Kröger, E. Schubert, A. Zimek, "LoOP: Local Outlier Probabilities," Proceedings of the 18th ACM Conference on Information and Knowledge Management, vol. 2, pp. 1649-1652, 2009.
[51] M. M. Breunig, H. P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-Based Local Outliers," ACM SIGMOD Record, vol. 29 no. 2, pp. 93-104, DOI: 10.1145/335191.335388, 2000.
[52] "Recursive Feature Elimination (RFE) for Feature Selection in Python," . https://machinelearningmastery.com/rfe-feature-selection-in-python/
[53] W. Lian, G. Nie, B. Jia, D. Shi, Q. Fan, Y. Liang, "An Intrusion Detection Method Based on Decision Tree-Recursive Feature Elimination in Ensemble Learning," Mathematical Problems in Engineering, vol. 2020,DOI: 10.1155/2020/2835023, 2020.
[54] T. G. Dietterich, "Ensemble Learning. The Handbook of Brain Theory and," Neural Networks, vol. 2 no. 1, pp. 110-125, 2002.
[55] F. Divina, A. Gilson, F. Goméz-Vela, M. García Torres, J. F. Torres, "Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting," Energies, vol. 11 no. 4,DOI: 10.3390/en11040949, 2018.
[56] R. Sikora, "A Modified Stacking Ensemble Machine Learning Algorithm Using Genetic Algorithms," InHandbook of Research on Organizational Transformations Through Big Data Analytics, pp. 43-53, 2015.
[57] M. A. Chandra, S. S. Bedi, "Survey on SVM and Their Application in Image Classification," International Journal of Information Technology, vol. 13 no. 5,DOI: 10.1007/s41870-017-0080-1, 2021.
[58] S. Zhang, X. Li, M. Zong, X. Zhu, R. Wang, "Efficient kNN Classification With Different Numbers of Nearest Neighbors," IEEE Transactions on Neural Networks and Learning Systems, vol. 29 no. 5, pp. 1774-1785, DOI: 10.1109/tnnls.2017.2673241, 2018.
[59] L. Breiman, "Random Forests," Machine Learning, vol. 45 no. 1, DOI: 10.1023/a:1010933404324, 2001.
[60] M. Desai, M. Shah, "An Anatomization on Breast Cancer Detection and Diagnosis Employing Multi-Layer Perceptron Neural Network (MLP) and Convolutional Neural Network (CNN)," Clinical eHealth, vol. 4, DOI: 10.1016/j.ceh.2020.11.002, 2021.
[61] M. P. LaValley, "Logistic Regression," Circulation, vol. 117 no. 18, pp. 2395-2399, DOI: 10.1161/circulationaha.106.682658, 2008.
[62] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, "Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization," In Proceedings of the IEEE International Conference on Computer Vision, pp. 618-626, 2017.
[63] "Gradio," https://www.gradio.app/
[64] "Hugging Face," https://huggingface.co/
[65] L. N. Rizalputri, T. Pranata, N. S. Tanjung, H. M. Auliya, S. Harimurti, I. Anshori, "Colorectal Histology CSV Multi-Classification Accuracy Comparison Using Various Machine Learning Models," In 2019 International Conference on Electrical Engineering and Informatics (ICEEI), pp. 58-62, 2019.
[66] G. A. Murashova, D. Colbry, "GM FASST: General Method for Labeling Augmented Sub-Sampled Images From a Small Data Set for Transfer Learning," Machine Learning With Applications, vol. 6, DOI: 10.1016/j.mlwa.2021.100168, 2021.
[67] A. Kumar, A. K. Bhunia, S. Narayan, "Cross-Modulated Few-Shot Image Generation for Colorectal Tissue Classification," In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 128-137, 2023.
[68] A. Tripathi, A. Misra, K. Kumar, B. K. Chaurasia, "Optimized Machine Learning for Classifying Colorectal Tissues," SN Computer Science, vol. 4 no. 5, DOI: 10.1007/s42979-023-01882-2, 2023.
[69] E. F. Ohata, J. V. Chagas, G. M. Bezerra, M. M. Hassan, V. H. de Albuquerque, P. P. Filho, "A Novel Transfer Learning Approach for the Classification of Histological Images of Colorectal Cancer," The Journal of Supercomputing, vol. 77 no. 9, pp. 9494-9519, DOI: 10.1007/s11227-020-03575-6, 2021.
Copyright © 2025 Ishrat Zahan Tani et al. Applied Computational Intelligence and Soft Computing published by John Wiley & Sons Ltd. https://creativecommons.org/licenses/by/4.0/