Introduction
Colorectal cancer (CRC) is one of the most prevalent and deadly malignancies worldwide, with over 1.85 million new cases and 850,000 deaths annually1. By 2040, the global burden of CRC is projected to reach 3.2 million cases. Histopathological examination remains the gold standard for CRC diagnosis, relying heavily on pathologists’ subjective interpretation of tissue samples. However, this process is time-consuming, labor-intensive, and prone to human error2. The development of Computer-Aided Diagnosis (CAD) systems for automated histopathological image classification is therefore critical to improving diagnostic efficiency and accuracy, enabling timely treatment planning3.
Recent works have shown that deep learning, especially transfer learning and ensemble learning, can significantly improve classification performance in cancer histopathology, including breast and gastric cancers4. Despite significant advancements in deep learning, particularly convolutional neural networks (CNNs), the application of these techniques to medical image analysis faces several challenges5:
Limited annotated data: Acquiring high-quality annotated histopathological images is costly and labor-intensive, resulting in small datasets that hinder model training6.
Feature complexity: CRC histopathological images exhibit high variability in cellular morphology, fine-grained structures, and noise, posing significant challenges for automated analysis. These images typically contain multi-scale features (cellular details at different resolutions), complex backgrounds (unstained regions or noisy labels), and fine-grained differences across subtypes. An effective model must focus on disease-relevant regions while suppressing irrelevant details7.
Single-model limitations: Individual networks often fail to capture the diverse and complex features present in histopathological images, particularly in fine-grained lesion analysis8.
To address these challenges, we propose a novel model based on three integrated convolutional neural networks (CNNs), which combines domain-specific transfer learning and multi-model feature fusion for the classification of colorectal cancer histopathological images with complex features. Our approach aims to:
Mitigate domain shift by aligning feature distributions between the source and target domains.
Enhance feature representation through multi-head self-attention, capturing multi-scale features and facilitating cross-model feature fusion.
Leverage the complementary strengths of multiple pre-trained models to improve classification performance.
The key contributions of this work are:
Domain-specific transfer learning strategy: To address the challenges of data scarcity and domain shift in histopathological image analysis, we propose a novel transfer learning approach. By pretraining CNN models on domain-relevant source datasets and fine-tuning them on the target domain, the method enhances feature representation and improves generalization performance across heterogeneous datasets.
Multi-model feature fusion framework: We design an integrated feature fusion architecture that incorporates a multi-head self-attention mechanism to dynamically model the relationships and complementarities among features extracted from multiple pretrained CNN models. An MLP classifier is further employed to adaptively aggregate these features, leading to improved classification accuracy and robustness.
Comprehensive evaluation across diverse scenarios: Extensive experiments are conducted on three real-world histopathological image datasets (EBHI, Chaoyang, and COAD), which cover challenges such as multi-scale structures, class imbalance, label noise, and fine-grained subtypes. The proposed method consistently outperforms state-of-the-art approaches, demonstrating its effectiveness, adaptability, and practical value in complex clinical settings.
Related works
Deep learning
The application of deep learning to medical image analysis has garnered increasing attention due to its demonstrated feasibility and effectiveness. Convolutional neural networks (CNNs) have been widely recognized for their superior performance in colorectal cancer histopathology image classification. Karthikeyan et al. developed a CNN-based model for colorectal cancer detection, incorporating Gaussian filtering and Otsu thresholding for image pre-processing9. Kumar et al. proposed a lightweight CNN architecture for automated multi-class classification of colorectal histopathological images using the publicly available colorectal histology and NCT-CRC-HE-100 K datasets10. Previous studies have demonstrated that deep learning models achieve excellent results in the classification of colorectal cancer histopathological images11,12. The strong performance of deep learning techniques is attributed to the availability of large amounts of annotated data for model training. However, training samples are often relatively scarce in the field of medical imaging. Chen et al. employed convolutional neural networks (CNNs) to automatically learn fine-grained features for cervical cell classification tasks with limited data and complex classification challenges, using a combination of transfer learning and snapshot ensemble techniques13. This approach reduced the dependency on large annotated datasets, improving the model’s performance on small-sample data. D’Souza et al. highlight the increasing importance of network architecture as the number of available samples decreases14. This underscores the significance of both network structure and the nature of the data when addressing small-data challenges, requiring a shift away from traditional big-data training paradigms.
Transfer learning
Transfer learning is an excellent solution for addressing insufficient training data. While it reduces the reliance of deep learning on large-scale datasets to some extent, it still has its limitations. Cherti et al. 15 investigated the impact of pre-training scale in cross-domain transfer learning. While large-scale pre-training showed notable improvements in same-domain transfers (natural-to-natural, medical-to-medical), its benefits in cross-domain transfers (natural-to-medical), especially in small or limited datasets, were less pronounced. However, a critical limitation of such studies is their reliance on ImageNet as the primary pre-training dataset, which may not adequately capture the domain-specific features required for medical image analysis.
Yong et al.16 used a set of intermediate histopathological datasets for additional fine-tuning of the models in the transfer learning approach to address the differences in features and distributions between the training and testing data. While this approach alleviated some of the domain shift issues, it was only an ad hoc choice and lacked evaluation tests to ensure optimal feature alignment. Wang et al.17 introduced a simple and effective transfer learning method, Easy Transfer Learning, which learns non-parametric transferable features and classifiers by leveraging domain structures, eliminating the need for complex model selection, and improving classification accuracy and algorithmic efficiency.
Despite its simplicity, this method does not explicitly address the challenge of identifying or constructing source domains that are inherently compatible with the target domain. Current methods often suffer from inherent domain mismatch between the source (natural images) and target (medical images) domains, leading to suboptimal feature representations and limited knowledge transfer, particularly for fine-grained histopathological patterns. Therefore, selecting an appropriate source domain to minimize the divergence between the source and target feature spaces is critical for enhancing the classification performance of transfer learning methods on limited-sample datasets.
Feature fusion
Multi-feature fusion is a technique that integrates features from different sources, types, or scales to enhance a model’s ability to understand complex problems and improve predictive performance. Cao et al.18 proposed a novel colorectal histopathological image classification method based on progressive multi-granularity feature fusion, which combines global features with local granular features and employs a progressive learning strategy to enhance classification accuracy. Evaluated on three public datasets, the proposed method outperformed existing methods, achieving classification accuracies of 96.6% and 92.3%, with corresponding precision, recall, and F1-score all exceeding 92%. However, the method relies primarily on straightforward feature concatenation, which may limit its generalizability to more diverse datasets. Liang et al. proposed a multi-scale feature fusion convolutional neural network (MFF-CNN) based on the shearlet transform, which integrates shearlet coefficients from multiple decomposition scales with original histopathological images for feature learning and fusion19. The method achieved 96% identification accuracy and an average F1-score of 95.94%, and significantly reduced the false negative and false positive rates (5.5% and 2.5%). Despite its success, the approach does not explicitly evaluate the complementary contributions of features extracted from different scales.
A multi-scale high- and low-level feature fusion attention network has been proposed by Li et al., incorporating multi-scale feature extraction, detail-capture attention, and dense sampling fusion to enhance the integration of lesion information from superficial and deep layers20. The method achieves 98% classification accuracy on the Kvasir dataset and 97.52% on a private dataset. Although the introduction of the attention mechanism represents a step forward, the study did not thoroughly explore the varying classification performance of different feature datasets, leaving room for further optimization.
In summary, while current feature fusion methods have demonstrated significant improvements in histopathological image classification, they often rely on simplistic fusion strategies, such as concatenation, without adequately assessing the complementary nature of the features being combined. This oversight can result in redundant or less informative feature representations, limiting the full potential of multi-feature fusion.
Self-attention mechanism
The self-attention mechanism, originally introduced in natural language processing, captures internal feature correlations within a single sample through three weight matrices, reducing reliance on external information. Its integration with convolutional layers has since driven notable advancements in image classification tasks21. The multi-head self-attention mechanism, first introduced by Vaswani et al.22 in the Transformer model, enables the learning of diverse feature representations via multiple attention heads. Wang et al.23 applied this mechanism to arrhythmia classification, achieving 99.4% accuracy, 99.41% specificity, and 97.36% sensitivity by effectively capturing global contextual relationships. Similarly, Huang et al.24 combined a Feature Pyramid Network with Vision Transformers to enhance multi-scale feature extraction and focus on critical regions, yielding 4.65%–6.24% improvements in accuracy and F1-score across four medical image datasets. Despite these advancements, the application of self-attention in colorectal cancer (CRC) histopathological image analysis remains underexplored, particularly in addressing complex tissue structures and noisy data. This study systematically investigates colorectal cancer image classification by integrating transfer learning, feature fusion, and self-attention mechanisms. Focusing on datasets with multi-scale features, class imbalance, noisy labels, and fine-grained subtypes, the research aims to identify the limitations of multi-model deep learning approaches and develop more robust solutions for histopathological image analysis.
Data pre-processing
Datasets
In our study, we selected four datasets with distinct characteristics to evaluate the performance of our proposed model from multiple perspectives. Among them, the EBHI25, Chaoyang26, and COAD27 datasets serve as target datasets, encompassing binary and multi-class tasks, as well as challenges such as multi-scale features, class imbalance, noisy labels, and fine-grained classification. This diversity allows for a comprehensive assessment of the model’s performance across various scenarios. Additionally, the NCT-CRC-HE-100K28 dataset is utilized as the source dataset, aiming to enhance the model’s generalization capability by enabling it to learn rich colorectal cancer features through domain-specific transfer learning.
EBHI dataset
The EBHI (Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image) dataset is a newly released public dataset jointly developed by Northeastern University, China Medical University, Liaoning Provincial Cancer Hospital, and the Liaoning Provincial Cancer Research Institute25. It comprises 5532 histopathological electron microscopy images of colorectal cancer across four magnifications (40×, 100×, 200×, and 400×). The dataset covers five tumor differentiation stages: Adenocarcinoma, High-grade Intraepithelial Neoplasia (High-grade IN), Low-grade IN, Polyp, and Normal—which are further grouped into two categories: Benign (Normal, Polyp, Low-grade IN) and Malignant (High-grade IN, Adenocarcinoma).
Figure 1 shows representative images from the EBHI dataset across five tumor stages and four magnifications (40×, 100×, 200×, 400×). Table 1 summarizes the distribution of image counts by stage and magnification. The dataset contains 5532 images, reflecting multi-scale characteristics and varying resolutions. Notably, the distribution across categories and magnifications is imbalanced. Thus, the EBHI dataset represents a multi-class, multi-scale dataset with imbalanced class distribution and a limited sample size.
Fig. 1 [Images not available. See PDF.]
Sample images from the EBHI dataset (containing five classes, each at 40×, 100×, 200×, and 400× magnification).
Table 1. Distribution of EBHI dataset images by class and magnification (40×, 100×, 200×, 400×).
Classes/magnification | 40× | 100× | 200× | 400× | Total |
|---|---|---|---|---|---|
Adenocarcinoma | 205 | 471 | 790 | 812 | 2278 |
High-grade IN | 47 | 80 | 130 | 161 | 418 |
Low-grade IN | 204 | 341 | 603 | 660 | 1808 |
Polyp | 119 | 165 | 254 | 304 | 842 |
Normal | 17 | 29 | 61 | 79 | 186 |
Total | 592 | 1086 | 1838 | 2016 | 5532 |
Chaoyang dataset
The Chaoyang dataset comprises 6160 image patches (512 × 512 pixels, 20 × magnification) extracted from colorectal tissue sections26. It includes four classes: normal, serrated, adenocarcinoma, and adenoma. Annotations were conducted by three expert pathologists. The test set consists of patches with unanimous agreement, while the training set includes samples with approximately 40% annotation disagreement. For these cases, the final label was randomly selected from one of the three pathologists’ annotations.
Figure 2 presents representative samples from the four classes in the Chaoyang dataset, with class distribution detailed in Table 2. Of the 6160 image patches, the adenoma class has notably fewer samples, resulting in class imbalance. Additionally, the presence of annotation inconsistencies introduces label noise, making the Chaoyang dataset a real-world benchmark characterized by limited data, noisy labels, and class imbalance.
Fig. 2 [Images not available. See PDF.]
Sample of images of four classes from Chaoyang dataset.
Table 2. Data scale and distribution of the Chaoyang, COAD, and NCT-CRC-HE-100 K datasets.
| Dataset | Size | Classes | Images/patches |
|---|---|---|---|
| Chaoyang | 6160 | Normal | 1816 |
| | | Serrated | 1163 |
| | | Adenocarcinoma | 2244 |
| | | Adenoma | 937 |
| COAD | 192,312 | MSS | 117,273 |
| | | MSIMUT | 75,039 |
| NCT-CRC-HE-100 K | 100,000 | ADI | 10,407 |
| | | BACK | 10,566 |
| | | DEB | 11,512 |
| | | LYM | 11,557 |
| | | MUC | 8896 |
| | | MUS | 13,536 |
| | | NORM | 8763 |
| | | STR | 10,446 |
| | | TUM | 14,317 |
COAD dataset
The Colon Adenocarcinoma (COAD) dataset contains histological image tiles extracted from Whole Slide Images (WSIs) of colorectal cancer patients in the TCGA cohort27. Original SVS slides were color-normalized using the Macenko method and converted to JPG format for consistency. Patients were labeled by specialists as either “MSS” (microsatellite stable) or “MSIMUT” (microsatellite instable/highly mutated), providing binary classification labels.
Figure 3 presents representative samples from the two categories in the COAD dataset, with detailed class distribution shown in Table 2. The dataset contains 192,312 uniformly stained image patches (224 × 224 pixels), publicly available on Kaggle. Designed for fine-grained tissue subtype recognition, COAD serves as a large-scale binary classification dataset.
Fig. 3 [Images not available. See PDF.]
Sample of images of binary classes from COAD dataset.
NCT-CRC-HE-100 K dataset
The NCT-CRC-HE-100 K dataset consists of 100,000 non-overlapping image patches extracted from H&E-stained histological images of colorectal cancer and normal tissue28. It includes nine tissue classes: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). All images are 224 × 224 pixels at a resolution of 0.5 microns per pixel and are color-normalized using the Macenko method.
Figure 4 displays representative samples from the nine categories in the NCT-CRC-HE-100 K dataset, with detailed class distribution presented in Table 2. The dataset includes 100,000 image patches evenly distributed across all categories. As one of the largest and most diverse colorectal cancer datasets with high-quality, color-normalized images, it provides a strong foundation for training models to learn comprehensive CRC tissue characteristics.
Fig. 4 [Images not available. See PDF.]
Sample of images of nine classes from NCT-CRC-HE-100 K dataset.
Data cropping
Image data preprocessing, including resizing, pixel normalization, and data augmentation, ensures input consistency, reduces noise, and improves model training efficiency and generalization. In this study, we addressed the large image sizes in the EBHI (2048 × 1536) and Chaoyang (512 × 512) datasets by designing a patch cropping algorithm to generate input-suitable image tiles for classification. Figure 5 illustrates the workflow using a sample from the EBHI dataset.
Fig. 5 [Images not available. See PDF.]
The complete workflow of data cropping.
The patch cropping algorithm involves three main steps:
Image tiling: Original images are divided into 256 × 256 pixel patches to match network input dimensions.
Empty patch removal: Patches with less than 10% foreground content are automatically discarded to eliminate non-informative regions.
Patch selection: Remaining valid patches are used for model training and evaluation, ensuring an optimized and informative dataset.
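A minimal sketch of this cropping procedure is shown below. The brightness-based foreground estimate and the helper names (`foreground_ratio`, `crop_informative_patches`) are illustrative assumptions, not the authors' exact implementation; only the 256 × 256 patch size and the 10% foreground threshold come from the text.

```python
import numpy as np
from PIL import Image

PATCH_SIZE = 256             # matches the network input dimension
MIN_FOREGROUND_RATIO = 0.10  # patches below this threshold are discarded

def foreground_ratio(patch: np.ndarray, brightness_thresh: int = 220) -> float:
    """Estimate the tissue (foreground) fraction of an RGB patch.

    Pixels brighter than `brightness_thresh` in all channels are treated as
    background (unstained glass); everything else counts as tissue.
    """
    background = np.all(patch > brightness_thresh, axis=-1)
    return 1.0 - background.mean()

def crop_informative_patches(image_path: str):
    """Tile an image into 256x256 patches and keep only informative ones."""
    image = np.asarray(Image.open(image_path).convert("RGB"))
    h, w, _ = image.shape
    patches = []
    for top in range(0, h - PATCH_SIZE + 1, PATCH_SIZE):
        for left in range(0, w - PATCH_SIZE + 1, PATCH_SIZE):
            patch = image[top:top + PATCH_SIZE, left:left + PATCH_SIZE]
            if foreground_ratio(patch) >= MIN_FOREGROUND_RATIO:
                patches.append(patch)
    return patches
```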
Data augmentation
To enhance sample diversity and improve model generalization, we applied data augmentation techniques to the training datasets. This approach mitigates overfitting and improves model robustness, particularly for datasets with limited samples. A total of 10 augmentation methods were employed, including random rotation, scaling, cropping, flipping, translation, brightness/contrast adjustment, hue adjustment, Gaussian noise, multi-augmentation combinations, and normalization. The study utilized three datasets with varying sample sizes and class distributions. Augmentation intensity was adapted to each dataset based on class imbalance. Augmentations were applied randomly and proportionally to class size, ensuring a more balanced class distribution to support effective classification.
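The following torchvision sketch illustrates such an augmentation pipeline. The set of operations mirrors the list above, but the specific magnitudes and the additive Gaussian-noise transform are assumptions rather than the study's exact settings.

```python
import torch
import torchvision.transforms as T

# Augmentation pipeline applied to training images only; magnitudes are illustrative.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),              # random scaling + cropping
    T.RandomRotation(degrees=30),                             # random rotation
    T.RandomHorizontalFlip(p=0.5),                            # flipping
    T.RandomVerticalFlip(p=0.5),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),          # translation
    T.ColorJitter(brightness=0.2, contrast=0.2, hue=0.05),    # brightness/contrast/hue
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),       # additive Gaussian noise
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),                   # normalization
])
```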
Methodology
Base networks
In this study, we employ transfer learning by leveraging source domain datasets to pre-train deep neural networks as effective feature extractors. To build the core of our proposed model, we selected five high-performing convolutional neural networks (CNNs): ResNet152, DenseNet201, Xception, EfficientNetB0, and EfficientNetV2M—based on criteria such as network diversity, feature complementarity, transferability, and task-specific adaptability. Each selected model represents a widely used architecture in the literature. Through systematic evaluation, we aim to identify optimal network combinations that enable cross-hierarchical and multi-perspective feature representation, facilitating interaction among heterogeneous features and improving model performance in complex medical imaging tasks.
ResNet152, one of the deepest models in the ResNet series, effectively captures intricate pathological features through its deep residual architecture29. Fine-tuning enables rapid adaptation to target data distributions, making it well-suited for high-precision classification tasks involving strong semantic understanding, such as lesion classification.
DenseNet201 utilizes dense connections to promote cross-layer feature reuse, with each dense block containing 6 to 48 convolutional layers30. This structure excels at preserving fine-grained local details, making it highly effective for differentiating visually similar cancer subtypes in histopathological images.
Xception, based on depthwise separable convolutions, offers superior spatial detail modelling, particularly useful in high-resolution image analysis such as nuclear boundary detection31. Its efficient architecture reduces computational load and overfitting risk, enhancing generalization in cross-domain applications.
EfficientNetB0 achieves a balance between accuracy and efficiency through compound scaling strategies32. It is well-suited for lightweight texture classification tasks and operates with lower computational cost compared to larger variants (EfficientNetB1–B7), making it ideal for resource-constrained environments.
EfficientNetV2M, a recent advancement in the EfficientNet series, integrates progressive training and Fused-MBConv modules to optimize both efficiency and accuracy33. It performs well in complex semantic modelling tasks and adapts effectively to target domain data, making it suitable for high-precision medical applications.
Domain-specific transfer learning
Transfer learning enables the application of knowledge learned from large datasets to new target domains, addressing the challenge of limited data availability34. This approach is particularly effective for histopathological image classification tasks with small sample sizes.
To enhance transfer effectiveness, we propose a domain-specific transfer learning scheme that guides networks to learn features from source domains closely related to the target domain. During the fine-tuning phase, we adopt a hierarchical layer-freezing strategy. In general, the shallow layers of convolutional neural networks tend to capture low-level, generic features such as edges and textures, while deeper layers extract high-level semantic information more relevant to the specific task. Based on this hierarchical structure, we progressively unfreeze layers from the shallow to the deep end of the network. The core idea of this strategy is to preserve the general representations learned from the source domain while enabling deeper layers to adapt more effectively to the feature distribution of the target domain. This approach enhances both the accuracy and adaptability of feature extraction, thereby addressing the challenges posed by limited data in histopathological image classification tasks. The workflow involves three key steps:
Network initialization: Networks are initialized with pre-trained weights from ImageNet.
Domain-specific pre-training: Selected layers are frozen, and pre-training is performed on a source dataset similar to the target domain, enabling the network to learn domain-relevant features.
Fine-tuning: The model is fine-tuned on the target dataset, with the fully connected layers replaced and adjusted to match the number of target classes, resulting in a task-specific pretrained model.
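The three steps can be sketched in PyTorch as follows. Freezing the first half of the parameter tensors is used here as a simple proxy for "freezing half of the layers", and the class counts (9 source classes, 4 target classes) correspond to the NCT-CRC-HE-100 K and Chaoyang datasets; ResNet152 stands in for any of the five backbones.

```python
import torch
import torch.nn as nn
from torchvision import models

def freeze_first_half(model: nn.Module) -> None:
    """Freeze the first half of the model's parameter tensors (shallow layers)."""
    params = list(model.parameters())
    for p in params[: len(params) // 2]:
        p.requires_grad = False

# Step 1: network initialization with ImageNet weights.
backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)

# Step 2: domain-specific pre-training on the source dataset (9 NCT-CRC classes).
freeze_first_half(backbone)
backbone.fc = nn.Linear(backbone.fc.in_features, 9)
# ... train `backbone` on the source dataset here ...

# Step 3: fine-tuning on the target dataset — replace the classifier head
# to match the number of target classes (e.g., 4 for Chaoyang).
backbone.fc = nn.Linear(backbone.fc.in_features, 4)
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, backbone.parameters()), lr=1e-4)
# ... fine-tune `backbone` on the target dataset here ...
```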
Figure 6 illustrates the proposed workflow. In this study, the NCT-CRC-HE-100 K dataset is used as the source dataset due to its larger size and richer category coverage compared to other colorectal cancer datasets. By leveraging this dataset, CNNs can acquire more comprehensive histopathological feature representations relevant to colorectal cancer.
Fig. 6 [Images not available. See PDF.]
The workflow of the domain-specific transfer learning.
Multi-model feature fusion framework
Effectively fusing features from multiple distinct networks can significantly improve classification performance35. This paper presents a multi-model feature fusion framework based on the multi-head self-attention mechanism. The core idea is to dynamically assign weights to the features from each pre-trained network using the multi-head self-attention mechanism, enabling adaptive fusion of features.
The proposed framework consists of the following five stages:
Stage 1: Feature extraction
We utilize pre-trained CNNs to extract features from the input target dataset. The selected base networks are the top three performers identified in the domain-specific transfer learning phase. Each network outputs a feature vector corresponding to the number of classes in the classification task, encoding rich semantic information from varying architectural perspectives.
Stage 2: Feature concatenation
To enable interaction among features from different base models, we concatenate the output feature vectors along the class dimension, forming a unified feature matrix. Due to inherent differences in model architecture, the resulting features may vary in scale and semantics. To address this, a linear transformation layer projects the concatenated matrix into a common feature space aligned with the self-attention input dimension. This step mitigates dimensional mismatches and prepares the features for effective interaction within the MHSA module, enabling the modeling of global dependencies and relative feature importance.
Stage 3: Multi-head self-attention mechanism
The Multi-Head Self-Attention (MHSA) mechanism, a core component of the Transformer architecture, partitions the feature space into multiple subspaces, where each head independently learns specific dependencies among input elements. These representations are then fused to form a unified, enriched feature representation. MHSA excels in modelling long-range dependencies and integrating heterogeneous features.
In this study, MHSA is innovatively applied to multi-model feature fusion, enabling dynamic interaction among heterogeneous features from multiple base models and facilitating effective integration of multi-scale pathological features. To address the architectural diversity of feature representations, MHSA is employed to achieve three key objectives:
Dynamic weight assignment: Each attention head computes a feature similarity matrix, and SoftMax normalization assigns dynamic weights, quantifying each base model’s feature contribution.
Cross-model feature interaction: Scaled Dot-Product Attention captures global dependencies across feature representations from different models.
Multi-scale feature integration: Parallel attention heads extract complementary pathological features across scales, enhancing the model’s discriminative capability for complex patterns.
Stage 4: Feature fusion
The output feature matrix, weighted by the Multi-Head Self-Attention (MHSA) mechanism, contains dynamically assigned contributions from each base network. To obtain the final fused representation, we compute the average of the attention-weighted features across the base network dimension, resulting in a compact and enriched feature vector.
Stage 5: MLP classifier classification
The fused feature vector is then passed to an MLP classifier, which performs the final prediction by consolidating the heterogeneous features. Implemented as a Multilayer Perceptron (MLP), it enhances cross-model interaction and refines the fused representation to improve classification accuracy.
The MLP follows a simple yet effective two-layer architecture:
The hidden layer receives the attention-weighted features and maps them to a 64-dimensional latent space.
The output layer includes neurons corresponding to the number of target classes and produces the final classification output.
To ensure robust training, we apply the Rectified Linear Unit (ReLU) activation function in the hidden layer to mitigate the vanishing gradient issue and promote stable learning. Additionally, a Dropout layer with a rate of 0.3 is used after the hidden layer to improve generalization and reduce overfitting by randomly deactivating a subset of neurons during training.
Hyperparameters such as the number of hidden units and dropout rate were optimized through manual tuning and validation based on training dynamics.
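As an illustration of Stages 2–5, the following PyTorch sketch shows one way to implement the fusion module. Treating the three per-model class-score vectors as a length-3 token sequence, the choice of nn.MultiheadAttention, and the 128-dimensional projection space are assumptions; the 8 attention heads, 64-unit hidden layer, ReLU activation, and dropout rate of 0.3 follow the description above.

```python
import torch
import torch.nn as nn

class MultiModelFusionClassifier(nn.Module):
    """Fuse per-model prediction vectors with multi-head self-attention (MHSA)
    and classify the fused representation with a small MLP (illustrative sketch)."""

    def __init__(self, num_classes: int, num_models: int = 3,
                 embed_dim: int = 128, num_heads: int = 8):
        super().__init__()
        # Stage 2: project each model's class-score vector into a shared space.
        self.project = nn.Linear(num_classes, embed_dim)
        # Stage 3: multi-head self-attention across the base models.
        self.mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Stage 5: two-layer MLP classifier (ReLU + dropout 0.3).
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes),
        )

    def forward(self, model_outputs: torch.Tensor) -> torch.Tensor:
        # model_outputs: (batch, num_models, num_classes)
        tokens = self.project(model_outputs)           # (batch, num_models, embed_dim)
        attended, _ = self.mhsa(tokens, tokens, tokens)
        fused = attended.mean(dim=1)                   # Stage 4: average over models
        return self.classifier(fused)                  # (batch, num_classes)

# Usage (names of the base networks are hypothetical):
# outputs = torch.stack([xception(x), effnet_b0(x), effnet_v2m(x)], dim=1)
# logits = MultiModelFusionClassifier(num_classes=4)(outputs)
```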
Figure 7 illustrates the MHSA workflow, and the detailed parameter settings and procedural steps are outlined in Algorithm 1.
Fig. 7 [Images not available. See PDF.]
Workflow of the multi-head self-attention mechanism.
Figure 8 illustrates the workflow of the MLP classifier, and detailed implementation steps are provided in Algorithm 1.
Fig. 8 [Images not available. See PDF.]
Workflow of MLP classifier.
Building upon the five-stage hierarchical design framework outlined above, the core processing workflow of the proposed multi-model feature fusion classification algorithm is formally detailed in the following pseudocode.
To address the challenges of limited data, heterogeneous features, and fine-grained classification in colorectal cancer histopathology, we developed a comprehensive deep learning framework that combines domain-specific transfer learning with advanced feature fusion strategies. This approach integrates knowledge from multiple pre-trained networks and enhances feature representation through self-attention mechanisms. Figure 9 provides a comprehensive overview of the processing framework of the proposed deep learning model, which is based on domain-specific transfer learning and a multi-model feature fusion framework leveraging multi-head self-attention mechanisms.
Fig. 9 [Images not available. See PDF.]
Architecture of the proposed model based on domain-specific transfer learning and multi-model feature fusion framework.
Evaluation metrics
In this study, accuracy, precision, recall, and F1-score are employed as evaluation metrics, all of which are widely used in image classification tasks. To ensure a balanced assessment across all categories, we adopt the macro-averaging strategy, which treats each class equally regardless of sample size.
Accuracy reflects the proportion of correctly classified samples among all samples.
Precision measures the proportion of true positives among all predicted positives.
Recall quantifies the proportion of true positives among all actual positives.
F1-score provides a harmonic mean of precision and recall, offering a balanced evaluation.
Formulas (1)–(4) define these metrics, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \quad (1)$$

$$\mathrm{Precision}=\frac{TP}{TP+FP} \quad (2)$$

$$\mathrm{Recall}=\frac{TP}{TP+FN} \quad (3)$$

$$\mathrm{F1\text{-}score}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \quad (4)$$
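For reference, the macro-averaged metrics above can be computed with scikit-learn as in the following sketch; the `evaluate` helper name is illustrative.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Compute the four metrics with macro averaging (each class weighted equally)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```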
Experiments and results
Experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU and 128 GB of memory, using PyTorch for model implementation. The training was configured with an initial learning rate of 0.0001, Adam optimizer, batch size of 64, and 100 epochs. Datasets were split into 70% training, 20% validation, and 10% testing.
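A minimal sketch of this configuration is given below, assuming a standard PyTorch dataset object. Only the stated hyperparameters (Adam, learning rate 0.0001, batch size 64, 100 epochs, 70/20/10 split) come from the text; the helper names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, random_split

NUM_EPOCHS = 100

def make_loaders(dataset, batch_size: int = 64):
    """Split a dataset 70/20/10 into train/validation/test loaders."""
    n = len(dataset)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    n_test = n - n_train - n_val
    train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(val_set, batch_size=batch_size),
            DataLoader(test_set, batch_size=batch_size))

def make_optimizer(model):
    """Adam optimizer with the initial learning rate used in the experiments."""
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```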
The experimental procedure comprised the following steps:
Transfer learning
Five deep CNNs were pre-trained on a source dataset. To evaluate fine-tuning strategies, different layer-freezing configurations were tested (freezing half, one-quarter, and all layers except the final layer).
Domain-specific transfer learning
The pre-trained networks were fine-tuned on three target datasets, and their classification performance was evaluated.
Feature fusion
The top three performing networks were selected and integrated using the proposed multi-head self-attention-based fusion method to generate final classification predictions.
Model evaluation
The final model was assessed using accuracy, precision, recall, and F1-score to validate its robustness and effectiveness.
Experimental results of the proposed model
We initialized ResNet152, DenseNet201, Xception, EfficientNetB0, and EfficientNetV2M with ImageNet pre-trained weights and performed transfer learning using the NCT-CRC-HE-100 K dataset as the source. Three layer-freezing strategies were evaluated: freezing half of the layers, one-quarter of the layers, and fine-tuning only the final fully connected layer. As shown in Table 3, freezing half of the layers yielded the highest classification accuracy, with EfficientNetV2M achieving 99.51%. This result suggests that allowing more layers to be retrained enables the network to learn richer, domain-specific features, thereby improving classification performance.
Table 3. Classification accuracy (%) of five CNN architectures using three layer-freezing strategies during transfer learning on the NCT-CRC-HE-100 K dataset. The highest accuracy for each model is highlighted in bold. The “Freeze 1/2 Layer” strategy yielded the best overall performance.
CNNs | Freeze 1/2 layer | Freeze 1/4 layer | Only fine-tune last layer |
|---|---|---|---|
Resnet152 | 99.24 ± 0.07 | 99.17 ± 0.06 | 96.81 ± 0.14 |
Densenet201 | 99.42 ± 0.05 | 99.02 ± 0.08 | 98.46 ± 0.21 |
Xception | 99.48 ± 0.05 | 99.23 ± 0.07 | 98.70 ± 0.13 |
EfficientNetB0 | 99.35 ± 0.06 | 98.71 ± 0.09 | 98.75 ± 0.15 |
EfficientNetV2M | 99.51 ± 0.05 | 99.41 ± 0.05 | 98.76 ± 0.17 |
Based on prior results, freezing half of the network layers yielded the best classification performance. Accordingly, five pre-trained networks with 1/2 layers frozen were selected for domain-specific transfer learning and evaluated on three target datasets: EBHI, Chaoyang, and COAD. Table 4 presents the classification accuracy on the validation sets of these datasets.
Table 4. Classification accuracy (%) of five pre-trained networks after domain-specific transfer learning on three target datasets. Among the models, Xception, EfficientNetB0, and EfficientNetV2M demonstrated the best overall performance. Highest accuracies are highlighted in bold.
CNNs | EBHI 40× | EBHI 100× | EBHI 200× | EBHI 400× | Chaoyang | COAD |
|---|---|---|---|---|---|---|
Resnet152 | 94.58 ± 0.19 | 95.36 ± 0.16 | 95.95 ± 0.17 | 95.21 ± 0.20 | 75.61 ± 1.24 | 93.17 ± 0.45 |
Densenet201 | 95.34 ± 0.18 | 96.62 ± 0.19 | 97.67 ± 0.16 | 96.83 ± 0.17 | 82.23 ± 1.32 | 96.11 ± 0.66 |
Xception | 96.88 ± 0.12 | 97.33 ± 0.14 | 97.89 ± 0.13 | 96.80 ± 0.15 | 83.91 ± 1.19 | 97.37 ± 0.42 |
EfficientNetB0 | 97.10 ± 0.16 | 97.46 ± 0.15 | 97.94 ± 0.14 | 96.99 ± 0.12 | 84.04 ± 1.23 | 97.40 ± 0.59 |
EfficientNetV2M | 97.03 ± 0.19 | 97.52 ± 0.17 | 97.82 ± 0.15 | 97.58 ± 0.16 | 83.27 ± 1.47 | 96.72 ± 0.58 |
As shown in Table 4, Xception, EfficientNetB0, and EfficientNetV2M demonstrated superior performance. Notably, EfficientNetV2M achieved the highest accuracy of 97.52% and 97.58% on the 100× and 400× subsets of EBHI, respectively. EfficientNetB0 recorded the best performance on the 40× (97.10%) and 200× (97.94%) subsets of EBHI, and attained the highest accuracy on the Chaoyang (84.04%) and COAD (97.40%) datasets.
We selected the top three networks from the domain-specific transfer learning phase: Xception, EfficientNetB0, and EfficientNetV2M, as base classifiers in the proposed feature fusion framework. These models were used for feature extraction, followed by final classification using a Multi-Head Self-Attention mechanism on three colorectal cancer datasets.
As shown in Table 5, the proposed model achieved strong performance across all four magnification subsets of the EBHI dataset: 98.79% (40×), 99.05% (100×), 99.68% (200×, the highest), and 99.28% (400×). On the Chaoyang dataset, the model achieved 86.72% accuracy, and on the COAD dataset, 99.44%. Additionally, high scores were recorded across precision, recall, and F1-score, indicating robust and consistent classification performance.
Table 5. Classification performance (%) of the proposed model on three colorectal cancer datasets, including subsets with varying magnification levels. The model demonstrates strong and consistent performance across all evaluation metrics.
| Dataset | Subset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| EBHI | 40× | 98.79 ± 0.11 | 98.76 ± 0.12 | 98.78 ± 0.11 | 98.77 ± 0.13 |
| EBHI | 100× | 99.05 ± 0.13 | 98.87 ± 0.10 | 98.92 ± 0.12 | 98.89 ± 0.12 |
| EBHI | 200× | 99.68 ± 0.08 | 99.67 ± 0.07 | 99.69 ± 0.09 | 99.68 ± 0.08 |
| EBHI | 400× | 99.28 ± 0.06 | 99.29 ± 0.08 | 99.27 ± 0.06 | 99.29 ± 0.07 |
| Chaoyang | – | 86.72 ± 1.29 | 85.54 ± 1.57 | 85.90 ± 1.52 | 85.76 ± 1.61 |
| COAD | – | 99.44 ± 0.07 | 99.39 ± 0.05 | 99.41 ± 0.07 | 99.42 ± 0.06 |
Ablation study analysis
Our proposed model achieves joint optimization of cross-domain feature representation and multi-model feature interaction through a synergistic architecture that combines a domain-specific transfer learning approach with a multi-head self-attention (MHSA) feature fusion mechanism. Both components play a critical role in enhancing the model’s classification performance. To quantitatively evaluate the necessity and contribution of each component, we conducted five sets of ablation experiments: (1) effectiveness of domain-specific transfer learning, (2) impact of the MHSA feature fusion mechanism, (3) hyperparameter comparison of attention head numbers, (4) performance evaluation of different network combinations, and (5) visualization of model interpretability through Grad-CAM heatmaps. The primary evaluation metric was the average classification accuracy across each dataset under the different experimental settings.
First ablation experiment: effectiveness of domain-specific transfer learning
In the first ablation experiment, we assessed the effectiveness of the domain-specific transfer learning strategy and its impact on model performance. The experimental design excluded the MHSA feature fusion mechanism, and a classification model was constructed by directly integrating three pre-trained CNNs. Two transfer learning strategies were compared: a baseline approach and the proposed domain-specific method.
Baseline Transfer Learning (BTL): Only the fully connected layers of the three CNNs were fine-tuned using target domain data, following standard transfer learning practice.
Domain-Specific Transfer Learning (DTL): The proposed method was applied, freezing the first half of the convolutional layers in each CNN, while the remaining layers were pre-trained on a source dataset before fine-tuning on the target task.
The classification accuracy results on the three target datasets are shown in Fig. 10.
Fig. 10 [Images not available. See PDF.]
Validation of the effectiveness of domain-specific transfer learning (DTL). Classification accuracy (%) is compared between baseline transfer learning (BTL) and DTL across multiple datasets. DTL consistently outperforms BTL, highlighting its superiority in feature adaptation and generalization.
As illustrated in Fig. 10, the DTL method (green bars) consistently outperformed the BTL method (blue bars) across all datasets. On the EBHI dataset, DTL yielded an accuracy improvement ranging from 0.58% to 0.66% across different magnification levels. On the Chaoyang dataset, the improvement was 0.64%, and on the COAD dataset, 0.79%. These results confirm that DTL offers more stable and effective performance improvements across datasets with diverse characteristics, demonstrating its robustness and advantage over standard transfer learning.
Second ablation experiment: contribution of the multi-head self-attention mechanism
In the second ablation experiment, we evaluated the impact of the multi-head self-attention (MHSA) feature fusion mechanism on overall model performance. The experiment was designed as follows:
In the first configuration, we constructed a domain-specific transfer learning model (DTL) using simple feature concatenation as the multi-branch fusion strategy.
In the second configuration (DTL + MHSA), we incorporated the MHSA module into the DTL framework to dynamically aggregate heterogeneous features via attention-based weighting.
Classification experiments were conducted on three datasets. By comparing the two configurations, we assessed the effectiveness of MHSA in enhancing feature fusion and improving classification outcomes. The results are presented in Fig. 11.
Fig. 11 [Images not available. See PDF.]
Validation of the multi-head self-attention feature fusion mechanism effectiveness. Feature fusion based on MHSA, combined with domain-specific transfer learning (DTL + MHSA), achieved better classification accuracy. Classification accuracy is measured in % unit.
As shown in Fig. 11, the DTL + MHSA model (green bars) consistently outperforms the DTL without MHSA (blue bars), confirming the contribution of MHSA to model performance.
On the EBHI dataset, MHSA improved classification accuracy by 0.52% to 0.76% across magnification levels, indicating its strength in capturing cross-scale discriminative features.
On the Chaoyang dataset, accuracy increased by 1.85%, demonstrating MHSA’s robustness in handling noisy data and identifying key pathological patterns.
On the COAD dataset, MHSA led to a 1.12% improvement, underscoring its ability to enhance the extraction of local structural features critical for fine-grained tissue subtype classification.
These results collectively demonstrate that integrating MHSA-based fusion into the domain-specific transfer learning framework significantly strengthens the model’s ability to recognize complex patterns and improves overall classification performance.
Third ablation experiment: influence of attention head count in MHSA
In the third ablation experiment, we investigated the effect of the number of attention heads in the Multi-Head Self-Attention (MHSA) mechanism on classification performance. The experiment was designed as follows: multiple models were constructed based on the DTL + MHSA architecture, varying only the number of attention heads (1, 4, 8, 16, and 32), while keeping all other hyperparameters constant. This setup aimed to explore how head count affects the ability of MHSA to capture and integrate heterogeneous features across base networks. The models were evaluated on three target datasets, and the classification accuracies are presented in Table 6.
Table 6. Comparison of classification accuracy (%) across different numbers of attention heads in the MHSA mechanism. The 8-head configuration achieved the highest performance. Best results are highlighted in bold.
Number of heads | EBHI 40× | EBHI 100× | EBHI 200× | EBHI 400× | Chaoyang | COAD |
|---|---|---|---|---|---|---|
1 head | 97.31 ± 0.67 | 97.61 ± 0.45 | 97.89 ± 0.32 | 97.84 ± 0.26 | 85.02 ± 1.13 | 98.01 ± 0.08 |
4 heads | 98.32 ± 0.35 | 98.39 ± 0.12 | 99.16 ± 0.21 | 99.04 ± 0.09 | 85.83 ± 1.09 | 99.36 ± 0.05 |
8 heads | 98.79 ± 0.11 | 99.05 ± 0.13 | 99.69 ± 0.08 | 99.28 ± 0.06 | 86.72 ± 1.29 | 99.44 ± 0.07 |
16 heads | 98.61 ± 0.13 | 98.89 ± 0.10 | 99.54 ± 0.16 | 99.15 ± 0.07 | 86.58 ± 1.14 | 99.32 ± 0.05 |
32 heads | 98.57 ± 0.21 | 98.78 ± 0.11 | 99.43 ± 0.09 | 99.17 ± 0.03 | 86.53 ± 1.10 | 99.28 ± 0.06 |
As shown in Table 6, performance improved consistently with an increasing number of heads up to 8, after which further increases led to a plateau or slight decline:
On the EBHI dataset, classification accuracy increased with head count from 1 to 8, reaching peak values of 98.79%, 99.05%, 99.69%, and 99.28% on the 40×, 100×, 200×, and 400× subsets, respectively. Performance declined slightly at 16 and 32 heads.
On the Chaoyang dataset, accuracy rose from 85.02% (1 head) to a maximum of 86.72% (8 heads), then stabilized or slightly declined with higher head counts.
On the COAD dataset, accuracy improved from 98.01% (1 head) to a peak of 99.44% at 8 heads, with no further gains observed at 16 or 32 heads.
These findings suggest that MHSA improves feature representation through parallel subspace modelling, enabling more effective multi-model feature integration. However, increasing the number of heads beyond a certain point does not yield additional benefits and may introduce redundancy or noise. The 8-head configuration, adopted in our proposed model, achieved the best balance between performance, stability, and generalizability across all datasets.
Fourth ablation experiment: impact of CNN model combinations on classification performance
The fourth ablation experiment explored how different combinations of convolutional neural networks (CNNs) influence classification performance when integrated within the domain-specific transfer learning (DTL) and multi-head self-attention (MHSA) framework. To this end, we evaluated seven configurations, ranging from individual CNNs to dual- and triple-network combinations, allowing us to examine not only the performance of each model but also the synergies that emerge through feature fusion.
Single-model setups, using only DTL without MHSA, consistently produced lower accuracy across all three datasets. While these individual networks—Xception, EfficientNetB0, and EfficientNetV2M—are capable of capturing meaningful features on their own, they appear limited in their ability to generalize across the diverse structural patterns present in histopathological images. This suggests that relying on a single model constrains the representational depth and fails to fully capture the complexity of the data.
In contrast, combining two networks significantly improved performance. These dual-model configurations benefitted from complementary feature extraction capabilities, with each network contributing unique perspectives to the classification task. Particularly on the EBHI dataset, we observed marked gains across magnification levels, indicating that multi-model fusion helps in capturing both low- and high-resolution characteristics of pathological regions. Similar trends were observed on the Chaoyang and COAD datasets, highlighting the generalizability of this improvement across datasets with varying noise levels and class distributions.
Most notably, the integrated three-model combination—Xception, EfficientNetB0, and EfficientNetV2M—outperformed all other setups across all datasets. This configuration represents the proposed final model and showcases the cumulative advantage of combining diverse architectures. The ensemble not only achieves the highest accuracy but also demonstrates consistent robustness across different magnifications and dataset characteristics. As shown in Table 7, this configuration consistently achieves top performance across EBHI, Chaoyang, and COAD datasets, reinforcing the value of multi-model feature fusion, particularly when guided by MHSA, in capturing a broader and more nuanced range of discriminative features.
Table 7. Classification accuracy (%) of different CNN model combinations. The three-model configuration achieved the highest performance across all datasets. Best results are highlighted in bold.
CNN models | EBHI 40× | EBHI 100× | EBHI 200× | EBHI 400× | Chaoyang | COAD |
|---|---|---|---|---|---|---|
Xception only | 96.88 ± 0.12 | 97.33 ± 0.14 | 97.89 ± 0.13 | 96.80 ± 0.15 | 83.91 ± 1.19 | 97.37 ± 0.42 |
EfficientNetB0 only | 97.10 ± 0.16 | 97.46 ± 0.15 | 97.94 ± 0.14 | 96.99 ± 0.12 | 84.04 ± 1.23 | 97.40 ± 0.59 |
EfficientNetV2M only | 97.03 ± 0.19 | 97.52 ± 0.17 | 97.82 ± 0.15 | 97.58 ± 0.16 | 83.27 ± 1.47 | 96.72 ± 0.58 |
Xception + EfficientNetB0 | 97.52 ± 0.49 | 98.02 ± 0.13 | 98.77 ± 0.10 | 98.28 ± 0.19 | 85.45 ± 1.13 | 98.06 ± 1.21 |
Xception + EfficientNetV2M | 97.27 ± 0.32 | 98.23 ± 0.15 | 98.85 ± 0.09 | 98.31 ± 0.22 | 84.96 ± 1.01 | 97.94 ± 1.19 |
EfficientNetB0 + EfficientNetV2M | 97.39 ± 0.41 | 98.31 ± 0.16 | 98.63 ± 0.11 | 98.33 ± 0.20 | 84.91 ± 1.13 | 97.90 ± 1.25 |
Xception + EfficientNetB0 + EfficientNetV2M | 98.79 ± 0.11 | 99.05 ± 0.13 | 99.68 ± 0.08 | 99.28 ± 0.06 | 86.72 ± 1.29 | 99.44 ± 0.07 |
Overall, the results from this experiment validate the hypothesis that integrating multiple CNNs through attention-guided fusion mechanisms can significantly enhance both accuracy and stability. The superior performance of the three-model configuration illustrates the strength of architectural diversity and cross-model interaction in advancing histopathological image classification.
Fifth ablation experiment: interpretability of the model via Grad-CAM visualization
In the fifth ablation experiment, we investigate the interpretability of the proposed model by visualizing Grad-CAM heatmaps under different configurations of the multi-head self-attention (MHSA) mechanism. Specifically, we compare heatmaps generated by models with: No MHSA, MHSA with a single head, 4 heads, 8 heads, and 16 heads. These models were applied to histopathological image samples of colorectal cancer tissues across three datasets. The heatmaps illustrate which regions the model focuses on during classification, revealing how different attention configurations affect the identification of discriminative features.
The intensity of the heatmap corresponds to the model’s attention: red indicates high attention, while blue indicates low attention. As shown in Fig. 12, the model without MHSA can still highlight some cancer-related regions, but its focus is less accurate—often missing important areas or misidentifying irrelevant regions. In contrast, the inclusion of MHSA significantly enhances the localization of discriminative features.
Fig. 12 [Images not available. See PDF.]
Grad-CAM heatmaps generated on the three datasets. The attention regions produced by models with MHSA are more accurate than No MHSA, with the 8-head MHSA focusing most precisely on the cancerous regions.
Furthermore, comparing MHSA configurations with increasing head counts (from 1 to 16), we observe that the attention becomes more concentrated and the heatmap intensity increases accordingly. This confirms that each attention head captures different feature dimensions, and more heads generally enable the model to focus on a broader set of discriminative features. However, while the 8-head configuration provides the most accurate focus on cancerous regions, the 16-head model tends to over-attend, sometimes highlighting non-cancerous areas. This suggests that more heads do not necessarily guarantee better classification performance.
Overall, the heatmaps in Fig. 12 provide intuitive visual evidence that incorporating MHSA enhances the model’s ability to identify relevant pathological features, thereby improving its classification performance. These results also demonstrate the interpretability of our proposed model.
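For readers who wish to reproduce this kind of visualization, the following is a minimal Grad-CAM sketch in PyTorch. It is a generic implementation of the technique rather than the authors' exact tool; the hook-based design and function name `grad_cam` are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: heatmap of the regions driving a class prediction.

    `image` is a (1, 3, H, W) tensor; `target_layer` is the convolutional
    layer whose activations are visualized (e.g., the last conv block).
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()

        # Weight each feature map by its average gradient, then combine.
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
    return cam  # (1, 1, H, W) heatmap in [0, 1]
```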
Through the five ablation experiments, we systematically evaluated the impact of the proposed domain-specific transfer learning (DTL) strategy and multi-head self-attention (MHSA) feature fusion mechanism on model performance. The results demonstrate that DTL significantly enhances the model’s generalization and stability in medical image classification tasks. MHSA further improves performance by adaptively aggregating heterogeneous features, enabling more effective cross-scale feature extraction and recognition.
Optimization of the attention head count with MHSA revealed that a moderate configuration, specifically eight heads, yields the best performance, while further increases offer no additional benefit and may introduce redundancy. Experiments with different CNN combinations showed that multi-model integration substantially improves discriminative capability and robustness. The three-model combination (Xception + EfficientNetB0 + EfficientNetV2M), as adopted in the proposed framework, consistently achieved the highest classification accuracy. Additionally, interpretability experiments using Grad-CAM demonstrate that MHSA enables the model to more accurately focus on key regions in pathological images, thereby enhancing the transparency and reliability of its decisions.
These findings collectively validate the effectiveness, interpretability, and generalizability of the proposed model for colorectal cancer histopathological image classification.
Extended experiments
To further validate the effectiveness, robustness, and generalizability of the proposed model, we conducted additional experiments on histopathological datasets beyond colorectal cancer. These experiments aimed to assess the model’s ability to generalize across distinct cancer types and tissue structures, thereby demonstrating that the learned domain-specific features are not restricted to colorectal cancer alone.
Two publicly available datasets were used for extended evaluation:
LC25000 dataset36: This dataset includes 25,000 histopathological images related to lung and colon cancer, divided evenly across five classes (5000 images per class). It comprises both lung tissue categories—normal (lung_n), adenocarcinoma (lung_aca), and squamous cell carcinoma (lung_scc)—and colon tissue categories—normal (colon_n) and adenocarcinoma (colon_aca).
Cervical cell image dataset (Herlev)37: Widely used in cervical cancer screening research, this dataset contains 917 single-cell images categorized into seven classes. It includes three benign categories (superficial squamous epithelium, intermediate squamous epithelium, and columnar epithelium) and four malignant categories (mild, moderate, and severe squamous non-keratinizing dysplasia, and squamous cell carcinoma).
These extended experiments support the model’s capacity to generalize well across diverse histopathological image domains, reinforcing its applicability to broader medical image classification tasks.
To evaluate the generalizability of the proposed model beyond colorectal cancer, we conducted classification experiments on two additional histopathological datasets: LC25000 and Herlev. These datasets represent different tissue types and classification challenges, providing a robust test of the model’s adaptability. Table 8 summarizes the classification accuracy achieved by the proposed model on both datasets.
Table 8. Classification accuracy (%) comparison on LC25000 and Herlev datasets. The proposed model achieved the highest performance across both datasets. Best results are highlighted in bold.
Reference | LC25000 | Herlev |
|---|---|---|
Mehmood et al.38 | 98.4 | – |
Omar et al.39 | 99.44 | – |
Savaş et al.40 | 99.78 | – |
Rahaman et al.41 | – | 98.32 |
Xue et al.42 | – | 98.37 |
Zhang et al.43 | – | 99.14 |
Our proposed | 99.98 ± 0.02 | 99.67 ± 0.07 |
On the LC25000 dataset, the model attained an accuracy of 99.98%, outperforming previous studies by Mehmood et al.38 (98.40%) and Omar et al.39 (99.44%). It also surpassed the performance of Savaş et al.40, whose ensemble of multiple CNNs reached an accuracy of 99.78%. The superior performance of our model is attributed to the integration of domain-specific transfer learning and the multi-head self-attention mechanism, which were not incorporated in their approach.
Similarly, on the Herlev dataset, it achieved 99.67%, exceeding the performance of Rahaman et al.41 and Xue et al.42 by 1.35% and 1.30%, respectively. It further outperformed the A2SDNet121 model proposed by Zhang et al.43, which integrates four Atrous Dense Blocks and Squeeze-and-Excitation (SE) modules and achieved an accuracy of 99.14%.
These results highlight the model’s strong generalization capability across diverse histopathological image domains and its effectiveness in handling multi-class classification tasks beyond the original colorectal cancer context.
Discussion
Comparison with state-of-the-art models
To comprehensively evaluate the performance of the proposed model, we conducted a comparative analysis with representative recent studies across three colorectal cancer histopathological image datasets. The results are summarized in Table 9.
Table 9. Classification accuracy (%) comparison between the proposed model and recent state-of-the-art methods across three colorectal cancer datasets. The proposed model achieved the highest accuracy in all cases. Best results are highlighted in bold.
Reference | EBHI (200×) | Chaoyang | COAD | Model details |
|---|---|---|---|---|
Hu et al.25 | 95.37 | – | – | VGG16 with transfer learning |
Yengec et al.44 | 91.10 | – | – | Ensemble learning with stain normalization |
Bisht et al.45 | 98.6 | – | – | DeepCRC-Net |
Proposed | 99.57 ± 0.06 | – | – | Feature fusion with domain-specific transfer learning |
Zhu et al.26 | – | 83.40 | – | Hard sample aware noise robust learning |
Yong et al.16 | – | 85.69 ± 1.48 | – | Weighted averaging ensemble learning |
He et al.46 | – | 82.4 | – | Prototype-guided Long-Tail Noisy Learning |
Proposed | – | 86.72 ± 1.29 | – | Feature fusion with domain-specific transfer learning |
Tien et al.47 | – | – | 94.70 | Integrated deep learning pipeline |
Bilal et al.48 | – | – | 98.95 | Weakly supervised learning |
Yu et al.49 | – | – | 86.80 ± 0.0073 | Hierarchical attention-based Transformer |
Proposed | – | – | 99.44 ± 0.07 | Feature fusion with domain-specific transfer learning |
On the EBHI dataset, previous studies by Hu et al.25, Yengec et al.44, and Bisht et al.45 used the 200× subset to classify images into two categories: benign (normal, polyp, and low-grade IN) and malignant (high-grade IN and adenocarcinoma). For consistency, we adopted the same experimental setup. Hu et al.25 utilized transfer learning with VGG16, achieving 95.37% accuracy. Yengec et al.44 employed an ensemble of ConvNeXt-B and ConvNeXt-Tiny, achieving 91.10%. Bisht et al.45 proposed a dual-track architecture that combines the Xception network with an efficient lightweight local feature-fusion network, achieving 98.60%. In contrast, our proposed model achieved 99.57%, significantly outperforming them.
On the Chaoyang dataset, Zhu et al.26 applied a CNN with noise-robust learning, reaching 83.40%, while Yong et al.16 proposed an ensemble of five CNNs with average weighting, achieving 85.69%. He et al.46 proposed a prototype-guided method incorporating adaptive multi-prototype learning and sharpness-aware mixup, achieving 82.40%. Our model surpassed all of them with 86.72% accuracy.
For the COAD dataset, Tien et al.47 implemented an integrated deep learning pipeline with 94.70% accuracy, and Bilal et al.48 used weakly supervised learning with ResNet34, achieving 98.95%. Yu et al.49 proposed a multi-scale WSI Transformer with cross-scale hierarchical attention, achieving 86.80% accuracy and outperforming existing Transformer- and MIL-based methods. The proposed model outperformed all of these approaches with an accuracy of 99.44%.
Overall, the proposed model consistently exceeds the performance of existing state-of-the-art approaches across all datasets. This improvement is primarily attributed to two key innovations:
Domain-specific transfer learning, which enables the model to extract richer, more relevant features through targeted pre-training.
Multi-head self-attention-based feature fusion, which effectively integrates complementary features from multiple CNNs, addressing the limitations of individual models and enhancing overall discriminative capability.
Computational efficiency and model limitations
While the proposed model achieves state-of-the-art classification performance across three primary colorectal cancer datasets and two additional histopathological datasets, it introduces notable computational overhead. These limitations, although expected given the model’s architectural complexity, are important considerations for real-world applicability, particularly in resource-constrained or time-sensitive settings. The following key insights highlight the trade-offs between computational cost and performance, along with their implications for deployment.
Domain-specific transfer learning boosts robustness—but increases training time
A major contributor to this overhead is the domain-specific transfer learning (DTL) strategy. By design, DTL improves generalization by pre-training the model on a source dataset before fine-tuning on the target domain, enabling it to learn more domain-relevant and transferable features. However, this two-stage process is computationally intensive, especially as it must be performed independently for each of the three base CNNs. This results in significantly longer training times compared to conventional single-model approaches. Nonetheless, this time investment is often justified in the medical domain, where labeled data is scarce, and generalization across varying tissue types is critical. The pre-training effort effectively reduces dependence on large annotated target datasets and improves the model’s robustness across datasets with differing characteristics.
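For illustration, the listing below gives a minimal sketch of this two-stage procedure in TensorFlow/Keras: source-domain pre-training with half of the backbone layers frozen, followed by target-domain fine-tuning with all layers trainable. The backbone choice, class counts, learning rates, and the dataset objects `source_ds` and `target_ds` are illustrative assumptions, not the exact configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(backbone_fn, num_classes, input_shape=(224, 224, 3)):
    """Wrap an ImageNet-pretrained backbone with a small classification head."""
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=input_shape, pooling="avg")
    inputs = tf.keras.Input(shape=input_shape)
    outputs = layers.Dense(num_classes, activation="softmax")(backbone(inputs))
    return models.Model(inputs, outputs), backbone

def freeze_first_half(backbone):
    """Freeze the first half of the backbone layers ('1/2 layers frozen' stage)."""
    cutoff = len(backbone.layers) // 2
    for layer in backbone.layers[:cutoff]:
        layer.trainable = False

# Stage 1: source-domain pre-training on a related histopathology dataset
# (e.g. NCT-CRC-HE-100K with 9 tissue classes -- an assumed configuration).
model, backbone = build_classifier(tf.keras.applications.Xception, num_classes=9)
freeze_first_half(backbone)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(source_ds, epochs=..., validation_data=...)

# Stage 2: target-domain fine-tuning (EBHI / Chaoyang / COAD).
# In practice the classification head is re-initialized for the target classes;
# here we simply unfreeze everything and continue with a smaller learning rate.
for layer in backbone.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(target_ds, epochs=..., validation_data=...)
```

The same two-stage routine is repeated for each of the three base CNNs, which is what multiplies the training cost relative to a single-model pipeline.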
Target-domain fine-tuning time increases with multi-model architecture
Quantitative comparisons in Table 10 further highlight this trade-off. On average, target-domain fine-tuning time for the proposed model is 2–3 times that of a single CNN, with per-image-per-epoch processing times ranging from 0.0161 to 0.0225 s, compared to 0.0026 to 0.0109 s for single models across the EBHI, Chaoyang, and COAD datasets. This increase is an expected result of the separate fine-tuning process for each network, followed by integration in the fusion stage.
Table 10. Per-image processing time (in seconds) for different computational stages across datasets. The comparison includes fine-tuning, classification testing, transfer learning, and feature fusion steps for baseline and proposed models.
| Models | Processing stage | EBHI 40× | EBHI 100× | EBHI 200× | EBHI 400× | Chaoyang | COAD |
|---|---|---|---|---|---|---|---|
| Xception | Target-domain fine-tuning | 0.0077 | 0.0078 | 0.0077 | 0.0077 | 0.0109 | 0.0076 |
| | Classification testing | 0.0039 | 0.0040 | 0.0039 | 0.0039 | 0.0073 | 0.0037 |
| EfficientNetB0 | Target-domain fine-tuning | 0.0030 | 0.0030 | 0.0030 | 0.0029 | 0.0041 | 0.0026 |
| | Classification testing | 0.0020 | 0.0020 | 0.0020 | 0.0020 | 0.0031 | 0.0017 |
| EfficientNetV2M | Target-domain fine-tuning | 0.0064 | 0.0065 | 0.0064 | 0.0063 | 0.0075 | 0.0059 |
| | Classification testing | 0.0030 | 0.0030 | 0.0029 | 0.0028 | 0.0040 | 0.0025 |
| Proposed | Source-domain transfer learning (1/2 layers frozen) | 0.0365 | 0.0365 | 0.0365 | 0.0365 | 0.0365 | 0.0365 |
| | Target-domain fine-tuning | 0.0171 | 0.0173 | 0.0171 | 0.0169 | 0.0225 | 0.0161 |
| | Multi-model feature fusion | 0.0059 | 0.0060 | 0.0059 | 0.0059 | 0.0067 | 0.0048 |
| | Final classification testing | 0.0044 | 0.0044 | 0.0044 | 0.0044 | 0.0051 | 0.0036 |
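The per-image figures above can, in principle, be reproduced by timing each stage and normalizing by the number of images processed. The snippet below is a hypothetical timing harness for this purpose, not the instrumentation used in the paper; the model, dataset objects, and image counts are placeholders.

```python
# Hypothetical per-image timing: wall-clock time for one stage divided by the
# number of images it processes (placeholders, not the paper's exact code).
import time
import tensorflow as tf

def per_image_seconds(run_stage, num_images):
    """Time a training/inference stage and normalize by image count."""
    start = time.perf_counter()
    run_stage()
    return (time.perf_counter() - start) / num_images

# Example usage (model, train_ds, test_ds, len_train, len_test are placeholders):
# ft_time  = per_image_seconds(lambda: model.fit(train_ds, epochs=1, verbose=0),
#                              num_images=len_train)
# inf_time = per_image_seconds(lambda: model.predict(test_ds, verbose=0),
#                              num_images=len_test)
# print(f"fine-tuning: {ft_time:.4f} s/image, inference: {inf_time:.4f} s/image")
```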
Inference time remains practical for clinical deployment
The inference phase, in contrast, remains highly practical. Despite using three CNNs and an MLP classifier with a multi-head self-attention (MHSA) fusion mechanism, the proposed model’s inference time per image stays within the millisecond range—specifically, 0.0036 to 0.0051 s across datasets. These values are only marginally higher than those of EfficientNetB0 and EfficientNetV2M, and are even comparable to Xception, a single but deeper CNN. This shows that although the model is architecturally richer, it still maintains acceptable real-time classification performance, making it deployable in many clinical workflows where image preparation and scanning often dominate time consumption.
MHSA fusion adds minimal time—but unlocks significant gains
The additional time required by the MHSA-based feature fusion stage (0.0048–0.0067 s per image) is modest, yet it yields substantial gains in classification accuracy and robustness. MHSA dynamically weights the features extracted from each base model, allowing the network to simulate a "committee of experts" that combines low-level texture information with high-level semantic cues. This produces more discriminative representations, which is especially useful on noisy or complex datasets such as Chaoyang. Importantly, the performance gain (~2% over the best individual CNN) is not merely additive but reflects a non-linear enhancement enabled by this targeted fusion.
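To make this fusion step concrete, the following is a minimal sketch (not the authors' exact architecture) of how pooled feature vectors from the three fine-tuned CNNs could be fused with Keras' MultiHeadAttention layer before an MLP classifier. The projection width, head count, dropout rate, and two-class output are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 2                   # e.g. benign vs. malignant on EBHI (assumed)
FEAT_DIMS = (2048, 1280, 1280)    # pooled outputs of Xception, EfficientNetB0, EfficientNetV2M

def build_fusion_head(feat_dims=FEAT_DIMS, num_classes=NUM_CLASSES,
                      d_model=512, num_heads=8):
    # One pooled feature vector per base CNN.
    inputs = [tf.keras.Input(shape=(d,)) for d in feat_dims]
    # Project each vector to a common width and stack into a length-3 "sequence".
    tokens = [layers.Dense(d_model, activation="relu")(x) for x in inputs]
    seq = layers.Lambda(lambda t: tf.stack(t, axis=1))(tokens)  # (batch, 3, d_model)
    # Multi-head self-attention lets each model's features attend to the others'.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(seq, seq)
    attn = layers.LayerNormalization()(layers.Add()([attn, seq]))  # residual + norm
    fused = layers.Flatten()(attn)
    # MLP classifier on the fused representation.
    x = layers.Dense(256, activation="relu")(fused)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

fusion_model = build_fusion_head()
fusion_model.compile(optimizer="adam", loss="categorical_crossentropy",
                     metrics=["accuracy"])
# fusion_model.fit([xcep_feats, effb0_feats, effv2m_feats], labels, ...)
```

Because only three short "tokens" pass through the attention layer, the extra computation per image is small, which is consistent with the modest fusion times reported in Table 10.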
Source pretraining adds cost but aligns models with target tasks
Additionally, the source domain transfer learning stage adds another layer of time overhead, as each CNN is pre-trained with half of its layers frozen. This contributes roughly three times the pre-training cost compared to using a single CNN. However, this investment helps align the models more effectively with the target task, laying a stronger foundation for downstream classification.
A balanced accuracy–efficiency trade-off for clinical AI
When evaluated holistically, the model offers a favourable accuracy-efficiency trade-off. The computational cost, mainly incurred during offline training, is counterbalanced by consistent gains in predictive accuracy, stability, and generalizability. In high-stakes diagnostic environments, these gains can translate into fewer misclassifications, reduced reliance on expert annotations, and increased confidence in AI-assisted decision-making. For institutions with GPU-accelerated infrastructure, the training time is manageable; for low-resource environments, simplified variants (e.g., using two base models or removing MHSA) could offer flexible deployment options.
In conclusion, although the proposed model is not the fastest, it is demonstrably more effective. Its superior classification accuracy, combined with reasonably efficient inference performance, positions it as a practical and scalable solution for real-world histopathological image analysis, particularly in settings where accuracy and reliability are critical.
Although the proposed model delivers notable improvements in classification performance through domain-specific transfer learning and multi-head self-attention, these gains come with additional time and computational overhead. To mitigate these limitations, future work can focus on improving training efficiency and reducing model complexity without compromising accuracy. First, more efficient pre-training strategies (such as knowledge distillation and incremental learning) can be explored to shorten training time and enhance the adaptability of the model to new domains. Second, model compression techniques like pruning or quantization may be employed to reduce the overall number of parameters, lowering computational cost while maintaining robust performance. Additionally, adopting parallel programming techniques for pre-training multiple CNNs simultaneously can significantly reduce total execution time and make the pipeline more scalable. These directions hold promise for improving the model’s practical applicability in time-sensitive or resource-limited diagnostic settings.
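As a concrete illustration of one such direction, the snippet below sketches post-training weight quantization of a fine-tuned Keras model with the TensorFlow Lite converter. This is an assumed optimization path for resource-limited deployment, not part of the reported pipeline.

```python
# Hedged sketch: post-training weight quantization with the TFLite converter
# to shrink a fine-tuned model (illustrative only, not the paper's pipeline).
import tensorflow as tf

def quantize_keras_model(model, output_path="model_quantized.tflite"):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # default weight quantization
    tflite_bytes = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_bytes)
    return output_path

# quantize_keras_model(fine_tuned_model)  # placeholder model name
```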
Conclusion
This study presents a novel histopathological image classification framework that integrates domain-specific transfer learning with multi-model feature fusion, enhanced by a multi-head self-attention mechanism and tailored for colorectal cancer (CRC) analysis. By pre-training five deep neural networks on a domain-relevant source dataset and selecting the top three (Xception, EfficientNetB0, EfficientNetV2M) for fusion, the model captures rich, complementary features that are further refined through attention-based weighting and classified via an MLP classifier. Extensive experiments across three benchmark datasets demonstrate the model's superior performance, achieving up to 99.68% accuracy on EBHI, 86.72% on the noisy and imbalanced Chaoyang dataset, and 99.44% on COAD for fine-grained subtype classification. These results highlight the model's robustness in handling multi-scale variability, class imbalance, label noise, and subtype granularity. Overall, the proposed approach significantly outperforms existing methods, offering a powerful, generalizable solution for automated CRC histopathological image classification and contributing toward more accurate, efficient, and scalable cancer diagnosis.
Acknowledgements
The authors sincerely acknowledge the generous support from the Guangxi Key Laboratory of Big Data for providing the necessary equipment and funding for this project.
Author contributions
Conception and design: Q Ke, YC Hum, WS Yap; Administrative support: YC Hum, WS Yap, YJ Gan; Provision of study materials or patients: Q Ke, AQ Li, R Gao; Collection and assembly of data: Q Ke, YC Hum, YJ Gan, TS Tan; Data analysis and interpretation: Q Ke, AQ Li, HN, HM; Manuscript writing: All authors; Final approval of manuscript: All authors.
Funding
This study was supported by Guangxi First-Class Discipline Statistics Construction Project Fund, and Guangxi Higher Education Institutions Young and Middle-aged Teachers’ Basic Research Capacity Enhancement Project (No. 2024KY0669).
Data availability
All datasets used in this study are publicly accessible. EBHI is available at https://doi.org/10.6084/m9.figshare.16999363.v1; Chaoyang is available at https://bupt-ai-cz.github.io/HSA-NRL/; COAD, NCT-CRC-HE-100K, LC25000, and Herlev are accessible via https://www.kaggle.com/datasets. For any further information or requests regarding the data used in this study, please contact the corresponding author Yan Chai Hum at [email protected].
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Chhikara, B. S. & Parang, K. Global cancer statistics 2022: the trends projection analysis. Chem. Biol. Lett. 451 (2023).
2. Ferlay, J. et al. Cancer statistics for the year 2020: an overview. Int. J. Cancer 149, 778–789 (2021). https://doi.org/10.1002/ijc.33588
3. Li, X. et al. A comprehensive review of computer-aided whole-slide image analysis: from datasets to feature extraction, segmentation, classification and detection approaches. Artif. Intell. Rev. 55, 4809–4878 (2022). https://doi.org/10.1007/s10462-021-10121-0
4. Li, J. et al. DARC: Deep adaptive regularized clustering for histopathological image classification. Med. Image Anal. 80, 102521 (2022). https://doi.org/10.1016/j.media.2022.102521
5. Ben Hamida, A. et al. Deep learning for colon cancer histopathological images analysis. Comput. Biol. Med. 136, 104730 (2021). https://doi.org/10.1016/j.compbiomed.2021.104730
6. Wang, Q. et al. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11534–11542 (2020).
7. Zhang, Q.-L. & Yang, Y.-B. SA-Net: Shuffle attention for deep convolutional neural networks. In ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2235–2239 (2021).
8. Koo, J. C. et al. Deep machine learning histopathological image analysis for renal cancer detection. In Proceedings of the 8th International Conference on Computing and Artificial Intelligence 657–663 (ACM, New York, NY, USA, 2022).
9. Karthikeyan, A., Jothilakshmi, S. & Suthir, S. Colorectal cancer detection based on convolutional neural networks (CNN) and ranking algorithm. Meas. Sens. 31, 100976 (2024). https://doi.org/10.1016/j.measen.2023.100976
10. Kumar, A., Vishwakarma, A. & Bajaj, V. CRCCN-Net: Automated framework for classification of colorectal tissue using histopathological images. Biomed. Signal Process. Control 79, 104172 (2023). https://doi.org/10.1016/j.bspc.2022.104172
11. Neto, P. C. et al. An interpretable machine learning system for colorectal cancer diagnosis from pathology slides. NPJ Precis. Oncol. 8, 56 (2024). https://doi.org/10.1038/s41698-024-00539-4
12. Voon, W., Hum, Y. C., Tee, Y. K. et al. Evaluating the effectiveness of stain normalization techniques in automated grading of invasive ductal carcinoma histopathological images. Sci. Rep. 13, 20518 (2023). https://doi.org/10.1038/s41598-023-46619-6
13. Chen, W., Li, X., Gao, L. & Shen, W. Improving computer-aided cervical cells classification using transfer learning based snapshot ensemble. Appl. Sci. 10, 7292 (2020). https://doi.org/10.3390/app10207292
14. D'souza, R. N., Huang, P.-Y. & Yeh, F.-C. Structural analysis and optimization of convolutional neural networks with a small sample size. Sci. Rep. 10, 834 (2020). https://doi.org/10.1038/s41598-020-57866-2
15. Cherti, M. & Jitsev, J. Effect of pre-training scale on intra- and inter-domain, full and few-shot transfer learning for natural and X-ray chest images. In 2022 International Joint Conference on Neural Networks (IJCNN) 1–9 (IEEE, 2022).
16. Yong, M. P. et al. Histopathological cancer detection using intra-domain transfer learning and ensemble learning. IEEE Access 12, 1434–1457 (2023). https://doi.org/10.1109/ACCESS.2023.3343465
17. Wang, J., Chen, Y., Yu, H., Huang, M. & Yang, Q. Easy transfer learning by exploiting intra-domain structures. In 2019 IEEE International Conference on Multimedia and Expo (ICME) 1210–1215 (2019).
18. Cao, Z. et al. A novel colorectal histopathological image classification method based on progressive multi-granularity feature fusion of patch. IEEE Access 12, 68981–68998 (2024). https://doi.org/10.1109/ACCESS.2024.3401240
19. Liang, M., Ren, Z., Yang, J., Feng, W. & Li, B. Identification of colon cancer using multi-scale feature fusion convolutional neural network based on shearlet transform. IEEE Access 8, 208969–208977 (2020). https://doi.org/10.1109/ACCESS.2020.3038764
20. Li, S. et al. Multi-scale high and low feature fusion attention network for intestinal image classification. Signal Image Video Process. 17, 2877–2886 (2023). https://doi.org/10.1007/s11760-023-02507-0
21. Pedro, R. & Oliveira, A. L. Assessing the impact of attention and self-attention mechanisms on the classification of skin lesions. In 2022 International Joint Conference on Neural Networks (IJCNN) 1–8 (IEEE, 2022).
22. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
23. Wang, Y. et al. Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 79, 104206 (2023). https://doi.org/10.1016/j.bspc.2022.104206
24. Huang, N.-Y. & Liu, C.-X. Efficient tumor detection and classification model based on ViT in an end-to-end architecture. IEEE Access 12, 106096–106106 (2024). https://doi.org/10.1109/ACCESS.2024.3424294
25. Hu, W. et al. EBHI: A new enteroscope biopsy histopathological H&E image dataset for image classification evaluation. Phys. Med. 107 (2023).
26. Zhu, C., Chen, W., Peng, T., Wang, Y. & Jin, M. Hard sample aware noise robust learning for histopathology image classification. IEEE Trans. Med. Imaging 41, 881–894 (2022). https://doi.org/10.1109/TMI.2021.3125459
27. Jiao, Y., Li, J., Qian, C. & Fei, S. Deep learning-based tumor microenvironment analysis in colon adenocarcinoma histopathological whole-slide images. Comput. Methods Programs Biomed. 204, 106047 (2021). https://doi.org/10.1016/j.cmpb.2021.106047
28. Kather, J. N., Halama, N. & Marx, A. 100,000 histological images of human colorectal cancer and healthy tissue. Zenodo (2018).
29. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
30. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (2017).
31. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (2017).
32. Tan, M. & Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning 6105–6114 (PMLR, 2019).
33. Tan, M. & Le, Q. V. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning 10096–10106 (PMLR, 2021).
34. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010). https://doi.org/10.1109/TKDE.2009.191
35. Khan, S. I., Shahrior, A., Karim, R., Hasan, M. & Rahman, A. MultiNet: a deep neural network approach for detecting breast cancer through multi-scale feature fusion. J. King Saud Univ. Comput. Inf. Sci. 34, 6217–6228 (2022). https://doi.org/10.1016/j.jksuci.2021.08.004
36. Borkowski, A. A. et al. Lung and colon cancer histopathological image dataset (LC25000). arXiv preprint (2019).
37. Jantzen, J. The Pap smear benchmark. Nature Inspired Smart Information Systems 1–9 (2006).
38. Mehmood, S. et al. Malignancy detection in lung and colon histopathology images using transfer learning with class selective image processing. IEEE Access 10, 25657–25668 (2022). https://doi.org/10.1109/ACCESS.2022.3150924
39. Omar, L. Th., Hussein, J. M., Omer, L. F., Qadir, A. M. & Ghareb, M. I. Lung and colon cancer detection using weighted average ensemble transfer learning. In 2023 11th International Symposium on Digital Forensics and Security (ISDFS) 1–7 (IEEE, 2023).
40. Savaş, S. & Güler, O. Ensemble learning based lung and colon cancer classification with pre-trained deep neural networks. Health Technol. 15, 105–117 (2025). https://doi.org/10.1007/s12553-024-00911-1
41. Rahaman, M. M. et al. DeepCervix: a deep learning-based framework for the classification of cervical cells using hybrid deep feature fusion techniques. Comput. Biol. Med. 136, 104649 (2021). https://doi.org/10.1016/j.compbiomed.2021.104649
42. Xue, D. et al. An application of transfer learning and ensemble learning techniques for cervical histopathology image classification. IEEE Access 8, 104603–104618 (2020). https://doi.org/10.1109/ACCESS.2020.2999816
43. Zhang, Y., Ning, C. & Yang, W. An automatic cervical cell classification model based on improved DenseNet121. Sci. Rep. 15, 3240 (2025). https://doi.org/10.1038/s41598-025-87953-1
44. Yengec-Tasdemir, S. B., Aydin, Z., Akay, E., Dogan, S. & Yilmaz, B. Improved classification of colorectal polyps on histopathological images with ensemble learning and stain normalization. Comput. Methods Programs Biomed. 232, 107441 (2023). https://doi.org/10.1016/j.cmpb.2023.107441
45. Singh Bisht, A., Ajay, A. & Karthik, R. DeepCRC-Net: an attention-driven deep learning network for colorectal cancer classification using Xception and efficient lightweight local feature fusion networks. IEEE Access 13, 49362–49374 (2025). https://doi.org/10.1109/ACCESS.2025.3550004
46. He, Z. et al. PLTN: Noisy label learning in long-tailed medical images with adaptive prototypes. Neurocomputing 645, 130514 (2025). https://doi.org/10.1016/j.neucom.2025.130514
47. Tien, G.-Y. et al. Prediction of gene mutation from colorectal adenocarcinoma whole slide images via integrated deep learning pipeline. Cancer Res. 84, 4938 (2024). https://doi.org/10.1158/1538-7445.AM2024-4938
48. Bilal, M. et al. AI based pre-screening of large bowel cancer via weakly supervised learning of colorectal biopsy histology images. medRxiv 22271565 (2022).
49. Yu, J. et al. HiViT: Hierarchical attention-based Transformer for multi-scale whole slide histopathological image classification. Expert Syst. Appl. 277, 127164 (2025). https://doi.org/10.1016/j.eswa.2025.127164
Abstract
Colorectal cancer (CRC) poses a significant global health burden, where early and accurate diagnosis is vital to improving patient outcomes. However, the structural complexity of CRC histopathological images renders manual analysis time-consuming and error-prone. This study aims to develop an automated deep learning framework that enhances classification accuracy and efficiency in CRC diagnosis. The proposed model integrates domain-specific transfer learning and multi-model feature fusion to address challenges such as multi-scale structures, noisy labels, class imbalance, and fine-grained subtype classification. The model first applies domain-specific transfer learning to extract highly relevant features from histopathological images. A multi-head self-attention mechanism then fuses features from multiple pre-trained models, followed by a multilayer perceptron (MLP) classifier for final prediction. The framework was evaluated on three publicly available CRC datasets: EBHI, Chaoyang, and COAD. The model achieved a classification accuracy of 99.68% on the EBHI dataset (200 × subset), 86.72% on the Chaoyang dataset, and 99.44% on the COAD dataset. These results demonstrate strong generalization across diverse and complex histopathological image conditions. This study highlights the effectiveness of combining domain-specific transfer learning with multi-model feature fusion and attention mechanisms for CRC classification. The proposed model offers a reliable and efficient tool to support pathologists in diagnostic workflows, with the potential to reduce manual workload and improve diagnostic consistency.
Author affiliations
1 School of Big Data and Artificial Intelligence, Guangxi University of Finance and Economics, Nanning, China; Department of Mechatronics and Biomedical Engineering, Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Kampar, Malaysia; Department of Electrical and Electronic Engineering, Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Kampar, Malaysia
2 Department of Mechatronics and Biomedical Engineering, Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Kampar, Malaysia
3 Department of Electrical and Electronic Engineering, Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Kampar, Malaysia
4 Department of Biomedical Engineering & Health Sciences, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, UTM Johor Bahru, 81310 Johor Bahru, Johor, Malaysia
5 Department of Electronic Engineering, Faculty of Engineering and Green Technology, Universiti Tunku Abdul Rahman, Kampar, Malaysia
6 Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Luleå, Sweden
7 School of Big Data and Artificial Intelligence, Guangxi University of Finance and Economics, Nanning, China
8 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, Northern Ireland, UK




