Introduction
Gastrointestinal (GI) disorders are among the most serious diseases affecting people worldwide. They include tumors, ulcerative colitis, irritable bowel syndrome, hemorrhoids, Helicobacter pylori infections, Crohn’s disease, polyps, and colorectal cancer, and they rank among the leading causes of morbidity and mortality. According to the American Cancer Society, approximately 76,940 people died of stomach cancers in the United States in 2016. Early and accurate diagnosis of GI disorders is crucial for timely intervention and improved patient recovery. During such diagnostic procedures, endoscopy is an essential tool for visualizing the inner wall of the GI tract and highlighting abnormalities such as lesions, ulcers, or tumors1. However, the subjective nature of endoscopic diagnosis and the vast amount of data generated during the procedure make interpretation challenging for clinicians. Recent advances in artificial intelligence, especially deep learning (DL), offer promising solutions to these diagnostic challenges2. Research on computer-aided diagnosis (CAD) tools that employ deep learning algorithms for the automated detection and classification of gastrointestinal abnormalities is gaining momentum, and convolutional neural networks (CNNs) remain the key technology in this domain owing to their ability to analyze and interpret such complex medical images3. CNNs trained on large datasets can identify subtle patterns in endoscopic images, improving on traditional automated GI diagnosis methods that rely on handcrafted features4. DL-based CAD tools applied to endoscopic video capture a greater diversity of data at higher resolution, enabling more comprehensive analysis and overcoming the shortcomings of earlier automated methods5. Wireless capsule endoscopy (WCE) extends these diagnostic capabilities by combining spatial features with the temporal sequence of events for specific abnormalities such as polyps and ulcers, making it a complement to traditional endoscopy. In this minimally invasive procedure, the patient swallows a capsule containing a camera and a transmitter, which captures as many as 120,000 images of the GI tract6. The method is particularly effective for visualizing the small intestine, which is difficult to access with conventional endoscopy. However, only 5–7% of the images usually contain clinically relevant findings, so manual review is time-consuming and error-prone. CAD systems based on CNNs provide an efficient solution by automatically analyzing WCE images to identify and classify abnormalities with high accuracy. DL-based CAD tools extend beyond endoscopy to a wide range of medical imaging applications, including histopathology, MRI, and CT scans. These systems improve diagnostic accuracy while reducing the cognitive load on physicians. For instance, DL techniques have recently been used to extract latent information, such as molecular expression data, from medical images; such information could not be interpreted without DL, and importantly, no additional samples are required because the data are already part of the routine clinical workflow7,8.
Although remarkable progress has been made, existing DL-based CAD tools for GI diagnostics remain imperfect. Most methods focus on specific conditions, such as tumor or polyp detection, and do not cover the whole spectrum of GI disorders. Moreover, many tools rely only on spatial information from static images and neglect the valuable temporal information contained in endoscopic videos. By combining spatial and temporal analysis, future CAD systems can reach even higher diagnostic accuracy and offer more robust support for clinical decision-making9.
The integration of AI into GI diagnostics is further underscored by recent regulatory milestones, such as the Breakthrough Device Designation awarded by the United States Food and Drug Administration (FDA) to a DL-based real-time diagnostic software for gastric cancer, reflecting growing confidence in the ability of AI technologies to transform medical practice. As these tools continue to develop, they promise both to improve diagnostic efficiency and to democratize access to quality healthcare by reducing variability in medical expertise10. Such technologies are poised to revolutionize the detection and management of GI diseases, thereby improving patient outcomes and reducing the global burden associated with these conditions. Vision Transformers (ViTs) are novel end-to-end approaches to GI diagnostics that address the limitations of traditional CNNs when processing complex visual data such as endoscopic videos and WCE images. Whereas CNNs are restricted to fixed-size filters and often fail to capture global contextual relationships, ViTs concentrate on relevant regions using self-attention mechanisms while preserving the overall data structure. This broader awareness, combined with the capacity to integrate spatial and temporal information, makes ViTs highly effective at detecting dynamic GI abnormalities such as bleeding, ulcers, polyps, and tumors. ViTs perform best on large datasets, where they learn features automatically without manual input, and they generalize well across imaging modalities and patient populations. They also bring considerable advances in polyp and lesion detection, ulcer classification, cancer screening, and the temporal analysis of WCE videos. Challenges persist, however, in the form of high data requirements, computational complexity, and the need for interpretability. Nonetheless, ViTs show great promise for enhancing diagnostic accuracy, streamlining workflows, and ultimately improving patient outcomes in GI diagnostics, opening the door to more reliable and innovative CAD systems. The goal of this study is to combine the strengths of advanced deep learning models while addressing their limitations. We achieve this by integrating EfficientNetB0 and MobileNet with ViT in a framework designed to classify GI diseases effectively. Our contributions are as follows:
We developed a hybrid model, EfficientViT, capable of classifying eight types of GI disease. In Experiment I, the base model EfficientNetB0 extracts rich local features and the ViT block provides improved global context; in Experiment II, we replace EfficientNetB0 with MobileNet.
Our approach combines the powerful feature extraction of EfficientNetB0 and MobileNet, which capture local spatial detail, with the multi-head attention of the ViT block, which provides global attention across the different regions of the GI disease images.
Finally, we fuse the features of the base model with the transformer encoder features and apply a dense layer for classification. EfficientViT excels at identifying subtle differences in GI disease features, enabling high accuracy. Compared with state-of-the-art algorithms, our models consistently perform better, demonstrating improved overall accuracy and strong generalizability on the Kvasir GI endoscopy dataset.
Related work in the context of GI disease classification is discussed in Section "Related work". We describe the architecture of the hybrid EfficientViT model in Section "Research methodology". The findings of the experiments are presented in Section "Results", followed by a discussion of their implications in Section "Discussion and analysis of results" and a conclusion with future scope in Section "Conclusion".
Related work
The application of deep learning (DL) and artificial intelligence (AI) to the diagnosis of gastrointestinal (GI) abnormalities has advanced significantly over the last few years. With the increasing prevalence of GI disorders, accurate, efficient, and early detection methods are crucial. Traditional diagnostic techniques, though effective, often involve complex procedures that are time-consuming and require expertise, making AI-assisted approaches essential. Obayya et al.8 proposed MSSADL-GITDC, a Modified Salp Swarm Algorithm integrated with a deep learning framework for classifying GI tract disease from WCE images. Classification was performed with a DBN-ELM model incorporating an Extreme Learning Machine, fine-tuned via backpropagation, and the approach achieved a classification accuracy of 98.03% on the Kvasir-V2 dataset.
Fati et al.9 proposed multiple hybrid diagnostic systems based on endoscopic images for the early diagnosis of lower GI tract diseases. The techniques used pre-trained CNNs (GoogLeNet and AlexNet) for feature extraction and classification, and the hybrid model combining GoogLeNet with LBP, GLCM, and FCH features achieved the highest accuracy of 99.3%. Owais et al.10 proposed a two-stage diagnostic framework that couples a classification network with a case-retrieval system: the first stage classifies the disease type, and the second searches a database for similar cases to support subjective validation by medical experts. Evaluated on two datasets containing 52,471 endoscopic frames, the framework achieved an accuracy of 96.19% and an F1-score of 96.99%, demonstrating strong diagnostic efficiency across 37 GI disease classes. Attallah et al.11 proposed GASTRO-CADx, a three-stage framework combining CNNs with the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) for feature extraction and reduction; the final stage fuses features to optimize classification. GASTRO-CADx obtained accuracies of 97.3% and 99.7% on two datasets, surpassing other techniques in GI disease classification. Iqbal et al.12 developed a deep CNN with multi-route convolutional layers and a resolution-dependent architecture to enhance classification accuracy, achieving an MCC of 0.9743 and superior performance for automated detection of GI tract abnormalities on the Kvasir dataset. Komeda et al.13 applied Residual Networks (ResNet) to classify adenomatous and non-adenomatous colorectal polyps from 127,610 annotated endoscopic images; the model achieved a sensitivity of 98.8% and an accuracy of 92.8%, effectively supporting AI-based GI diagnostics.
Wang et al.14 proposed a two-stage classification scheme combining CNNs with Capsule Networks to address deformation invariance in gastrointestinal endoscopy images; it achieved accuracies of 94.83% and 85.99% on the KvasirV2 and Hyper Kvasir datasets, respectively. Hossain et al.15 proposed Deep Poly, a system combining a Double U-Net for polyp segmentation with a Vision Transformer (ViT) for risk assessment; the model achieved a mean Dice coefficient of 0.956 for segmentation on the Kvasir-SEG dataset and classified polyps with 99% accuracy, supporting autonomous colorectal cancer (CRC) screening. Naz et al.16 proposed a hybrid method for the recognition and classification of GI abnormalities using deep learning and feature optimization, based on transfer learning with ResNet18 and XcepNet23 models; it attained accuracies of 100% and 99.24% on a hybrid dataset and KvasirV1, respectively. Mahmood et al.17 presented the GI Disease-Detection Network, a deep learning model designed for classifying peptic ulcers and other GI tract disorders from WCE images; it achieved 98.9% accuracy and proved more robust than traditional models at identifying various GI disorders. Buendgens et al.18 took an innovative approach using weakly supervised AI for the analysis of GI endoscopy, obtaining AUROC scores above 0.70 for 13 diseases and above 0.80 for two diseases. For early detection of GI cancer, Wafa et al.19 designed a deep learning approach integrating CNNs and RNNs to classify MSI and MSS patterns in histopathology data; their CNNs-Simple RNN-LSTM-GRU hybrid model achieved 99.90% accuracy, setting a new benchmark for early GI cancer detection and pointing to deep learning as a future resource for personalized treatment planning.
Xiao et al.20 designed a deep learning framework for classifying capsule gastroscopy images into three categories: normal gastroscopic images, chronic erosive gastritis, and gastric ulcer images. Using the pre-trained VGG-16, ResNet-50, and Inception V3 models fine-tuned for this task, the system achieved 94.80% accuracy. For gastrointestinal cancer, Almarshad et al.21 proposed a computer-aided diagnosis model based on Capsule Networks (CapsNet) optimized with a snake optimization algorithm (SOADL-GCC); the model achieved a classification accuracy of 99.72% on the Kvasir dataset, outperforming other state-of-the-art models. Khan et al.22 proposed a framework for analyzing gastrointestinal diseases in WCE images that fuses and selects features from a fine-tuned VGG16 model with transfer learning, detects ulcers using saliency, and refines the features with Particle Swarm Optimization; the accuracy reached 98.4%, exceeding established state-of-the-art techniques. Tables 1, 2 and 3 summarize the literature.
Table 1. CNN-based models for GI disease classification.
Author | Disease type | Model/technique | Dataset | Accuracy (%) |
---|---|---|---|---|
Komeda et al.13 | Colorectal polyps | ResNet | 127,610 images | 92.80 |
Attallah et al.11 | GI diseases | CNN + DWT + DCT (GASTRO-CADx) | 2 datasets | 97.3, 99.7 |
Iqbal et al.12 | GI tract abnormalities | Deep CNN with multi-route conv layers | Kvasir | MCC: 0.9743 |
Wang et al.14 | GI diseases | CNN + Capsule network | Kvasir-v2, Hyper Kvasir | 94.83/85.99 |
Xiao et al.20 | Gastric diseases | VGG16, ResNet50, InceptionV3 (TL models) | 380 images/class | 94.80 |
Mahmood et al.17 | Peptic ulcers | CNN (GIDD-Net) | Kvasir-Capsule | 98.90 |
Table 2. Transformer-based or transformer-dominant models.
Author | Disease type | Model/technique | Dataset | Accuracy (%) |
---|---|---|---|---|
Hossain et al.15 | Colorectal polyps | Double U-Net (seg.) + ViT (risk assess.) | Kvasir-SEG | 99.00 |
Buendgens et al.18 | Various GI diseases | Weakly supervised AI (transformer focus) | 48,000 clinical images | AUROC > 0.70 |
Table 3. Hybrid CNN–transformer or optimized ensemble models.
Author | Disease type | Model/Technique | Dataset | Accuracy (%) |
---|---|---|---|---|
Obayya et al.8 | GI Tract disease | MSSADL-GITDC (DL + modified swarm) | Kvasir-V2 | 98.03 |
Fati et al.9 | Lower GI diseases | CNN + ANN + LBP/GLCM/FCH | Not available | 99.30 |
Owais et al.10 | Multiple GI conditions | CNN + Retrieval system | 52,471 frames | 96.19 |
Naz et al.16 | GI abnormalities | ResNet18 + XcepNet23 (Hybrid) | KvasirV1 + Hybrid | 99.24 |
Wafa et al.19 | GI cancer | CNN + RNN-LSTM-GRU | TCGA, Kaggle | 99.90 |
Almarshad et al.21 | GI cancer | CapsuleNet + Snake Optimizer (SOADL-GCC) | Kvasir | 99.72 |
Khan et al.22 | GI diseases | VGG16 + PSO for Feature Selection | Private WCE dataset | 98.40 |
Faysal et al.32 | GI diseases | PD-CNN-PCC-EELM | 8,000 images GastroVision | 98.92 |
Faysal et al.33 | GI diseases | PSE-CNN-PCA-DELM | GastroVision | 97.24 |
Hussain et al.34 | Brain tumor | EFFResNet-ViT | Brain Tumor CE-MRI; Retinal | 99.31%; 92.54% |
Hussain et al.35 | skin lesions | MAGRes-UNet | BT T1-CE-MRI; HAM10000 | 98.46 |
Hussain et al.36 | colorectal cancer (CRC) | DCSSGA-UNet | CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG | mIoU: 95.67%, 92.39%, 93.97%; mDice: 98.85%, 95.71%, 96.10% |
However, a critical gap persists in terms of generalizability, local–global feature integration, and balanced performance across multiple, visually similar GI disease categories. Our research addresses this gap by introducing EfficientViT, a hybrid deep learning framework that strategically combines the local feature extraction capability of EfficientNetB0 with the global attention mechanism of Vision Transformers (ViT).
While prior works have shown success either using CNNs or ViTs independently, most are limited to either spatial feature extraction or global pattern recognition. Furthermore, many existing methods are tailored to specific disease types or datasets, potentially limiting their robustness when classifying diverse GI conditions from endoscopic imagery.
Our proposed model overcomes these limitations in the following ways:
By separating the feature maps into two streams (local via EfficientNetB0 and global via the ViT encoder), we ensure simultaneous capture of fine-grained textures and contextual patterns, addressing the subtle distinctions among closely related GI classes.
Unlike the simple concatenation seen in previous models, our fusion strategy tightly integrates the spatial and attention-based features before classification, which is validated through our ablation study (Section "Ablation study").
We demonstrate consistently high performance across all eight GI disease categories in the Kvasir dataset, achieving an average accuracy of 99.82% with minimal variance across folds, outperforming state-of-the-art models (Table 6).
Our work not only reports standard metrics but also includes class-wise performance, confusion matrices, and ROC analysis to show robustness, especially in differentiating diseases with overlapping visual traits (like normal-z-line vs. esophagitis).
Research methodology
In this section, we present the proposed hybrid DL framework, which combines the strengths of EfficientNet-B0 and the Vision Transformer, as shown in the architecture in Fig. 1, to categorize eight varieties of gastrointestinal (GI) disease. The system thus leverages the best that each model can provide and delivers state-of-the-art performance on the GI disease classification task.
[See PDF for image]
Fig. 1
Architecture of the proposed method for gastrointestinal diseases.
Figure 1 represents the complete methodology of our experiment, from data collection to pre-processing, where each image is normalized to 224 × 224 pixels so that the model can process it properly. The CNN-based transfer learning model EfficientNetB0 is then used for spatial and local feature extraction, and the resulting feature map is fed to the transformer block to capture global dependencies, as shown in Fig. 1. Before passing the feature map to the encoder block, we apply convolutional embedding and positional embedding and divide the tokens T (Q, K, V) into two parts: one is processed by the encoder block and the other by the EfficientNetB0 path. Finally, we fuse the features of both paths and pass them to the classification head for disease classification.
Figure 2 is the sequence flow chart of the proposed architecture, showing the internal workings of the hybrid model and how each module interacts with the others. EfficientNet-B0 was selected as a convolutional backbone for its efficiency and its ability to extract high-quality spatial features with relatively few parameters; its compound scaling balances network depth, width, and resolution, which is important for medical image analysis because of its computational efficiency. MobileNet is another lightweight CNN chosen for efficiency in mobile and resource-constrained environments; its depth-wise separable convolutions allow it to process medical images efficiently while maintaining low computational complexity. Both models are highly capable of extracting fine spatial features from medical images, including changes in texture, color, and structure. To complement this feature-extraction ability, the ViT was introduced into the model. In contrast to CNNs, which depend on local filters, ViT relies on self-attention mechanisms that capture the global contextual relationships within an image, making it particularly effective at distinguishing subtle differences in endoscopic images, such as the morphological changes associated with GI diseases. Features extracted by EfficientNet-B0 and MobileNet were passed through the ViT transformer encoder, whose self-attention mechanisms refine these features further, allowing the model to focus on the most relevant aspects of the images while ignoring irrelevant noise and thereby improving classification accuracy.
[See PDF for image]
Fig. 2
Flow chart of the proposed architecture’s internal working.
EfficientNet-B0
EfficientNet is a CNN family that requires smaller models and fewer computational resources than previous architectures without compromising performance. It was released by Google Research in the 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", with a novel scaling method as its foundation. This method uniformly adjusts all dimensions (depth, width, and resolution) using a compound coefficient, allowing the model to achieve better efficiency and accuracy. EfficientNetB0 is designed for high performance and efficient parameter usage. Its architecture is defined by compound scaling, which balances the network's depth d, width w, and resolution r, and it is built from mobile inverted bottleneck convolution (MBConv) layers.
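For reference, the compound scaling rule introduced in the original EfficientNet paper can be written as follows, where φ is the compound coefficient and α, β, γ are constants chosen by a small grid search:

```latex
% Compound scaling (Tan & Le, 2019): a single coefficient \phi scales
% depth, width, and input resolution jointly.
\begin{aligned}
d &= \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi} \\
&\text{subject to } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
```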
The input is an image I ∈ R^(224×224×3), where 224 × 224 is the spatial resolution and 3 represents the RGB color channels.
Compound scaling balances the depth, width, and resolution of the CNN to achieve optimal performance at a fixed computational cost. The initial convolutional stem, which produces the first feature map from the input image, is shown in Eq. 1.
F1 = σ(W1 * I + b1)    (1)
where W1 is the filter kernel, b1 is the bias term, and F1 is the output feature map. EfficientNetB0 extracts features through a series of depth-wise separable convolutions and MBConv layers. Each MBConv block applies depth-wise, point-wise, and squeeze-and-excitation (activation squeeze) operations, as shown in Eqs. 2, 3, and 4.
Fd = σ(Wd * F1 + bd)  (depth-wise convolution)    (2)
Fp = σ(Wp * Fd + bp)  (point-wise 1 × 1 convolution)    (3)
FSE = Fp ⊙ SE(Fp)  (squeeze-and-excitation gating)    (4)
These blocks Fd (depth-wise), Fp (point-wise), and FSE (squeeze-and-excitation) are repeated across different stages, and the output of each serves as input to the next, as shown in Eqs. 2, 3, and 4. The final output is a feature map FEffNetB0, calculated as shown in Eq. 5.
FEffNetB0 = FSE(I; θ)    (5)
where I is the input image tensor, FSE denotes the stacked MBConv blocks whose output forms the final feature map, and θ are the parameters of EfficientNetB0.
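As an illustration of this feature-extraction stage, the following minimal Keras sketch (not the authors' exact code) instantiates EfficientNetB0 without its classification head so that it returns the final feature map for a 224 × 224 × 3 input; the use of ImageNet weights and the model name are assumptions.

```python
import tensorflow as tf

# EfficientNetB0 without its classification head acts purely as a feature
# extractor: for a 224 x 224 x 3 input it returns the final MBConv feature map
# F_EffNetB0 of spatial size 7 x 7 with 1280 channels.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights="imagenet",        # transfer-learning initialization (assumed)
    input_shape=(224, 224, 3),
)

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)                      # (None, 7, 7, 1280)
feature_extractor = tf.keras.Model(inputs, features, name="effnetb0_features")
feature_extractor.summary()
```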
Vision transformer (ViT)
ViT is a transformer model designed specifically for computer vision. Whereas standard transformers tokenize text, ViT splits the input image into patches, maps each patch into a vector, and reduces its dimensionality with a single matrix multiplication. The resulting vector embeddings are fed to a transformer encoder, which treats them just like token embeddings. ViT emerged as an alternative to CNNs in computer vision, with different inductive biases, training stabilities, and data efficiencies. First, the input image is divided into patches of size P × P; each patch is flattened and embedded into a vector of size D, and a positional embedding is added, as shown in Eq. 6.
T = [x1E; x2E; …; xNE] + Epos    (6)
This maps the spatial dimensions into token embeddings T ∈ R^(N×d), where N is the number of tokens and d is the embedding size. The embedded patches are passed through transformer blocks, which consist of multi-head self-attention (MSA) and feed-forward networks (FFN), as shown in Eqs. 7 and 8.
T′ = MSA(LN(T)) + T    (7)
T″ = FFN(LN(T′)) + T′    (8)
Here LN denotes layer normalization. The ViT outputs a classification token (CLS) or patch embeddings for further processing. In our model, however, before feeding T into the MSA we divide it into two parts, q1 and q2.
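A minimal Keras sketch of one such encoder block is given below; it assumes pre-layer-normalization residual sub-layers as in Eqs. 7 and 8 and reuses the embedding size (128), head count (12), and FFN width (256) reported in the model-parameter description later in this section. It is illustrative, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder_block(tokens, embed_dim=128, num_heads=12, ff_dim=256):
    """One ViT encoder block: MSA and FFN sub-layers with residual connections
    and layer normalization, following Eqs. 7 and 8."""
    # Eq. 7: T' = MSA(LN(T)) + T
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)
    x = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads)(x, x)
    x = layers.Add()([x, tokens])
    # Eq. 8: T'' = FFN(LN(T')) + T'
    y = layers.LayerNormalization(epsilon=1e-6)(x)
    y = layers.Dense(ff_dim, activation="gelu")(y)
    y = layers.Dense(embed_dim)(y)
    return layers.Add()([y, x])

# Example: a sequence of 49 tokens (a 7 x 7 feature grid) with 128-d embeddings.
tokens_in = tf.keras.Input(shape=(49, 128))
encoder = tf.keras.Model(tokens_in, transformer_encoder_block(tokens_in))
encoder.summary()
```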
Combining EfficientNetB0 and ViT (EfficientViT)
The integration of EfficientNet-B0 and ViT in transfer learning uses EfficientNet as a feature extractor, as shown in Eq. 5, followed by ViT for classification and additional processing. FEffNet serves as input to the ViT: the feature maps are split into patches for the ViT, a step called tokenization, as shown in Eq. 6.
Initially, the feature maps derived from the EfficientNet model are passed to a 2D convolution block to create tokens, using Eq. 6. From the tokens, the query (Q), key (K), and value (V) are generated as shown in Eq. 9.
Q = T·WQ,  K = T·WK,  V = T·WV    (9)
where WQ, WK, and WV are learned projection matrices. The query Q is then partitioned into two parts, q1 and q2. After that, q2 is passed to the transformer and q1 is kept on the EfficientNetB0 path; this split reduces computational resource use significantly. The tokenized features are processed through the ViT transformer encoder using MSA, as represented in Eq. 10.
Atten(q2) = softmax(q2·K^T / √dk)·V    (10)
Here, Atten(q2) is the output of the multi-head self-attention block, and it works as input for the feed-forward network layer in the encoder, as shown in Eq. 11.
FFN(Atten(q2)) = W2·σ(W1·Atten(q2) + b1) + b2    (11)
where W1, W2, b1, and b2 are the weights and biases of the feed-forward network (FFN). The final transformer output is given in Eq. 12.
ZViT = LN(Atten(q2) + FFN(Atten(q2)))    (12)
The local features from q1 and the global features from the transformer path q2 are concatenated as shown in Eq. 13.
Zfused = Concat(q1, ZViT)    (13)
A classification head is added to produce the final prediction. The fused features Zfused are passed through a fully connected layer with softmax activation, as shown in Eq. 14.
ŷ = softmax(W·Zfused + b)    (14)
The model's parameters are optimized using the categorical cross-entropy loss for classification, shown in Eq. 15.
L = −Σi yi·log(ŷi)    (15)
Several key parameters shape how the EfficientViT model works. The patch-embedding layer splits the image into 16 × 16 patches (patch size 16), giving a good balance between detail and efficiency. Each patch is then represented with an embedding dimension of 128 (embed_dim), allowing the model to capture complex patterns. The transformer encoder block processes these patches with an embedding size of 128, 12 attention heads (num_heads) to examine the data from different perspectives, and a feed-forward network dimension of 256 (ff_dim) to model deeper relationships within the data. The CNN-transformer interaction module combines features from the EfficientNet model, which extracts 128-dimensional features (efficient_dim), with those of the transformer, which also works with 128-dimensional features (transformer_dim). Finally, the hybrid model structure is implemented for a classification task with eight output classes (num_classes), using two dense layers with ReLU followed by a final dense layer with softmax activation for prediction. Together, these parameters balance model performance, computational load, and accuracy for image classification.
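Putting these pieces together, the following condensed sketch shows one way such a hybrid forward pass could be wired in Keras, reusing the transformer_encoder_block sketch from the ViT section above. The token split, pooling choice, and layer names are illustrative assumptions rather than the authors' released code; only the headline hyperparameters (128-dimensional embedding, 12 heads, FFN width 256, 8 output classes) follow the description above, and MobileNet can be swapped in as the backbone for the second experiment.

```python
import tensorflow as tf
from tensorflow.keras import layers

# transformer_encoder_block(...) is the sketch from the ViT section above.
NUM_CLASSES, EMBED_DIM, NUM_HEADS, FF_DIM, NUM_TOKENS = 8, 128, 12, 256, 49


class AddPositionEmbedding(layers.Layer):
    """Adds a learned positional embedding to every token (Eq. 6)."""
    def __init__(self, num_tokens, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.pos = layers.Embedding(num_tokens, embed_dim)
        self.num_tokens = num_tokens

    def call(self, tokens):
        return tokens + self.pos(tf.range(self.num_tokens))


inputs = tf.keras.Input(shape=(224, 224, 3))

# 1) Local spatial features from the CNN backbone.
fmap = tf.keras.applications.EfficientNetB0(include_top=False,
                                            weights="imagenet")(inputs)  # (None, 7, 7, 1280)

# 2) Tokenization: a 2D convolution embeds the feature map into 128-d tokens.
tok = layers.Conv2D(EMBED_DIM, kernel_size=1)(fmap)      # (None, 7, 7, 128)
tok = layers.Reshape((NUM_TOKENS, EMBED_DIM))(tok)       # (None, 49, 128)
tok = AddPositionEmbedding(NUM_TOKENS, EMBED_DIM)(tok)

# 3) Split the token stream: q1 stays on the local path, q2 is refined by the
#    transformer encoder (Eqs. 9-12).
q1, q2 = tok[:, : NUM_TOKENS // 2, :], tok[:, NUM_TOKENS // 2:, :]
z_vit = transformer_encoder_block(q2, EMBED_DIM, NUM_HEADS, FF_DIM)

# 4) Fuse local and global features, pool, and classify (Eqs. 13-15).
fused = layers.Concatenate(axis=1)([q1, z_vit])
fused = layers.GlobalAveragePooling1D()(fused)
fused = layers.Dense(256, activation="relu")(fused)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = tf.keras.Model(inputs, outputs, name="efficientvit_sketch")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```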
Table 4 summarizes the model combining EfficientNet-B0 and the Vision Transformer (ViT) for classifying images into 8 classes. It first defines an input layer for 224 × 224 × 3 images. The meaningful features extracted by EfficientNet are then fed through the ViT as a token sequence to infer relationships across different parts of the image. The model contains about 5.1 million parameters, most of which are trainable (5.08 million), with a small non-trainable portion such as fixed weights and batch-normalization statistics totalling 42,023. This methodology uses EfficientNetB0 for fast and effective feature extraction while retaining the ViT's capability for global analysis of complex patterns, forming an efficient and powerful classification system.
Table 4. Summary of proposed hybrid model EfficientViT.
Layer | Output shape | Parameters |
---|---|---|
Input Layer | (None, 224, 224, 3) | 0 |
Hybrid model (Efficient + ViT) Layer | (None, 8) | 5,129,963 |
Total parameters | 5,129,963 | |
Trainable parameters | 5,087,940 | |
Non-trainable parameters | 42,023 |
Dataset description
The Kvasir dataset is a valuable resource for medical image analysis, comprising a collection of endoscopic images from the GI tract. It includes images of clinically significant findings and procedures such as polyp removal, all meticulously annotated by medical experts. Collected at Vestre Viken Health Trust in Norway, the dataset facilitates the development of computer-aided systems for disease detection and analysis in the GI tract. It contains 4000 images across 8 classes, each representing a particular GI disease or normal finding; the dataset is well balanced, with 500 images per class. This open-access dataset, available on Kaggle with diverse image categories and high-quality annotations, serves as a crucial resource for researchers in fields such as computer vision to develop and evaluate algorithms for image recognition, classification, and localization in the medical domain. Figure 3 shows sample images of every class present in the Kvasir dataset23. The second dataset contains 4000 images across four GI classes and is also available on Kaggle.
[See PDF for image]
Fig. 3
Sample images of gastrointestinal disease classes: (a) dyed-lifted-polyps, (b) dyed-resection-margins, (c) ulcerative-colitis, (d) normal-cecum, (e) normal-pylorus, (f) normal-z-line, (g) polyps, (h) esophagitis.
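Assuming the Kaggle release of the dataset is organized as one sub-folder per class, a minimal loading sketch might look like the following; the directory path is a placeholder, and tf.keras.utils.image_dataset_from_directory is used here as it is available in recent TensorFlow releases (older releases expose the same utility under tf.keras.preprocessing).

```python
import tensorflow as tf

IMG_SIZE, BATCH_SIZE = (224, 224), 8

# Placeholder path: one sub-folder per class, e.g. kvasir/polyps/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "kvasir",
    labels="inferred",
    label_mode="categorical",   # one-hot labels for categorical cross-entropy
    image_size=IMG_SIZE,        # every image is resized to 224 x 224
    batch_size=BATCH_SIZE,
    shuffle=True,
    seed=42,
)
class_names = train_ds.class_names   # the eight Kvasir categories
print(class_names)
```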
Results
In this section, the performance of the EfficientViT and MobileNet-ViT models across eight classes is evaluated with fivefold cross-validation. Both models achieve high classification accuracy, as evidenced by the confusion matrices presented in Figs. 4 and 5. The models perform strongly in classes such as dyed-resection-margins, normal-cecum, polyps, and ulcerative colitis, with high true-positive counts. Other metrics, including precision, recall, and F1-score, further support the effectiveness of both models and show balanced performance across all classes in Tables 5 and 6. The ROC curves show strong class-discriminatory ability, with high AUC values across all classes. The loss and accuracy plots indicate a stable training-validation process with very little overfitting. Most misclassifications occurred between closely related classes, for example dyed-lifted-polyps vs. dyed-resection-margins or esophagitis vs. normal-z-line, but these accounted for relatively few instances in each class. Overall, the results show that the models are reliable and effective for multi-class classification, with scope for minor improvements in feature extraction and in handling visually similar classes.
[See PDF for image]
Fig. 4
Confusion matrices for fivefold cross-validation of the hybrid EfficientViT model, showcasing the classification performance across eight classes, highlighting the accuracy and misclassifications for each fold.
[See PDF for image]
Fig. 5
Confusion matrices for fivefold cross-validation of the hybrid MobileNet-ViT model, showcasing the classification performance across eight classes, highlighting the accuracy and misclassifications for each fold.
Table 5. Performance of proposed hybrid EfficientViT model.
Fold | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
---|---|---|---|---|
Fold 1 | 99 | 99 | 99 | 99.10 |
Fold 2 | 100 | 100 | 100 | 100 |
Fold 3 | 100 | 100 | 100 | 100 |
Fold 4 | 100 | 100 | 100 | 100 |
Fold 5 | 100 | 100 | 100 | 100 |
Average | 99.60 | 99.60 | 99.60 | 99.82 |
Table 6. Performance of the proposed hybrid MobileNet and ViT model.
Fold | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
---|---|---|---|---|
Fold 1 | 99 | 99 | 99 | 99 |
Fold 2 | 99.25 | 99.25 | 99.25 | 99 |
Fold 3 | 99.62 | 99.62 | 99.62 | 100 |
Fold 4 | 100 | 100 | 100 | 100 |
Fold 5 | 99.75 | 99.75 | 99.75 | 100 |
Average | 99.52 | 99.52 | 99.52 | 99.60 |
Experimental settings
The code was developed in Python 3.11 using TensorFlow 2.4 and executed on a Windows 11 system with an Intel i7-1350HX CPU, an Nvidia GeForce RTX 4060 GPU, and 32 GB of RAM; the model was also trained on a P100 GPU on Kaggle. The batch size, number of epochs, and learning rate were set to 8, 50, and 0.00001, respectively. To prevent skewed performance assessments, we used fivefold cross-validation: in each fold, 80% of the images are used for training and the remaining 20% for validation.
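A compact sketch of this fivefold protocol is given below. It assumes `images` and `labels` arrays loaded in memory and a `build_model()` helper returning the compiled hybrid model sketched in the methodology section; these names are placeholders, and stratification is used to preserve the class balance in each fold.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

# Assumptions: images is an (N, 224, 224, 3) array, labels an (N,) array of
# integer class ids in [0, 7], and build_model() returns a compiled model.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

for fold, (train_idx, val_idx) in enumerate(skf.split(images, labels), start=1):
    y_train = tf.keras.utils.to_categorical(labels[train_idx], num_classes=8)
    y_val = tf.keras.utils.to_categorical(labels[val_idx], num_classes=8)

    model = build_model()
    model.fit(images[train_idx], y_train,
              validation_data=(images[val_idx], y_val),
              batch_size=8, epochs=50, verbose=2)

    _, acc = model.evaluate(images[val_idx], y_val, verbose=0)
    fold_accuracies.append(acc)
    print(f"Fold {fold}: validation accuracy = {acc:.4f}")

print("Average accuracy:", np.mean(fold_accuracies))
```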
Quantitative results
The fivefold cross-validation results of the hybrid EfficientViT and MobileNet-ViT models show their capability to classify most categories with very high accuracy and minimal errors, as shown in Figs. 4 and 5. The models performed well at distinguishing classes such as ulcerative colitis, normal-pylorus, and polyps, capturing their distinctive features, as shown in Tables 5 and 6. However, closely related classes such as esophagitis and normal-z-line, or normal-cecum and polyps, remain somewhat difficult, probably because their visual patterns overlap and the feature differences are subtle. In Fold 1 there were a few more errors, with 6 false positives and 6 false negatives across categories, while dyed-resection had 101 true positives (TP). Fold 2 was excellent, with only 1 false positive and 1 false negative, indicating a well-balanced split and effective classification at approximately 100% accuracy. Folds 3, 4, and 5 had low error rates, with just 2 false positives and 2 false negatives each, giving consistently reliable accuracy close to 100%. Overall, the model performed robustly, with an average accuracy of 99.82% for the proposed model.
Table 5 shows the excellent performance of the hybrid EfficientViT model on all folds of fivefold cross-validation. In Fold 1, the model obtained high scores with 99% precision, recall, and F1-score, and an accuracy of 99.10%, with a little more misclassification than other folds. From Folds 2 to 5, the model obtained perfect results, with 100% precision, recall, F1-score, and accuracy, with no misclassifications. On average, the model had 99.6% precision, recall, and F1-score, and an overall accuracy of 99.82%, showing that it was robust and able to perform well in complex classification tasks. The slight dip in Fold 1 suggests that minor variations or overlaps in the dataset may have influenced the results, but overall, the model proves highly reliable and efficient.
The confusion matrices in Fig. 5 describe the performance of the hybrid model that integrates ViT and MobileNet, evaluated with fivefold cross-validation across eight classes. Classification accuracy is very strong, with high true-positive counts in all folds, and the model performs well on classes such as dyed-resection-margins, normal-cecum, polyps, and ulcerative colitis with minimal misclassification. In Fold 1, the model produced 6 false positives and 4 false negatives, with most errors arising from confusion between visually similar classes such as esophagitis, normal-pylorus, and polyps. In Fold 2, the model improved, with only 1 false positive and 4 false negatives, indicating greater consistency, although challenges remained in distinguishing dyed-lifted-polyps, dyed-resection-margins, and esophagitis. In Fold 3, while the model achieved 0 false positives, it produced a higher number of false negatives (9), primarily due to misclassifications in the normal-cecum class, highlighting a specific difficulty in identifying this category. Overall, the model performed robustly, although occasional misclassifications indicate potential areas for improvement, such as enhanced feature extraction or addressing class imbalance. The average accuracy of this model is 99.60%.
Table 6 demonstrates the exceptional performance of the hybrid MobileNet-ViT model across the five folds, achieving near-perfect metrics with an average precision, recall, and F1-score of 99.52% and an accuracy of 99.60%. In Fold 1, the model scored 99% across all metrics, while Folds 2 and 3 showed slightly improved precision, recall, and F1-scores of 99.25% and 99.62%, respectively, with 100% accuracy achieved in Fold 3. Fold 4 achieved a perfect score across all metrics, demonstrating flawless classification, and Fold 5 maintained precision, recall, and F1-score at 99.75% with perfect accuracy. These results highlight how well the model generalizes across subsets of the data: it consistently scores high in precision and recall while keeping errors low, making it reliable and highly effective for classification tasks.
Figure 6 illustrates the confusion matrix of the proposed EfficientViT model on Dataset 2, further demonstrating its high classification performance. The diagonal entries show the number of correct predictions for each class, while the off-diagonal entries represent misclassifications. For Esophagitis, 297 out of 300 samples were correctly classified, with 3 samples misclassified as Ulcerative colitis. The Normal class was perfectly classified, with all 300 samples correctly identified, indicating flawless discrimination of normal images. In the Polyps class, 293 out of 300 samples were correctly predicted, with 1 sample misclassified as Esophagitis and 6 as Ulcerative colitis, suggesting minor confusion between similar inflammatory or polypoid lesions. For Ulcerative colitis, 297 out of 300 samples were correctly classified, with 3 samples misclassified as Polyps. Overall, the confusion matrix confirms the high sensitivity and specificity observed in the quantitative metrics, with minimal inter-class misclassification. The slight confusion between Polyps and Ulcerative colitis could be attributed to overlapping visual features in endoscopic imagery. Nonetheless, the EfficientViT model demonstrates excellent diagnostic reliability across all gastrointestinal disease classes in Dataset 2.
[See PDF for image]
Fig. 6
Confusion matrix on dataset 2 of the proposed EfficientViT.
Table 7 presents the performance metrics of the proposed EfficientViT model on Dataset 2, with strong classification results for all classes. The model achieved an overall accuracy of 99.17%, with per-class precision, recall, and F1-score above 97% for every class. Specifically, the Normal class achieved perfect precision, recall, and F1-score (100%), indicating that all normal samples were classified correctly. For Esophagitis and Ulcerative colitis, recall was perfect (100%), so no disease cases were missed, although precision was slightly lower (99%) due to a small number of false positives. The Polyps class achieved slightly lower recall (97%), meaning some polyp samples were misclassified, but its precision remained high (99%), indicating reliable positive predictions. Both macro and weighted averages were 99%, showing that the model is consistent across the well-balanced classes.
Table 7. Performance of the proposed EfficientViT model on dataset 2.
Class | Precision | Recall | F1-score | Support | Accuracy |
---|---|---|---|---|---|
Esophagitis | 0.99 | 1.00 | 0.99 | 300 | 0.9917 |
Normal | 1.00 | 1.00 | 1.00 | 300 | |
Polyps | 0.99 | 0.97 | 0.98 | 300 | |
Ulcerative colitis | 0.99 | 1.00 | 0.99 | 300 | |
Macro avg | 0.99 | 0.99 | 0.99 | 1200 | |
Weighted avg | 0.99 | 0.99 | 0.99 | 1200 |
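Per-class metrics and a confusion matrix of the kind reported in Table 7 and Fig. 6 can be produced with scikit-learn, as in the following sketch; the y_true and y_pred arrays here are synthetic placeholders standing in for the Dataset 2 test labels and the model's argmax predictions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["Esophagitis", "Normal", "Polyps", "Ulcerative colitis"]

# Placeholders: in practice y_true holds the integer class ids of the 1200
# Dataset 2 test images and y_pred = model.predict(x_test).argmax(axis=1).
rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(4), 300)
y_pred = y_true.copy()
flip = rng.choice(len(y_pred), size=10, replace=False)   # a few synthetic errors
y_pred[flip] = rng.integers(0, 4, size=10)

print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
print(confusion_matrix(y_true, y_pred))
```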
Discussion and analysis of results
This section analyzes model performance using various statistical metrics and shows how the EfficientViT model surpasses the MobileNet-ViT approach in accuracy. The model benefits greatly from EfficientNet-B0's ability to extract complex spatial features, while ViT's attention mechanism enhances the feature map's capacity to capture both local and global correlations. On the dataset used for testing, the model achieves impressive accuracy, demonstrating strong performance. The high AUC scores shown in Fig. 9 and the high accuracy for classes such as ulcerative colitis and polyps indicate that these disease types have distinct morphological features that are captured well by the model's attention mechanisms. In contrast, classes such as esophagitis and normal-z-line exhibit slightly lower AUC and accuracy values, probably because of subtle morphological overlap between these categories. Table 8 provides the class-wise performance of the hybrid model.
Table 8. Class-wise performance of EfficientViT on dataset 1.
Class | Accuracy (%) | Precision (%) | Recall (%) |
---|---|---|---|
Esophagitis | 100.00 | 100.00 | 100.00 |
Normal Z-line | 100.00 | 100.00 | 100.00 |
Polyps | 100.00 | 100.00 | 100.00 |
Dyed-lifted polyps | 100.00 | 100.00 | 100.00 |
Ulcerative colitis | 100.00 | 100.00 | 100.00 |
Erythema | 99.16 | 98.33 | 98.00 |
Hemorrhoids | 99.16 | 98.33 | 98.00 |
Normal pylorus | 99.16 | 98.33 | 98.00 |
Average | 99.82 | 99.62 | 99.50 |
Accuracy and loss graph
Figure 7 represents the performance of the EfficientViT for the classification task of GI disease over 50 epochs. Figure 7(a) shows accuracy; the blue curve shows training accuracy that has a sharp rise and peaks near 100%, meaning that the model learned from the training data well. The orange curve traces the validation accuracy, showing a steady rise in early epochs, but does not reach quite the same heights as its training accuracy. After almost 10 epochs, there is a small oscillation in validation accuracy, revealing some variability in the capacity of the model to generalize to unseen data, possibly indicating slight overfitting; however, these oscillations do not strongly affect the overall performance that the model delivers for classifying GI diseases. Figure 7(b) shows the loss, with the blue curve representing the training loss, which starts high but decreases sharply in the initial epochs and stabilizes at a low value, indicating a negligible error on the training dataset. The orange curve represents the validation loss, which also drops significantly early on but remains higher than the training loss. After the initial decline, the validation loss oscillates slightly around a low value, indicating good generalization. Although the steady decline in training loss shows good learning, minor fluctuations in validation loss reflect slight generalization problems. In summary, the model shows outstanding performance in minimizing errors for both training and validation datasets with strong generalization for GI disease classification.
[See PDF for image]
Fig. 7
Accuracy (a) and loss (b) over the training and validation datasets for EfficientViT.
Figure 8 illustrates the model’s performance during training and validation over 50 epochs. Training accuracy rapidly improves, reaching nearly 100% by the 4th epoch, with training loss dropping close to zero, indicating the model effectively fits the training data. Validation accuracy starts high at 87.11%, steadily rising to 99.10%, while validation loss decreases consistently with minor early fluctuations, stabilizing at 0.031 by epoch 50. These trends showcase the model’s impressive ability to learn, its effectiveness in generalizing to new data, and its minimal indications of overfitting, which all point to a well-trained and balanced model.
[See PDF for image]
Fig. 8
Accuracy (a) and loss (b) over the training and validation datasets for MobileNet-ViT.
Receiver operating characteristics (ROC)
The ROC curves in Fig. 9 show how well the proposed hybrid EfficientViT model classifies GI diseases. Each curve corresponds to a different disease, and all AUC scores are 1.00 except esophagitis (AUC 0.98) and normal-z-line (AUC 0.99), indicating a near-flawless ability to distinguish between the GI conditions. This suggests the model is extremely sensitive, detecting true positives reliably, while also being highly specific and avoiding false positives, making it a very efficient tool for the task. Nevertheless, such impressive results can be challenged by real-world conditions: in some classification settings, categories may have overlapping characteristics or too little data, making it harder for models to perform well, and the same could happen in GI disease classification when more complex or subtle cases are encountered. While the current performance is outstanding, room for growth remains; future directions include further fine-tuning and introducing more data so the model handles even the most obscure cases confidently. For now, this model marks a significant step toward a reliable and accurate tool for GI disease classification and holds great promise for real-world clinical applications.
[See PDF for image]
Fig. 9
ROC curves of the EfficientViT model in the classification of the 8 GI Classification dataset 1.
Comparison of the proposed method results
In this section, we compare the class-wise accuracy of both experiments (Tables 8 and 9), compare the proposed hybrid models with Mery et al.24 on the same eight-class gastrointestinal disease dataset, and present a state-of-the-art comparison with other existing models in Tables 12 and 13.
Table 9. Class-wise performance of MobileNet-ViT on dataset 1.
Class | Accuracy (%) | Precision (%) | Recall (%) |
---|---|---|---|
Esophagitis | 98.33 | 97.91 | 98.00 |
Normal Z-line | 98.33 | 98.00 | 98.00 |
Polyps | 97.50 | 96.66 | 97.00 |
Dyed-lifted Polyps | 96.66 | 95.83 | 96.00 |
Ulcerative colitis | 98.33 | 97.91 | 98.00 |
Erythema | 97.50 | 96.66 | 96.00 |
Hemorrhoids | 97.50 | 96.66 | 96.00 |
Normal Pylorus | 98.33 | 97.91 | 98.00 |
Average | 97.93 | 96.96 | 97.25 |
Tables 8 and 9 describe the class-wise performance of the two deep learning models, EfficientViT and MobileNet-ViT, using precision, recall, and F1-score across eight distinct classes of medical images. For Dyed-Lifted Polyps, EfficientViT performs somewhat better on every metric (100 vs. 99.60 precision, 100 vs. 99 recall, and 100 vs. 99.40 F1-score). For Dyed-Resection Margins, EfficientViT again performs exceptionally well, with perfect scores (100 vs. 99 on all measures). For Esophagitis, EfficientViT performs marginally better in recall (99.40 vs. 98) and F1-score (99.20 vs. 98.60), whereas MobileNet-ViT shows marginally higher precision (99.60 vs. 99.20). For Normal-Cecum, all measures reach 99.60 or 100, indicating equal performance from both models. For Normal-Pylorus, MobileNet-ViT narrowly wins in precision (100 vs. 99.40), while all other metrics are similar or nearly equivalent. For Normal-Z-Line, EfficientViT exhibits superior precision (99.60 vs. 98) while MobileNet-ViT achieves perfect recall (100 vs. 99.20), and the F1-scores are equal (99.20 vs. 99.20). Overall, MobileNet-ViT performs similarly and occasionally better, such as in precision for Normal-Pylorus, but EfficientViT consistently receives higher or perfect scores in the majority of classes, especially for difficult categories like Dyed-Lifted Polyps and Dyed-Resection Margins.
Table 10 presents a class-wise summary of the EfficientViT model's performance using two key evaluation metrics: ROC-AUC with 95% confidence intervals and Dice overlap scores. The results demonstrate strong reliability, with an overall accuracy of 99%, a macro-averaged ROC-AUC of 0.9928, an MCC of 0.9856, and a Cohen's Kappa of 0.9856. Class-level performance further confirms this, with specificity values such as 0.99 for Polyps and 0.98 for Ulcerative Colitis, and corresponding sensitivities of 0.98 and 0.99, respectively. We also included the Area Under the Precision-Recall Curve (AUPRC) to ensure robust assessment for low-prevalence classes. Additionally, macro- and micro-averaged ROC curves with 95% confidence intervals have been added to Table 17 for enhanced statistical validity. The highest Dice score was recorded for Normal (0.9986), while Polyps had the lowest (0.9792), likely due to inter-class variability and visual complexity. Collectively, these metrics confirm that EfficientViT is both highly accurate and reliable for fine-grained GI disease classification.
Table 10. Confidence intervals for ROC-AUC on dataset 2.
Class | ROC-AUC | 95% CI | Dice overlap score |
---|---|---|---|
Esophagitis | 0.995 | [0.993–0.997] | 0.9909 |
Polyps | 0.987 | [0.981–0.993] | 0.9792 |
Ulcerative colitis | 0.982 | [0.977–0.989] | 0.9858 |
Normal | 0.998 | [0.997–1.000] | 0.9986 |
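The paper does not state how these 95% confidence intervals were computed; one common approach is a percentile bootstrap over the test predictions, sketched below for a single one-vs-rest class with hypothetical y_true and y_score arrays.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the one-vs-rest ROC-AUC of a single class.

    y_true: binary labels (1 = class of interest), y_score: predicted probability.
    """
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample test cases with replacement
        if len(np.unique(y_true[idx])) < 2:  # skip degenerate resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Example (hypothetical arrays for one class, e.g. Polyps vs. rest):
# auc, (low, high) = bootstrap_auc_ci(y_true_polyps, y_score_polyps)
```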
Table 11 presents a comparative analysis of the EfficientViT model’s performance on two datasets: Dataset 1 (evaluated via fivefold cross-validation) and Dataset 2 (an independent external test set). On Dataset 1, the model achieved exceptionally high results with an average accuracy of 99.82%, precision of 99.60%, recall of 99.60%, and F1-score of 99.60%, demonstrating consistent reliability across folds. When evaluated on Dataset 2, unseen during training, the model maintained robust generalisation with an accuracy of 99.17%, precision of 99.00%, recall of 99.25%, and F1-score of 99.00%. The marginal drop in metrics, particularly in recall and F1-score, suggests minimal overfitting and confirms that EfficientViT retains strong predictive performance even on novel data.
Table 11. EfficientViT performance on dataset 1 vs dataset 2.
Metric | Dataset 1 (5-Fold average) | Dataset 2 (External test set) |
---|---|---|
Accuracy (%) | 99.82 | 99.17 |
Avg. precision (%) | 99.60 | 99.00 |
Avg. recall (%) | 99.60 | 99.25 |
Avg. F1-Score (%) | 99.60 | 99.00 |
ROC-AUC | 0.9944 | 0.9928 |
MCC | 0.9856 | 0.9856 |
Cohen’s Kappa | 0.9856 | 0.9856 |
Figure 10 shows that EfficientViT is the winning model, giving the best scores on all evaluation metrics; it performed better than the MobileNet-ViT model and the model of Mery et al.24. EfficientViT achieved 99.6% precision, recall, and F1-score and 99.82% accuracy. MobileNet-ViT comes next with slightly lower but still competitive values, showing strong performance without quite matching EfficientViT. The model of Mery et al.24 trails both hybrid models, with its uniform score of 99.52% falling short of the best results. Overall, EfficientViT provides the best and most balanced results and is therefore considered the most efficient of the three models.
[See PDF for image]
Fig. 10
Comparison of proposed hybrid models with Mery et al.24 in the classification of 8 Gastrointestinal disease classifications on the same dataset 1.
The comparison in Table 12 highlights the exceptional performance of our proposed hybrid models, EfficientViT and MobileNet-ViT, in gastrointestinal (GI) disease classification. Our models achieved accuracies of 99.82% and 99.60%, respectively, outperforming many state-of-the-art techniques. For instance, methods such as Capsule Networks (99.72%) and the Deep Hexa Model (99.32%) demonstrated strong results but fell slightly short of our models' accuracy. Similarly, EfficientNet B1 (98.94%) and ResNet50 (98%) performed well but did not match the precision of our hybrid approach. Other models, such as InceptionResNetV2 (85.7%) and CNN + ViT (94%), showed lower accuracy, highlighting the benefit of integrating EfficientNet and MobileNet backbones with Vision Transformers. This performance stems from robust feature extraction combined with attention-based mechanisms, which makes EfficientViT highly effective at discriminating between several types of GI disease. The results underscore the robustness and potential of our approach for enhancing diagnostic accuracy and clinical decision-making in GI disease classification.
Table 12. Performance comparison with other methods under different experimental settings.
Authors | Model | Dataset | Accuracy (%) |
---|---|---|---|
Almarshad et al.21 | Capsule Network | Kvasir Balanced dataset (8 different types of GI disease) | 99.72 |
Mery et al.24 | Resnet50 | 98 |
Otaibi et al.25 | EfficientNet B1 | 98.94 | |
Ghany et al.26 | EfficientNet-B0, ResNet101v, InceptionResNetV2 | 97.77, 97.375, 98.06 | |
Ajitha et al.27 | Deep SS-Hexa | 99.16 | |
Patel et al.27 | EfficientNetB5 | 92.58 | |
Ali et al.28 | InceptionResNetV2 | 85.7 | |
Ganesh et al.29 | CNN + ViT | 94 | |
Jagarajan et al.30 | Mask RCNN | 98.8 | |
Linu et al.31 | Deep Hexa Model | 99.32 | |
Proposed | EfficientViT / MobileNet-ViT | 99.82 / 99.60 |
Table 13 presents a comprehensive comparison of the proposed EfficientViT model with two prominent Transformer-based architectures, namely TinyViT and Swin Transformer, evaluated on Dataset 2. The performance is assessed using multiple metrics, including Accuracy, Loss, ROC-AUC, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, to ensure robust and multidimensional evaluation. The results demonstrate that EfficientViT outperforms both TinyViT and Swin Transformer across all evaluation metrics. Specifically, EfficientViT achieved an accuracy of 99%, which is 4% higher than the competing models, indicating superior classification capability. Furthermore, the loss value of 0.0038 reflects a significantly better model fit and lower prediction error compared to TinyViT (0.0174) and Swin Transformer (0.0213). In terms of generalization performance, EfficientViT achieved a ROC-AUC of 0.9928, which is notably higher than TinyViT (0.9683) and Swin Transformer (0.9700), confirming its enhanced discriminatory ability between classes. The MCC and Cohen’s Kappa values of 0.9856 for EfficientViT further validate its near-perfect correlation and agreement with ground truth labels, outperforming both comparative models, which exhibited MCC and Kappa values around 0.94. These superior outcomes can be attributed to the architectural design of EfficientViT, which combines transformer-based global attention mechanisms with efficient convolutional feature extraction, enabling it to achieve high predictive accuracy with minimal computational overhead.
Table 13. Compare EfficientViT with other models, on dataset 2.
Model | Accuracy | Loss | ROC-AUC | MCC | Cohen’s Kappa |
---|---|---|---|---|---|
EfficientViT | 0.99 | 0.0038 | 0.9928 | 0.9856 | 0.9856 |
TinyViT | 0.95 | 0.0174 | 0.9683 | 0.9367 | 0.9367 |
Swin transformer | 0.95 | 0.0213 | 0.9700 | 0.9401 | 0.9400 |
MobileViT | 0.97 | 0.0125 | 0.9745 | 0.9480 | 0.9480 |
MobileNetV2 | 0.91 | 0.0320 | 0.9240 | 0.8721 | 0.8700 |
EfficientNetB0 | 0.93 | 0.0260 | 0.9410 | 0.8965 | 0.8950 |
Table 14 compares the computational complexity and training efficiency of EfficientViT with other models on Dataset 2, demonstrating that EfficientViT achieves the highest accuracy of 99% with only 4.8 million parameters and 1.27 billion FLOPs, outperforming TinyViT and Swin Transformer, which achieve 95% accuracy with higher parameters (5.6 M and 28.0 M) and FLOPs (1.00B and 4.5B), respectively. Furthermore, EfficientViT exhibits a shorter training time (1956.35 s) than TinyViT (2450 s) and Swin Transformer (5040 s) while only marginally exceeding the lightweight MobileNet (1200 s) and EfficientNetB0 (1350 s), both of which yield a substantially lower accuracy of 91%. These results indicate that EfficientViT provides an optimal balance between computational cost, model complexity, and performance, making it a superior and practical choice for high-accuracy, resource-efficient medical image classification tasks.
Table 14. Compare the parameters, FLOP and training time of the proposed model with other models on dataset 2.
Model | Parameters (M) | FLOPs (B) | Training Time (s) | Accuracy |
---|---|---|---|---|
EfficientViT | 5.1 | 1.27 | 1956.35 | 99 |
TinyViT | 5.6 | 1.00 | 2450 | 95 |
Swin Transformer | 28.0 | 4.5 | 5040 | 95 |
EfficientNetB0 | 5.3 | 0.39 | 1350 | 91 |
MobileNet | 3.5 | 0.30 | 1200 | 91 |
Ablation study
To assess how each component impacts our proposed model, we conduct an ablation study. Here, we remove or modify individual elements of the architecture systematically to better understand their impact on the model performance. Our EfficientViT model has three main components.
EfficientNetB0 feature extraction
EfficientNetB0 focuses strongly on local feature extraction and serves as the backbone of our first experiment, providing critical feature maps for further processing. Modification: EfficientNet-B0 was replaced with a standard CNN with a similar number of parameters. Results: accuracy dropped from 99.82% (full model) to 91.30%, showing that EfficientNet-B0 is superior for feature extraction.
ViT transformer block
The transformer block is responsible for capturing global dependencies in the feature maps. Modification: the transformer block was removed, and classification was performed directly on the feature maps from EfficientNetB0. Results: accuracy dropped to 93.12%, underscoring the transformer block's role in enhancing feature representation.
Table 15 compares our proposed EfficientViT model with baseline Vision Transformer (ViT) configurations of different depths. Although both EfficientViT and the ViT baseline use two transformer blocks, EfficientViT achieves markedly higher accuracy (99.17%) and ROC-AUC (0.9928) than the standard ViT (accuracy 95.00%, ROC-AUC 0.9700). Increasing the ViT depth to four blocks improves accuracy only modestly, to 98.00%, while noticeably raising the computational cost in FLOPs and parameters. This gap highlights the effectiveness of EfficientViT's hybrid design, which fuses EfficientNetB0-based local CNN features with ViT-based global context through a late-fusion strategy. The results demonstrate that architectural synergy between the convolutional and transformer pathways is more effective than simply deepening the ViT, yielding better generalization at lower resource usage for medical image classification.
Table 15. Comparison of EfficientViT vs. ViT with varying depths.
Model | ViT Blocks | Accuracy (%) | ROC-AUC | Params (M) | FLOPs (B) | Remarks |
---|---|---|---|---|---|---|
EfficientViT (Ours) | 2 | 99.17 | 0.9928 | 5.1 | 1.27 | CNN + ViT fusion (2 blocks), late fusion |
ViT (2 Blocks) | 2 | 95.00 | 0.9700 | 3.8 | 1.10 | Self-attention only; lacks local context |
ViT (4 Blocks) | 4 | 98.00 | 0.9800 | 5.5 | 1.50 | Deeper ViT, no CNN features |
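For context, the snippet below (using the same illustrative encoder settings as above, which are assumptions rather than the paper's configuration) shows how encoder parameters grow when only the ViT depth is increased, as in the last two rows of Table 15.

```python
# Illustrative only: parameter growth with ViT depth, under assumed dimensions.
import torch.nn as nn

def encoder_params_m(depth: int, dim: int = 256, heads: int = 8) -> float:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=depth)
    return sum(p.numel() for p in encoder.parameters()) / 1e6

for depth in (2, 4):
    print(f"ViT encoder with {depth} blocks: {encoder_params_m(depth):.2f} M parameters")
```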
Fusion of EfficientNetB0 and transformer features
The fusion step combines the local and global features extracted by EfficientNetB0 and the transformer block. Modification: the fusion operation was replaced with a simple concatenation of the two feature sets, as given in Eq. (16). Result: accuracy decreased to 95.47%, demonstrating the benefit of the proposed fusion mechanism.
$$Z_{\text{Fusion}} = \mathrm{Concat}(F_{\text{EffNet}}, Z_{\text{ViT}}) \tag{16}$$
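To make the ablation concrete, the sketch below contrasts the plain concatenation of Eq. (16) with one possible learned-fusion head; the learned-fusion form shown here is an illustrative assumption, since the paper's exact fusion operator is defined in the methodology.

```python
# Minimal sketch: concatenation (ablation) vs. a hypothetical learned fusion head.
import torch
import torch.nn as nn

class ConcatHead(nn.Module):
    """Ablation variant: Z_Fusion = Concat(F_EffNet, Z_ViT), then classify."""
    def __init__(self, cnn_dim=1280, vit_dim=256, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(cnn_dim + vit_dim, num_classes)

    def forward(self, f_effnet, z_vit):
        return self.fc(torch.cat([f_effnet, z_vit], dim=1))

class LearnedFusionHead(nn.Module):
    """Illustrative fused head: project both branches to a shared space before classifying."""
    def __init__(self, cnn_dim=1280, vit_dim=256, fused_dim=256, num_classes=8):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, fused_dim)
        self.proj_vit = nn.Linear(vit_dim, fused_dim)
        self.fc = nn.Linear(fused_dim, num_classes)

    def forward(self, f_effnet, z_vit):
        fused = torch.relu(self.proj_cnn(f_effnet) + self.proj_vit(z_vit))
        return self.fc(fused)

f, z = torch.randn(2, 1280), torch.randn(2, 256)
print(ConcatHead()(f, z).shape, LearnedFusionHead()(f, z).shape)
```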
Effects of each component
The ablation study indicates the contribution of each element of the proposed model. EfficientNetB0 alone extracted local features with an accuracy of 92.13%, whereas the comparably lightweight MobileNet reached a lower 91%. The ViT alone provided a better representation of the global context, achieving 97.25%. Combining MobileNet with ViT improved performance further to 99.60%, showing the merit of pairing lightweight backbones with a global feature extraction mechanism.
The best accuracy, 99.82% with a loss of 0.0038, was obtained by the proposed EfficientViT model, which establishes the superiority of its fusion of local and global features. Each component therefore proves essential to reaching state-of-the-art performance; in particular, it is the fusion mechanism of EfficientViT that places this model ahead of the others. Table 16 evaluates the effect of each component separately.
Table 16. Effects of different components on model performance.
Experiment | Accuracy (%) | Loss | Comments |
---|---|---|---|
EfficientNetB0 | 92.13 | 0.6790 | High local feature extraction |
MobileNet | 91 | 0.8950 | local feature extraction |
ViT | 97.25 | 0.1166 | Improved global context |
MobileNet-ViT | 99.60 | 0.0215 | Higher accuracy |
EfficientViT | 99.82 | 0.0038 | Best performance with the proposed fusion
Table 17 presents the results of paired t-tests performed to assess the statistical significance of the performance differences between EfficientViT and the other transformer-based models. The comparison of EfficientViT with ViT yielded a t-statistic of 7.1264 (p < 0.0001), corresponding to a mean accuracy improvement of 5.33%, while the comparison with Swin Transformer produced a t-statistic of 5.3086 (p < 0.0001) and a mean accuracy difference of 3.42%. Both p-values fall below the conventional significance threshold of 0.05, confirming that EfficientViT's superior performance is statistically significant relative to both ViT and Swin Transformer. These findings indicate that the observed improvements are not due to random variation but reflect genuine gains attributable to the EfficientViT architecture.
Table 17. Statistical significance (Paired t-tests).
Comparison | T-Statistic | P-Value | Mean accuracy difference | Significance |
---|---|---|---|---|
EfficientViT vs. ViT | 7.1264 | < 0.0001 | 0.0533 | Statistically significant
EfficientViT vs. Swin-T | 5.3086 | < 0.0001 | 0.0342 | Statistically significant
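The paired t-tests in Table 17 can be reproduced as sketched below, assuming the paired observations are the per-fold accuracies of the two models; the fold values shown are hypothetical placeholders, not the study's data.

```python
# Minimal sketch (assumption: SciPy; fold accuracies below are hypothetical).
from scipy.stats import ttest_rel

efficientvit = [0.998, 0.991, 0.993, 0.990, 0.992]   # placeholder per-fold accuracies
vit_baseline = [0.948, 0.952, 0.939, 0.945, 0.951]   # placeholder per-fold accuracies

t_stat, p_value = ttest_rel(efficientvit, vit_baseline)
mean_diff = sum(a - b for a, b in zip(efficientvit, vit_baseline)) / len(efficientvit)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}, mean accuracy difference = {mean_diff:.4f}")
```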
Table 18 shows the impact of the patch embedding layer, which transforms dense CNN features into a ViT-compatible token sequence, enabling self-attention computation. Adding positional embeddings allows the transformer to recognise the relative positions of image patches, information that CNNs obtain implicitly through spatial convolution but ViTs require explicitly. The patch embedding step also reduces dimensionality while preserving semantic richness, contributing to computational efficiency and reducing overfitting. To quantify its impact, we conducted an ablation experiment in which the patch embedding operation (flattening, linear projection, and positional encoding) was bypassed and the CNN features were fed directly into the transformer block. The results in Table 18 show that removing the patch embedding leads to a drop of over 2% in accuracy and a noticeable increase in loss, confirming that this layer is critical for effective global context modeling through the ViT module. A minimal sketch of the patch embedding step is given after Table 18.
Table 18. Impact of the patch embedding.
Configuration | Accuracy (%) | F1 Score (%) | Loss |
---|---|---|---|
With patch embedding (original) | 99.82 | 99.60 | 0.0038 |
Without patch embedding (no reshaping) | 97.42 | 96.89 | 0.0427 |
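The sketch below shows the patch embedding step under assumed dimensions, following the standard ViT recipe of flattening, linear projection, and a learnable positional embedding; it is an illustration rather than the authors' exact layer.

```python
# Minimal sketch of patch embedding on CNN feature maps; dimensions are illustrative.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, channels: int = 1280, dim: int = 256, num_patches: int = 49):
        super().__init__()
        self.proj = nn.Linear(channels, dim)                       # linear projection
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # positional embedding

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        tokens = feats.flatten(2).transpose(1, 2)                  # flatten to (B, H*W, C)
        return self.proj(tokens) + self.pos                        # project and add positions

tokens = PatchEmbedding()(torch.randn(2, 1280, 7, 7))
print(tokens.shape)  # torch.Size([2, 49, 256]) -- ViT-compatible token sequence
```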
Visualizations of the results
Figure 11 presents Grad-CAM visualisations and transformer attention maps, demonstrating that the model consistently focuses on clinically relevant anatomical regions in endoscopic images. The Grad-CAM heatmaps highlight discriminative features, while the attention maps extracted from the transformer layers show token-wise contributions, enhancing clinical interpretability. In addition, Fig. 12 presents the distribution of prediction confidence scores, which shows that the majority of predictions are made with high confidence, indicating model reliability.
[See PDF for image]
Fig. 11
Grad-CAM visualisations and transformer attention.
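For readers unfamiliar with the technique, the snippet below is a generic Grad-CAM sketch (not the authors' visualisation code): class-score gradients are average-pooled over a chosen convolutional layer and used to weight its activations, producing heatmaps of the kind shown in Fig. 11.

```python
# Generic Grad-CAM sketch on a stand-in backbone; illustrative only.
import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b0

model = efficientnet_b0(num_classes=8).eval()
target_layer = model.features[-1]                     # last convolutional stage
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

image = torch.randn(1, 3, 224, 224)                   # stand-in endoscopic image
score = model(image)[0].max()                         # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
cam = F.relu((weights * acts["v"]).sum(dim=1))        # weighted activation map
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalised heatmap in [0, 1]
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```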
Figure 12 shows the distribution of confidence scores for the model's predictions. The histogram indicates that most predictions have confidence scores close to 1.0, i.e., the model is highly confident about its outputs for the majority of inputs. This strong skew towards high confidence implies that the model is rarely indecisive and corresponds to low prediction entropy, which is favourable in deployment settings because few low-confidence predictions would need to be examined manually. Nonetheless, although high confidence is advantageous, care must be taken to verify that these confident predictions are indeed accurate, as overconfidence can otherwise lead to high-probability misclassifications.
[See PDF for image]
Fig. 12
Distribution of prediction confidence scores.
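The confidence distribution of Fig. 12 can be reproduced as sketched below, assuming the per-class softmax outputs are available; the dummy probabilities only illustrate the computation of maximum softmax confidence and prediction entropy.

```python
# Minimal sketch: confidence histogram and prediction entropy from softmax outputs.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.full(8, 0.1), size=1000)     # dummy, sharply peaked predictions

confidence = probs.max(axis=1)                        # max softmax probability per sample
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

plt.hist(confidence, bins=50)
plt.xlabel("Prediction confidence (max softmax probability)")
plt.ylabel("Count")
plt.title("Distribution of prediction confidence scores")
plt.show()
print(f"Mean confidence: {confidence.mean():.3f}, mean entropy: {entropy.mean():.3f} nats")
```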
Conclusion
This study leveraged the EfficientViT and MobileNet-ViT models to classify eight categories of gastrointestinal (GI) diseases. The proposed EfficientViT framework combines EfficientNetB0 for robust local feature extraction with Vision Transformers (ViTs) for modeling global dependencies, achieving superior performance in multi-class classification tasks. By combining these complementary strengths, EfficientViT demonstrated exceptional reliability and accuracy, reaching 99.82% accuracy in Experiment 1, while MobileNet-ViT achieved a competitive 99.60% in Experiment 2. The fivefold cross-validation results confirmed the model's generalizability, with near-perfect accuracy that surpasses state-of-the-art techniques despite minor misclassifications in overlapping categories. This highlights EfficientViT's potential to enhance automated GI disease diagnosis and clinical decision-making. Future work will improve the feature extraction technique to better differentiate visually similar categories and explore real-world clinical applications with diverse, large-scale datasets. Another direction is the integration of federated learning frameworks to address the privacy of medical data: federated learning would allow the model to be trained collaboratively across multiple institutions while keeping all data private and compliant with regulatory standards.
Author contributions
Conceptualization, Vishesh Tanwar, Dhirendra Prasad Yadav, Bhisham Sharma; Data curation, Vishesh Tanwar, Dhirendra Prasad Yadav, Bhisham Sharma; Formal analysis, Abolfazl Mehbodniya; Investigation, Abolfazl Mehbodniya; Methodology, Vishesh Tanwar, Dhirendra Prasad Yadav, Bhisham Sharma; Project administration, Abolfazl Mehbodniya; Resources, Abolfazl Mehbodniya; Software, Bhisham Sharma; Visualization, Vishesh Tanwar, Bhisham Sharma; Writing – original draft, Vishesh Tanwar, Dhirendra Prasad Yadav, Bhisham Sharma; Writing – review & editing, Abolfazl Mehbodniya.
Funding
This research received no external funding.
Data availability
The data of the present study can be downloaded from https://www.Kaggle.Com/Datasets/Meetnagadia/Kvasir-Dataset, and the second dataset is available at https://www.kaggle.com/datasets/visheshtanwar26/colondataset.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Du, W; Rao, N; Liu, D; Jiang, H; Luo, C; Li, Z; Gan, T; Zeng, B. Review on the applications of deep learning in the analysis of gastrointestinal endoscopy images. IEEE Access; 2019; 7, pp. 142053-142069. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2944676]
2. Kobayashi, S; Saltz, JH; Yang, VW. State of machine and deep learning in histopathological applications in digestive diseases. World J. Gastroenterol.; 2021; 27,
3. Ham, H-S; Lee, H-S; Chae, J-W; Cho, HC; Cho, H-C. Improvement of gastroscopy classification performance through image augmentation using a gradient-weighted class activation map. IEEE Access; 2022; 10, pp. 99361-99369. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3207839]
4. Guimarães, P; Finkler, H; Reichert, MC; Zimmer, V; Grünhage, F; Krawczyk, M; Lammert, F; Keller, A; Casper, M. Artificial-intelligence-based decision support tools for the differential diagnosis of colitis. Eur. J. Clin. Investig.; 2023; [DOI: https://dx.doi.org/10.1111/eci.13960]
5. Solitano, V; Zilli, A; Franchellucci, G; Allocca, M; Fiorino, G; Furfaro, F; D’Amico, F; Danese, S; Al Awadhi, S. Artificial endoscopy and inflammatory bowel disease: Welcome to the future. J. Clin. Med.; 2022; 11,
6. Pannu, HS; Ahuja, S; Dang, N; Soni, S; Malhi, AK. Deep learning-based image classification for intestinal hemorrhage. Multimed. Tool. Appl.; 2020; 79,
7. Khorasani, HM; Usefi, H; Peña-Castillo, L. Detecting ulcerative colitis from colon samples using efficient feature selection and machine learning. Sci. Rep.; 2020; 10,
8. Obayya, M; Al-Wesabi, FN; Maashi, M; Mohamed, A; Hamza, MA; Drar, S; Yaseen, I; Alsaid, MI. Modified salp swarm algorithm with deep learning based gastrointestinal tract disease classification on endoscopic images. IEEE Access; 2023; 11, pp. 25959-25967. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3256084]
9. Fati, SM; Senan, EM; Azar, AT. Hybrid and deep learning approach for early diagnosis of lower gastrointestinal diseases. Sensor; 2022; 22,
10. Owais, M; Arsalan, M; Mahmood, T; Kang, JK; Park, KR. Automated diagnosis of various gastrointestinal lesions using a deep learning-based classification and retrieval framework with a large endoscopic database: Model development and validation. J. Med. Internet Res.; 2022; 22,
11. Attallah, O; Sharkas, M. GASTRO-CADx: A three stages framework for diagnosing gastrointestinal diseases. PeerJ. Comput. Sci.; 2021; 7, [DOI: https://dx.doi.org/10.7717/peerj-cs.423] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33817058][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959662]e423.
12. Iqbal, I; Walayat, K; Kakar, MU; Ma, J. Automated identification of human gastrointestinal tract abnormalities based on deep convolutional neural network with endoscopic images. Intell. Syst. Appl.; 2022; 16, [DOI: https://dx.doi.org/10.1016/j.iswa.2022.200149] 200149.
13. Komeda, Y; Handa, H; Matsui, R; Hatori, S; Yamamoto, R; Sakurai, T; Takenaka, M; Hagiwara, S; Nishida, N; Kashida, H; Watanabe, T; Kudo, M. Artificial intelligence-based endoscopic diagnosis of colorectal polyps using residual networks. PLoS ONE; 2021; 16,
14. Wang, W; Yang, X; Li, X; Tang, J. Convolutional-capsule network for gastrointestinal endoscopy image classification. Int. J. Intell. Syst.; 2022; 37,
15. Hossain, MS; Rahman, MdM; Syeed, MM; Uddin, MF; Hasan, M; Hossain, MdA; Ksibi, A; Jamjoom, MM; Ullah, Z; Samad, MA. DeepPoly: Deep learning-based polyps segmentation and classification for autonomous colonoscopy examination. IEEE Access; 2023; 11, pp. 95889-95902. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3310541]
16. Naz, J; Sharif, MI; Sharif, MI; Kadry, S; Rauf, HT; Ragab, AE. A comparative analysis of optimization algorithms for gastrointestinal abnormalities recognition and classification based on ensemble XcepNet23 and ResNet18 features. Biomedicines; 2023; 11,
17. Mahmood, S; Fareed, MMS; Ahmed, G; Dawood, F; Zikria, S; Mostafa, A; Jilani, SF; Asad, M; Aslam, M. A robust deep model for classification of peptic ulcer and other digestive tract disorders using endoscopic images. Biomedicines; 2022; 10,
18. Buendgens, L; Cifci, D; Ghaffari Laleh, N; van Treeck, M; Koenen, MT; Zimmermann, HW; Herbold, T; Lux, TJ; Hann, A; Trautwein, C; Kather, JN. Weakly supervised end-to-end artificial intelligence in gastrointestinal endoscopy. Sci. Rep.; 2022; 12,
19. Wafa, A. A., Essa, R. M., Abohany, A. A., Abdelkader, H. E. Integrating deep learning for accurate gastrointestinal cancer classification: A comprehensive analysis of MSI and MSS patterns using histopathology data. Neural Comput. Appl.; 2024; 36(34), pp. 21273-21305. [DOI: https://doi.org/10.1007/s00521-024-10287-y]
20. Xiao, P; Pan, Y; Cai, F; Tu, H; Liu, J; Yang, X; Liang, H; Zou, X; Yang, L; Duan, J; Xv, L; Feng, L; Liu, Z; Qian, Y; Meng, Y; Du, J; Mei, X; Lou, T; Yin, X; Tan, Z. A deep learning-based framework for the classification of multi-class capsule gastroscope image in gastroenterology diagnosis. Front. Physiol.; 2022; [DOI: https://dx.doi.org/10.3389/Phys.2022.1060591] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36703930][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9748093]
21. Almarshad, FA; Balaji, P; Syed, L; Aljohani, E; Dharmarajlu, SM; Vaiyapuri, T; AlAseem, NA. An efficient optimal capsnet model-based computer-aided diagnosis for gastrointestinal cancer classification. IEEE Access; 2024; 12, pp. 137237-137246. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3442831]
22. Khan, MA; Kadry, S; Alhaisoni, M; Nam, Y; Zhang, Y; Rajinikanth, V; Sarfraz, MS. Computer-aided gastrointestinal diseases analysis from wireless capsule endoscopy: A framework of best features selection. IEEE Access; 2020; 8, pp. 132850-132859. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3010448]
23. Pogorelov, K., Randel, K. R., Griwodz, C., Eskeland, S. L., de Lange, T., Johansen, D., Spampinato, C., Dang-Nguyen, D.-T., Lux, M., Schmidt, P. T., Riegler, M., Halvorsen, P. KVASIR: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference (MMSys'17), pp. 164-169. [DOI: https://doi.org/10.1145/3083187.3083212] (New York, NY, USA, 2017).
24. Mary, XA; Raj, APW; Evangeline, CS; Neebha, TM; Kumaravelu, VB; Manimegalai, P. Multi-class classification of gastrointestinal diseases using deep learning techniques. Open Biomed. Eng. J.; 2023; [DOI: https://dx.doi.org/10.2174/18741207-v17-e230215-2022-HT27-3589-11]
25. Al-Otaibi, S; Rehman, A; Mujahid, M; Alotaibi, S; Saba, T. Efficient-gastro: Optimized EfficientNet model for the detection of gastrointestinal disorders using transfer learning and wireless capsule endoscopy images. PeerJ. Comput. Sci.; 2024; 10, [DOI: https://dx.doi.org/10.7717/peerj-cs.1902] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38660212][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11041956]e1902.
26. El-Ghany, SA; Mahmood, MA; Abd El-Aziz, AA. An accurate deep learning-based computer-aided diagnosis system for gastrointestinal disease detection using wireless capsule endoscopy image analysis. Appl. Sci.; 2024; 14,
27. Patel, V., Patel, K., Goel, P., Shah, M. Classification of Gastrointestinal Diseases from Endoscopic Images Using Convolutional Neural Network with Transfer Learning In 2024 5th International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) 504–508 https://doi.org/10.1109/ICICV62344.2024.00085 (IEEE, 2024).
28. Ali, A; Iqbal, A; Khan, S; Ahmad, N; Shah, S. A two-phase transfer learning framework for gastrointestinal diseases classification. PeerJ Comput. Sci.; 2024; 10, [DOI: https://dx.doi.org/10.7717/peerj-cs.2587] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39896396][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11784777]e2587.
29. Ganesh, B., Annamalai, R., Jayapal, S. Enhancing Gastrointestinal Image Classification: A Fusion of CNN and Vision Transformers In 2024 IEEE International Conference on Contemporary Computing and Communications (InC4), 1–6 https://doi.org/10.1109/InC460750.2024.10649204 (IEEE 2024).
30. Jagarajan, M; Jayaraman, R. AI in gastrointestinal disease detection: Overcoming segmentation challenges with Coati optimization strategy. Evol. Syst.; 2025; 16,
31. Linu Babu, P; Jana, S. Gastrointestinal tract disease detection via deep learning-based duo-feature optimized hexa-classification model. Biomed. Signal Process. Control; 2025; 100, [DOI: https://dx.doi.org/10.1016/j.bspc.2024.106994] 106994.
32. Ahamed, MF; Nahiduzzaman, M; Islam, MR; Naznine, M; Arselene Ayari, M; Khandakar, A; Haider, J. Detection of various gastrointestinal tract diseases through a deep learning method with ensemble ELM and explainable AI. Expert Syst. Appl.; 2024; 256,
33. Ahamed, MF; Shafi, FB; Nahiduzzaman, M; Ayari, MA; Khandakar, A. Interpretable deep learning architecture for gastrointestinal disease detection: A Tri-stage approach with PCA and XAI. Comput. Biol. Med.; 2025; 185,
34. Hussain, T; Shouno, H; Mohammed, MA; Marhoon, HA; Alam, T; Yang, H-Y. EFFResNet-ViT: A fusion-based convolutional and vision transformer model for explainable medical image classification. IEEE Access; 2025; 13, pp. 54040-54068. [DOI: https://dx.doi.org/10.1109/ACCESS.2025.3554184]
35. Hussain, T; Shouno, H. MAGRes-UNet: Improved medical image segmentation through a deep learning paradigm of multi-attention gated residual U-Net. IEEE Access; 2024; 12, pp. 40290-40310. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3374108]
36. Hussain, T; Shouno, H; Mohammed, MA; Marhoon, HA; Alam, T. DCSSGA-UNet: Biomedical image segmentation with DenseNet channel spatial and semantic guidance attention. Knowl.-Based Syst.; 2025; 314, [DOI: https://dx.doi.org/10.1016/j.knosys.2025.113233] 113233.
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
GI diseases are among the leading causes of morbidity and mortality worldwide, and early and accurate diagnosis is considered very important. Traditional methods such as endoscopy take time and depend heavily on the judgment of the physician. The proposed Efficient Vision Transformer (EfficientViT) is a new deep learning-based model that combines EfficientNetB0 with the Vision Transformer (ViT) for the classification of eight different types of diseases of the GI system. EfficientViT uses EfficientNetB0 to capture local textures and multi-scale features that reflect structural changes in the GI tract, while the ViT component recognizes the broader context of GI images to detect subtle disease patterns and precursors of disease spread. Furthermore, we designed a dual block in which the input is divided into two parts (q1, q2) to better optimize the model: q1 is processed through EfficientNet for local details and q2 through an encoder block that captures global dependencies, enabling EfficientViT to attend to multiple image regions simultaneously. We tested the model using fivefold cross-validation and achieved an outstanding accuracy of 99.82%, compared with 99.60% for the MobileNetV2-based model. In addition, EfficientViT demonstrated excellent precision, recall, and F1 scores. Overall, our model outperforms existing methods, offering a promising tool for clinicians to diagnose GI diseases from endoscopic images more reliably and accurately.
Author affiliations
1 Chitkara University, Chitkara University Institute of Engineering and Technology, Rajpura, India (GRID:grid.428245.d) (ISNI:0000 0004 1765 3753)
2 Chitkara University, Centre of Research Impact and Outcome, Rajpura, India (GRID:grid.428245.d) (ISNI:0000 0004 1765 3753)
3 GLA University, Department of Computer Engineering & Applications, Mathura, India (GRID:grid.448881.9) (ISNI:0000 0004 1774 2318)
4 Kuwait College of Science and Technology (KCST), Department of Electronics and Communication Engineering, Kuwait City, Kuwait (GRID:grid.510476.1) (ISNI:0000 0004 4651 6918)