Introduction
A brain tumor is a life-threatening and complicated medical condition affecting the quality of life of millions of people globally. Due to their complex nature, there is a massive demand for precise diagnosis and efficient treatment1. Brain tumors pose a significant global health concern. The Surveillance, Epidemiology, and End Results (SEER) program reports approximately 6.2 new cases of brain and other nervous system tumors per 100,000 people each year, with 4.4 deaths per 100,000 people, and a five-year relative survival rate of approximately 33.4% for those diagnosed2.
The literature shows that benign and malignant tumors differ vastly in behaviour, prognosis, and response to therapy3–5. Early identification and accurate tumor localization and categorization are critical for efficient brain tumor treatment and enhanced patient outcomes6. Magnetic Resonance Imaging (MRI) is the most often utilized neuroimaging technology for detecting and analyzing brain tumors due to its superior contrast resolution and ability to visualize soft tissues7,8. Still, manual interpretation of MRI results requires significant time and effort and is vulnerable to inter- and intra-observer variability. Automated and reliable methods for brain tumor diagnosis and treatment planning are therefore extremely important. The study of medical imaging has advanced significantly in recent years due to breakthroughs in machine learning, deep learning, and imaging technology, and this revolution has had a substantial impact on the segmentation and classification of brain tumors9.
Abnormalities in the growth of the brain cells cause brain tumors, which can be either benign or malignant10. Tumors in medical images are identified by segmenting and defining Regions of Interest (RoI) using imaging techniques such as MRI or CT scans. Segmentation is crucial for diagnosis, prognosis, and treatment planning. Brain tumor segmentation faces challenges due to anatomical diversity, image noise, and tumor size and shape variations, making the design of robust segmentation algorithms complicated11,12. Accurate classification of brain tumors helps medical professionals execute surgery more effectively and helps communicate prior information to patients and their families. This may eventually result in enhanced patient outcomes and advances in cancer research13,14.
Deep learning-based approaches have demonstrated great potential in medical image segmentation in recent years. The UNet architecture has grown in prominence due to its ability to perform semantic segmentation tasks efficiently. It uses a contracting path to collect context and a symmetric expanding path to achieve exact localization15. The UNet model is a convolutional neural network with an encoder-decoder architecture: the encoder processes the input image to extract features, while the decoder upsamples intermediate data to generate the final output. Subsequently, a number of UNet architectural variations were created to address specific issues and enhance brain tumor segmentation and classification performance16.
The 3D UNet, a unique network architecture intended for dense volumetric segmentation with sparse annotations, is described in the study17. The network incorporates rapid elastic deformations for effective data augmentation during training, and it expands the UNet architecture to 3D operations. It uses batch normalization for faster convergence. Promising segmentation performance is demonstrated using Xenopus kidney data18.
Due to its design restrictions, the typical UNet architecture failed to focus on key regions in complex tumor contexts. The Attention UNet introduces attention gates, which selectively emphasize essential elements while suppressing unimportant ones, to overcome this problem. This mechanism improves the model’s focus on important regions, leading to a more precise segmentation of tumor regions and improving diagnosis accuracy and treatment planning19.
Residual UNet uses residual connections or dense blocks to improve gradient flow and avoid vanishing gradients during the training of deep networks. These improvements make training deeper networks easier, improving segmentation quality and learning efficiency. The different UNet variants overcome the problems encountered in the traditional UNet, leading to improved segmentation performance20.
UNet with transfer learning uses pre-trained models such as VGG, ResNet, and DenseNet as the encoder, improving segmentation accuracy by taking advantage of their strong feature extraction capabilities21–25. When medical data is scarce, leveraging pre-trained models can greatly lower training time and boost efficiency. Using these well-established models, the UNet architecture can perform better in medical image segmentation tasks, thereby bridging the gap between the requirement for high-precision segmentation in clinical applications and the limited supply of data.
Transformer models like the Vision Transformer (ViT) are well suited to medical applications because they capture long-range dependencies and represent complex patterns26. ViT has shown outstanding performance in image classification tasks and has been applied to image segmentation challenges. Recent models such as TransUNet and TransBTS are specifically designed for difficult tasks like brain tumor segmentation27,28. These models improve feature extraction and fusion through the combined use of transformers and CNNs, leading to more accurate segmentations. TransUNet couples a transformer with a U-Net, whereas TransBTS pairs a transformer with a 3D CNN to segment brain tumors from 3D MRI data. In TransBTS, the encoder uses a 3D CNN to extract spatial features, and the transformer models the global context; the decoder applies these transformer features to produce a better segmentation map.
UNETR is a hybrid model that uses a transformer encoder and a CNN decoder29. This model performs well for medical applications since it can handle brain tumors of varying size and shape. The Swin-UNet model analyses images in depth and with contextual awareness by combining the advantages of the Swin Transformer and UNet30. It uses the shifted-window attention mechanism of the Swin Transformer in the encoder and keeps the UNet-style decoder for mask creation and upsampling. This makes it possible to identify tumor borders precisely in medical imaging.
A hybrid transformer-enhanced convolutional neural network (TECNN) was proposed for brain tumor classification, combining CNNs for local feature extraction with transformers to capture global context31. A Feature Fusion Module (FFM) and an Intelligent Merge Module (IMM) were employed to bridge the gap between CNN and Transformer representations, enhancing feature integration. The model also utilizes channel-wise attention and adaptive pooling to retain class-specific information, achieving high accuracy on the BraTS 2018 and Figshare datasets.
A novel customized pretrained EfficientNetB7 model was developed for brain tumor classification, with MR images enhanced using the “FastNIMeansDenoisingColored” filter for noise removal. Images were cropped to remove extra boundaries, reducing computational time. Several pre-trained deep learning models–including AlexNet, VGG16/19, ResNet50, InceptionV3, DenseNet121, and EfficientNet variants–were evaluated on a multiclass brain tumor dataset. EfficientNetB7 showed the best initial performance, and after customization and fine-tuning (CPEB7), it achieved high accuracy on the 5th fold of k-fold cross-validation, outperforming existing methods32.
Table 1. Summary of brain tumor segmentation and classification methods in the literature.
Authors | Dataset | Approach | Results / Metrics | Limitations / Future Work |
---|---|---|---|---|
Mehta & Arbel17 | BraTS2018 | 3D UNet | Dice scores: ET: 0.706 WT: 0.871 TC: 0.771 | Need to improve testing accuracy; limited generalizability |
Cicek et al. (2016)18 | Xenopus kidney | 3D UNet from sparse annotation | IoU: 0.863 | Performance may vary with different dataset characteristics; sensitive to annotation quality |
Gitonga et al.19 | BraTS2021 | 3D Attention-based UNet | Dice Coefficient: 0.9864 | Computationally intensive |
Asiri et al. (2023)20 | TCGA-LGG, TCIA MRI | ResNet50 + UNet | IoU: 0.91, DSC: 0.95, SI: 0.95 | Limited to LGG class |
Shedbalkar & Prabhushetty21 | Figshare MRI | UNet + chopped VGGNet | Accuracy: 98.93%, Sensitivity: 0.98, Precision: 0.9833, F1-score: 0.9833 | Limited validation and generalization |
Pravitasari et al.22 | Custom | UNet-VGG16 | Accuracy: 96.1% | Need to explore different architecture |
Kolarik et al.24 | Custom + MICCAI 2016 MRI | 3D Dense-U-Net | SSIM: 0.78547, PSNR: 24.09 dB | Need to explore different datasets |
Chen et al.27 | Synapse multi-organ segmentation dataset | TransUNet | DSC: 77.48, HD: 31.69 | Need to evaluate on different dataset |
Wang et al.28 | BraTS 2019 | TransBTS | Dice scores: ET: 78.92 WT: 90.23 TC: 81.19 | Computationally intensive |
Hatamizadeh et al.29 | BraTS 2021 | Swin UNETR | ET DSC: 0.858, HD: 6.016 WT DSC: 0.926, HD: 5.831 TC DSC: 0.885, HD: 3.770 | High memory usage |
Cao et al.30 | Synapse multi-organ segmentation dataset | Swin-Unet | DSC: 79.13 HD: 21.55 | Pure transformer; still evolving for medical images |
Aloraini et al.31 | BraTS 2018 and Figshare | ViT-CNN | Accuracy: 96.75% (BraTS), 99.10% (Figshare) | Need to explore with lightweight CNN model |
Khushi et al.32 | Multiclass brain tumor Kaggle dataset | EfficientNetB7 | Accuracy: 98.97% | Need to evaluate on real medical image dataset |
Table 1 summarizes various brain tumor segmentation and classification methods, highlighting the datasets used, the approaches taken, the reported performance metrics, and the noted limitations or future work. The evaluation metrics reported include Dice Similarity Coefficient (DSC), Intersection over Union (IoU), Sensitivity Index (SI), Hausdorff Distance (HD), Structural Similarity Index (SSIM), and Peak Signal-to-Noise Ratio (PSNR). While these methods demonstrate promising results, several challenges remain evident from the table. Many approaches suffer from limited generalizability due to dataset-specific training, computational inefficiency, or difficulties in accurately segmenting small or complex tumor subregions. Additionally, some models are restricted to certain tumor types or require extensive computational resources, limiting their clinical applicability. These gaps highlight the need for more robust, efficient, and generalizable models capable of extracting both local and global features effectively in diverse clinical scenarios.
This research introduces a novel unified architecture for both brain tumor segmentation and classification: the Hybrid Vision UNet-Encoder Decoder (HVU-ED) segmenter and its classification counterpart, the Hybrid Vision UNet-Encoder (HVU-E) classifier. The key strength of this approach lies in utilizing the same base encoder architecture for both tasks. For segmentation, the full HVU-ED network comprising both the encoder and decoder is employed, whereas for classification, only the encoder portion is used, followed by dedicated classification layers. Unlike traditional transfer learning methods that depend on pre-trained weights, this method transfers only architectural knowledge from pre-trained models such as ResNet50, DenseNet121, VGG16, and Xception. These hybrid models capture hierarchical features ranging from low-level edges and textures to high-level semantic structures through multiple deep layers. Incorporating pre-trained architectures into UNet enhances feature extraction, resulting in more refined and comprehensive features that improve both segmentation and classification performance. In complex, data-scarce domains like medical imaging, this hybrid approach leverages the strengths of UNet, the Vision Transformer's global feature extraction, and pre-trained model architectures to build a single, versatile network capable of accurately handling both segmentation and classification tasks. The efficacy of this architecture is confirmed by incorporating XAI techniques, which enhance diagnostic reliability and build trust in both the segmentation and classification models. This approach makes the model predictions transparent and allows clinicians to evaluate and interpret the reasoning behind each decision.
Results and discussions
Evaluation metrics
The literature on brain tumor segmentation and classification reports various performance metrics, which are computed from the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts33.
The Dice score measures the overlap between the predicted and ground-truth segmentation masks:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \quad (1)$$

Accuracy measures the model's ability to identify all classes or pixels correctly, regardless of polarity:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$

Precision is the fraction of the model's positive predictions that are actually correct:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (3)$$

Recall is the proportion of ground-truth positives that the model correctly predicts:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (4)$$

Sensitivity is the percentage of actual positive cases correctly detected by the model and is computed in the same way as recall:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \quad (5)$$

Specificity is the ratio of correctly predicted negatives to all actual negatives:

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \quad (6)$$

The F1-score is the harmonic mean of precision and recall, providing a balanced evaluation of a model's performance:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (7)$$
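These definitions translate directly into code. Below is a minimal NumPy sketch (not the authors' implementation) that computes the metrics in Equations (1)-(7) from a pair of flattened binary masks:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Compute Dice, accuracy, precision, recall/sensitivity, specificity and F1
    from two binary masks of the same shape (illustrative helper, not the paper's code)."""
    pred, truth = pred.astype(bool).ravel(), truth.astype(bool).ravel()
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and sensitivity share the same formula
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }
```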
Evaluation and comparison of segmenter
The novel HVU-ED segmenter and HVU-E classifier architectures are proposed to improve the efficiency of brain tumor segmentation and classification, respectively. The BraTS2020 dataset was used to appraise the segmentation performance of the HVU-ED network. The model was trained on 80% of the data, with 10% used for validation, to segment brain tumors from MRI scans, and the trained model was evaluated on the remaining 10% of testing data. Performance metrics, including Dice score, accuracy, precision, recall, sensitivity, and specificity, were computed using Equations (1) to (6), respectively. The training and validation performance of the segmentation model over 50 epochs is visualized in Fig. 1. The training loss steadily decreased, while both validation accuracy and Dice score improved and stabilized after approximately 30 epochs, indicating effective convergence and generalization of the model. The results of the four segmenters, ResVU-ED, VggVU-ED, XceptionVU-ED, and DenseVU-ED, are showcased in Table 3. Notably, the DenseVU-ED model outperformed the others with 98.91% accuracy, 0.98 precision, 0.99 recall, 0.99 sensitivity, and 0.99 specificity. This high performance results from DenseNet's dense connections, which enhance feature reuse and gradient flow, combined with the Vision Transformer's ability to capture global context. Together, these architectural strengths significantly boost the segmentation accuracy of DenseVU-ED.
[See PDF for image]
Fig. 1
Training and validation metrics over 50 epochs. (Left) Loss, (Middle) Accuracy, and (Right) Dice score curves show the convergence behavior of the segmentation model.
Table 2. Training, validation, and test accuracy of HVU-ED segmenter model on BraTS dataset.
Model | Training Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%) |
---|---|---|---|
ResVU-ED | 98.7 | 97.01 | 96.5 |
VggVU-ED | 99.3 | 98.27 | 97.8 |
XceptionVU-ED | 98.7 | 97.01 | 96.5 |
DenseVU-ED | 99.6 | 98.91 | 98.2 |
Table 2 presents the training, validation, and test accuracies of the proposed HVU-ED segmenter variants on the BraTS dataset. Among the models evaluated, DenseVU-ED demonstrated the highest overall performance, achieving 99.6% training accuracy, 98.91% validation accuracy, and 98.2% test accuracy. VggVU-ED also performed well, with a test accuracy of 97.8% and validation accuracy of 98.27%. Both ResVU-ED and XceptionVU-ED showed consistent performance with identical training and validation accuracies, though slightly lower than the Dense- and Vgg-based variants. These results highlight the effectiveness of the hybrid encoder-decoder design and the benefit of integrating pre-trained models for improving generalization in brain tumor segmentation. The accuracy values of these models are plotted in Fig. 2.
Additionally, Table 3 showcases the results of these four segmenters in terms of precision, recall, sensitivity, and specificity. Notably, DenseVU-ED outperformed the others with 0.98 precision, 0.99 recall, 0.99 sensitivity, and 0.99 specificity.
Table 3. Results of the proposed HVU-ED segmenter model.
Model | Precision | Recall | Sensitivity | Specificity |
---|---|---|---|---|
ResVU-ED | 0.97 | 0.98 | 0.97 | 0.99 |
VggVU-ED | 0.98 | 0.98 | 0.98 | 0.99 |
XceptionVU-ED | 0.97 | 0.98 | 0.97 | 0.97 |
DenseVU-ED | 0.98 | 0.99 | 0.99 | 0.99 |
[See PDF for image]
Fig. 2
Graphical representation of accuracy values for the HVU-ED Segmenter.
The proposed architecture's superiority was highlighted by comparing Dice scores for the Enhanced Tumor (ET), Core Tumor (CT), and Whole Tumor (WT) regions with existing brain tumor segmentation methods on the BraTS dataset. As shown in Table 4, the HVU-ED segmenter yielded higher Dice scores than state-of-the-art methods, with scores of 0.902 for ET, 0.954 for CT, and 0.966 for WT.
[See PDF for image]
Fig. 3
Grad-Cam visualization of the proposed HVU-ED Segmenter.
Table 4. Comparative analysis of HVU-ED segmenter model using dice score with literature models.
Model | Enhanced Tumor | Core Tumor | Whole Tumor |
---|---|---|---|
VGG-UNet34 | 0.818 | 0.864 | 0.887 |
UNet with Dense & Resnet35 | 0.766 | 0.815 | 0.901 |
TransBTS28 | 0.7857 | 0.817 | 0.9 |
HUT36 | 0.783 | 0.836 | 0.9 |
Swin UNet3D37 | 0.834 | 0.866 | 0.905 |
Swin UNetR38 | 0.853 | 0.876 | 0.927 |
DenseNet121-UNet39 | 0.892 | 0.943 | 0.959 |
HVU-ED | 0.902 | 0.954 | 0.966 |
Grad-CAM visualization of the proposed HVU-ED segmenter model
Grad-CAM (Gradient-weighted Class Activation Mapping) is a visualization method used to interpret the decisions of convolutional neural networks (CNNs)40. It emphasizes the most important regions of an image, providing insight into the model's predictions, and is used to verify that the model focuses on the correct part of the image when making decisions. Fig. 3 shows a layer-wise interpretation of the HVU-ED segmenter model for a test image. Grad-CAM is used to highlight attention maps across various layers, verifying that the model concentrates on the relevant brain regions during the segmentation process. By utilizing the gradients of a target region with respect to different convolutional layers, Grad-CAM produces a coarse localization heatmap that highlights the areas of the test image most relevant to segmenting the tumor.
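For reference, the Grad-CAM computation can be sketched in a few lines of TensorFlow/Keras. The snippet below is illustrative only; the layer name and the choice of target score are assumptions, not the exact HVU-ED configuration:

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name):
    """Coarse Grad-CAM heatmap for a single (H, W, C) image."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = tf.reduce_sum(preds)  # assumed target score (e.g. summed tumor probability)
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))             # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)      # ReLU + normalise to [0, 1]
    return cam.numpy()                                       # upsample and overlay for display
```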
Evaluation and comparison of classifier
The extracted features from the HVU-E classifier model were used to classify MRI scans into glioma, meningioma, and pituitary tumor images. Metrics such as accuracy, precision, recall, sensitivity, specificity, and F1-score were used to validate the efficiency of the trained model using Equations (2) to (7). Table 5 provides the training, validation, and test accuracies for the HVU-E classifier variants evaluated on the Figshare brain tumor dataset. Among all models, the DenseVU-E classifier achieved the highest validation and test accuracy, with 99.34% training accuracy, 99.18% validation accuracy, and 98.7% test accuracy, demonstrating strong generalization and robustness. VggVU-E also exhibited competitive performance, attaining 99.4% training, 98.01% validation, and 97.8% test accuracy. The ResVU-E and XceptionVU-E classifiers followed with slightly lower but still consistent accuracies. In addition to accuracy, the DenseVU-E classifier also achieved a precision of 0.99, recall of 0.98, sensitivity of 0.99, specificity of 0.99, and an F1-score of 0.98, as shown in Table 6. These results further confirm the advantage of integrating dense connectivity into the HVU-E framework, yielding improved classification performance in brain tumor analysis tasks. The classification performance of the proposed model across the three tumor types, Glioma, Meningioma, and Pituitary, is illustrated in Fig. 4. The confusion matrix indicates high classification accuracy with only minimal misclassifications among the Glioma, Meningioma, and Pituitary classes.
[See PDF for image]
Fig. 4
Confusion matrix showing classification results for Glioma, Meningioma, and Pituitary tumors.
[See PDF for image]
Fig. 5
Graphical plot of accuracy for the HVU-E Classifier.
[See PDF for image]
Fig. 6
Graphical plot of accuracy for the HVU-E Classifier with ML.
Table 5. Training, validation, and test accuracy of HVU-E classifier on brain tumor dataset.
Model | Training Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%) |
---|---|---|---|
ResVU-E | 99.1 | 98.04 | 97.6 |
VggVU-E | 99.4 | 98.01 | 97.8 |
XceptionVU-E | 99.5 | 98.30 | 98.0 |
DenseVU-E | 99.34 | 99.18 | 98.7 |
Table 6. Classification results of the HVU-E classifier with classification layer.
Model | Precision | Recall | Sensitivity | Specificity | F1 Score |
---|---|---|---|---|---|
ResVU-E | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 |
VggVU-E | 0.99 | 0.98 | 0.98 | 0.99 | 0.98 |
XceptionVU-E | 0.99 | 0.98 | 0.98 | 0.99 | 0.98 |
DenseVU-E | 0.99 | 0.98 | 0.99 | 0.99 | 0.98 |
Table 7. Classification results of the proposed HVU-E classifier with machine learning algorithms.
Model | SVM | RF | DT | Ada Boost | LR |
---|---|---|---|---|---|
ResVU-E | 90.71 | 80.11 | 78.04 | 76.04 | 75.35 |
VggVU-E | 91.81 | 80.07 | 79.42 | 76.17 | 74.72 |
XceptionVU-E | 81.91 | 84.34 | 76.79 | 75.85 | 72.21 |
DenseVU-E | 92.21 | 82.72 | 79.40 | 76.52 | 79.15 |
In the brain tumor classification process, the HVU-E classifier was paired with machine learning algorithms such as SVM, RF, DT, LR, and AdaBoost. As summarized in Table 7, the ResVU-E, VggVU-E, and DenseVU-E configurations achieved their highest classification accuracy when paired with the SVM algorithm, reaching 90.71%, 91.81%, and 92.21%, respectively. Notably, the XceptionVU-E classifier demonstrated competitive performance with the RF method, achieving an accuracy of 84.34%. SVM consistently outperformed the other machine learning classifiers across all hybrid feature variants. While models like LR and DT offer faster computation, they exhibited lower accuracy. Overall, SVM achieved the most favorable trade-off between accuracy and computational efficiency, making it the most effective machine learning classifier within the HVU-E framework. The DenseVU-E classifier achieves higher accuracy with neural network classification (99.18%) than with SVM (92.21%), indicating that neural networks can better capture complex, non-linear patterns in the hybrid features. However, SVM offers a good balance of accuracy and computational efficiency, making it a strong alternative when resources are limited. This comparison highlights that while neural networks may provide superior accuracy, SVM remains a practical choice for faster and more efficient classification.
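A hedged sketch of this comparison is shown below, assuming the flattened HVU-E bottleneck features have already been extracted into `X_train`/`X_test` with labels `y_train`/`y_test` (names are illustrative):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_train, X_test: flattened HVU-E features; y_train, y_test: tumor type labels (assumed arrays)
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100),
    "DT": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                                  # train on the extracted features
    print(name, accuracy_score(y_test, clf.predict(X_test)))   # compare test accuracy per model
```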
The performance of the HVU-E brain tumor classifier is contrasted against techniques from the literature to show its superiority. As shown in Table 8, the proposed approach surpasses the existing models, yielding an accuracy of 99.18%. The graphical plot of the accuracy values is shown in Fig. 6. From this study, it is observed that the DenseVU architecture performed better than the other architectures because it strengthens feature propagation, encourages feature reuse, alleviates the vanishing gradient problem, and substantially reduces the number of parameters.
Table 8. Comparative analysis of HVU-E classifier model.
Model | Accuracy (%) |
---|---|
Deep-ViT23 | 91.60 |
FT-VIT41 | 98.13 |
R50-ViT-l1642 | 90.31 |
ViT-b3242 | 98.24 |
Randomized ViT43 | 98.86 |
Ensemble ViT44 | 98.70 |
HVU-E | 99.18 |
Explainability for the proposed HVU-E classifier
LIME45 and SHAP46 techniques are used to demonstrate the effectiveness of the proposed classifier model. The LIME method first identifies and segments the regions that influence the model to assign the input image to a particular category. Regions of similar pixels are grouped into superpixels, which mostly cover the tumor area and contribute positively to the model's classification. Superpixels with a positive contribution are highlighted in green, while irrelevant areas that contribute negatively are displayed in red. The LIME interpretation for a pituitary tumor image is shown in Fig. 7. This visual explanation supports the performance of the proposed classifier model.
[See PDF for image]
Fig. 7
Lime Interpretation of a test image a. A sample image with pituitary tumor b. Superpixel segmentation c. Final perturbed image showing positive contribution in green and negative in red.
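The LIME workflow described above can be sketched with the `lime` package as follows. The prediction wrapper and image variables are assumptions for illustration, not the authors' exact code:

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

# image: RGB test image (H, W, 3); classifier.predict is assumed to return class probabilities
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image.astype("double"),
    classifier_fn=lambda batch: classifier.predict(np.array(batch)),
    top_labels=1, num_samples=1000)
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=False, num_features=10, hide_rest=False)
overlay = mark_boundaries(temp / 255.0, mask)  # superpixel overlay (assumes 0-255 input range)
```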
The SHAP method computes and assigns Shapley values to every pixel of the image according to its contribution to the model's prediction. These values bring fairness and clarity to the interpretation of complex models. They can be either positive or negative, indicating how strongly each pixel supports or detracts from a specific class prediction. The Shapley values of the proposed classifier model for brain tumor classification are shown in Fig. 8. In the SHAP visualizations, red pixels indicate a high probability for the predicted class (positive contribution), while blue pixels indicate a low probability (negative contribution). The output of the classifier's partition explainer is displayed in Fig. 9. It clearly shows that the proposed model correctly classifies the first test image as a pituitary tumor and the second as a meningioma, providing an understandable breakdown of the decision-making process.
[See PDF for image]
Fig. 8
SHAP visualization for the sample images using HVU-E Classifier.
[See PDF for image]
Fig. 9
Shapley values for the sample images.
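A minimal sketch of such a SHAP partition explanation is given below; the masker blur size, evaluation budget, and model wrapper are illustrative assumptions rather than the paper's exact settings:

```python
import shap

class_names = ["glioma", "meningioma", "pituitary"]
# images: batch of test images shaped (N, 256, 256, 3); model.predict returns class probabilities
masker = shap.maskers.Image("blur(128,128)", images[0].shape)   # partition masker over image regions
explainer = shap.Explainer(lambda x: model.predict(x), masker, output_names=class_names)
shap_values = explainer(images[:2], max_evals=500, batch_size=32,
                        outputs=shap.Explanation.argsort.flip[:2])
shap.image_plot(shap_values)  # red pixels push towards the class, blue pixels push away
```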
Methods
Dataset description
The segmentation and classification tasks in the study are conducted using the BraTS2020 and Figshare datasets. Multimodal Brain Tumor Segmentation, or BraTS2020, is a comprehensive medical imaging data repository which is widely used for brain tumor segmentation47. Gliomas, the most common kind of brain tumor, are the subject of this dataset. It includes sequences from T1-weighted, T2-weighted, T1-weighted with contrast enhancement (T1-CE), and fluid-attenuated inversion recovery (FLAIR) pre-operative MRI scans. Precisely dividing and categorizing gliomas is essential for efficient treatment planning. Expert annotations of tumor components, including necrotic core, enhancing tumor, and peritumoral edema, are available in the BraTS dataset. The BraTS 2020 training dataset, which comprises 369 labeled MRI cases, was further split into 80% for training, 10% for validation, and 10% for testing, as shown in Table 9.
Table 9. Distribution of BraTS 2020 training data.
Dataset Split | Number of Cases |
---|---|
Training (80%) | 295 |
Validation (10%) | 37 |
Testing (10%) | 37 |
Total | 369 |
In addition, the Figshare brain tumor dataset includes 3064 T1-weighted contrast-enhanced MRI scans from 233 individuals with three types of brain tumors: 1426 glioma slices, 708 meningioma slices, and 930 pituitary tumor slices. This dataset is widely used for brain tumor classification due to its accessibility and availability48. The Figshare dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively, as detailed in Table 10.
Table 10. Distribution of the figshare brain tumor dataset.
Tumor Type | Total Images | Training (80%) | Validation (10%) | Testing (10%) |
---|---|---|---|---|
Glioma | 1426 | 1141 | 143 | 142 |
Meningioma | 708 | 566 | 71 | 71 |
Pituitary | 930 | 744 | 93 | 93 |
Total | 3064 | 2451 | 307 | 306 |
The BraTS images in NIfTI format were loaded using the NiBabel Python package, which supports various neuroimaging file formats, and converted into 2D arrays using the NumPy package. The TensorFlow, Keras, and scikit-learn Python packages were employed to build and train the models. The BraTS and Figshare images were resized to 256×256 pixels to serve as input to the proposed HVU-ED and HVU-E architectures. They then underwent several transformations, such as rotation, scaling, and flipping, to create new samples as part of the data augmentation process. The datasets were finally split into training, validation, and testing sets.
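A condensed sketch of this preprocessing pipeline is shown below; the slice selection and intensity normalization are illustrative assumptions rather than the exact preprocessing used in the paper:

```python
import numpy as np
import nibabel as nib
import tensorflow as tf

def load_brats_slice(nifti_path, slice_idx):
    """Load one NIfTI volume with NiBabel, take an axial slice and resize it to 256x256."""
    volume = nib.load(nifti_path).get_fdata()                      # 3D MRI volume
    img = volume[:, :, slice_idx].astype("float32")
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)       # min-max normalisation (assumed)
    return tf.image.resize(img[..., np.newaxis], (256, 256)).numpy()

# Simple augmentation pipeline used to create additional training samples
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),   # small random rotations
    tf.keras.layers.RandomZoom(0.1),       # random scaling
])
```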
HVU-architecture
The Hybrid Vision U-Net (HVU) architecture is a unified deep learning framework that integrates the strengths of pre-trained Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) within the U-Net structure to address brain tumor segmentation and classification. By fusing local feature extraction from CNNs with the global context modeling of ViTs, HVU effectively captures both fine-grained and holistic information essential for accurate medical image analysis. Four HVU model variants, ResVU-ED, VggVU-ED, XceptionVU-ED, and DenseVU-ED, are constructed by combining ViT modules with ResNet50, VGG16, Xception, and DenseNet121, respectively. These specific pre-trained CNNs were chosen for their complementary design principles and strong track record in medical imaging.
The ResVU-ED design combines the ResNet model with the ViT and UNet to maximize its capabilities. ResNet, known for its residual blocks and ability to address vanishing gradients, has been widely recognized in deep learning. In this architecture, the first 16 layers of ResNet50 were employed as a feature extractor to capture the contextual information necessary for precise pixel-wise segmentation and classification, resulting in 41,743,876 parameters.
The VggVU-ED segmenter leverages the potential of the VGG16 model with the ViT and UNet. VGG16, renowned for its deep and robust architecture, effectively extracts high-quality image features. The features extracted from its first 17 layers were concatenated with the UNet backbone for the segmentation and classification process. This model comprises 52,134,596 parameters.
The XceptionVU-ED model merges the Xception layers with the ViT in the bottleneck of the UNet and comprises 48,514,732 parameters. Xception, an enhanced version of the Inception model, utilizes depthwise separable convolutions and residual connections to extract intricate patterns and semantic details. Combined with U-Net, the features extracted from its first 60 layers significantly enhance segmentation.
The DenseVU-ED architecture combines the DenseNet and ViT features at the bottleneck of the UNet model. DenseNet uses a feed-forward mechanism to connect each layer to the next and exploits dense connectivity to enable feature reuse and efficient learning. Combined with U-Net, it enhances image segmentation capabilities through effective feature collection and learning. The first 5 layers of DenseNet121 were used to extract features from the input images. The total number of parameters in this architecture is 36,957,380. The trainable and non-trainable parameters of the proposed architectures are shown in Table 11.
Table 11. Trainable and non-trainable parameters of the proposed HVU models.
Model | Layers in Hybrid models | Trainable Parameters | Non-Trainable Parameters | Total Parameters |
---|---|---|---|---|
ResVU-ED | 16 | 41,703,556 | 40,320 | 41,743,876 |
VggVU-ED | 17 | 52,122,820 | 11,776 | 52,134,596 |
XceptionVU-ED | 60 | 48,457,644 | 57,088 | 48,514,732 |
DenseVU-ED | 5 | 36,892,164 | 65,216 | 36,957,380 |
Integration of vision transformer
Inspired by the success of Transformer models in natural language processing, Vision Transformers are a class of deep learning models for computer vision applications. In the proposed HVU architecture, the ViT is incorporated to enhance the global representation capability, complementing the local feature extraction strengths of convolutional neural networks (CNNs). As illustrated in Fig. 10, the ViT architecture leverages self-attention mechanisms to model relationships between different regions of an image. The input image is divided into fixed-size patches of 16×16 pixels; a stride of 16 is applied to avoid overlapping, ensuring non-redundant spatial coverage. Each image patch is then flattened into a 1D vector and linearly projected into a lower-dimensional embedding space through a learned projection, achieved via matrix multiplication with a trainable weight matrix and bias addition during training. To retain spatial context, positional embeddings are added to each token, allowing the model to differentiate among patch positions. The self-attention mechanism then enables each patch to gather contextual information from all other patches, learning the long-range dependencies critical for identifying tumors with variable shapes and locations. A feed-forward neural network (FFN) further models complex non-linear interactions between patches. The output from the ViT is integrated with CNN-based encoder features at the bottleneck of the U-Net structure. This fusion produces a hybrid feature representation that combines detailed local cues with global semantic context, enhancing the model's ability to accurately segment tumors with irregular boundaries and to support classification tasks. The final classification head converts these fused embeddings into output predictions.
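The patch embedding and a single transformer block described above can be sketched in Keras as follows. The embedding dimension and head count are illustrative placeholders, not the paper's exact ViT configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ViTBranch(layers.Layer):
    """Patch embedding + one transformer block, applied to a 256x256x3 input (illustrative)."""

    def __init__(self, patch_size=16, embed_dim=64, num_heads=4):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_patches = (256 // patch_size) ** 2
        # Strided convolution = non-overlapping 16x16 patch extraction + linear projection
        self.project = layers.Conv2D(embed_dim, patch_size, strides=patch_size)
        self.pos_embed = layers.Embedding(self.num_patches, embed_dim)
        self.norm1 = layers.LayerNormalization()
        self.attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.norm2 = layers.LayerNormalization()
        self.ffn = tf.keras.Sequential([
            layers.Dense(embed_dim * 2, activation="gelu"),
            layers.Dense(embed_dim),
        ])

    def call(self, images):
        patches = self.project(images)                                   # (B, 16, 16, D)
        tokens = tf.reshape(patches, (-1, self.num_patches, self.embed_dim))
        tokens = tokens + self.pos_embed(tf.range(self.num_patches))     # add positional embeddings
        normed = self.norm1(tokens)
        x = tokens + self.attn(normed, normed)                           # self-attention + residual
        x = x + self.ffn(self.norm2(x))                                  # feed-forward + residual
        return tf.reshape(x, (-1, 16, 16, self.embed_dim))               # spatial map for fusion
```

Calling `ViTBranch()` on a batch of 256×256×3 images returns a 16×16 feature map that can be concatenated with the CNN and U-Net encoder outputs at the bottleneck.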
Feature integration with UNet
The backbone of the proposed Hybrid Vision UNet (HVU) framework is based on the U-Net encoder-decoder architecture. The encoder compresses input images through a series of convolutional and max-pooling layers, capturing detailed spatial and contextual information at multiple levels, while the decoder restores the spatial resolution by upsampling the compressed features and integrates them with the corresponding encoder features via skip connections. These skip connections help preserve fine-grained details that are critical for accurate localization of anatomical structures. The architecture features two primary integration mechanisms: skip connections that join corresponding layers in the encoder and decoder paths, and a central bottleneck layer that facilitates the transition between the deepest encoding stage and the beginning of decoding. This bottleneck acts as the fusion point where features from various sources are combined to enrich the overall representation.
In the HVU architecture, the feature maps generated by the pre-trained convolutional neural networks (ResNet50, DenseNet121, VGG16, and Xception) are integrated with the global representations learned by the Vision Transformer (ViT). These hybrid features are merged at the bottleneck layer of the U-Net. The fused representation enhances semantic richness and localization, which is particularly beneficial when segmenting tumors with irregular or ambiguous boundaries.
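A hedged sketch of this bottleneck fusion is given below; the three branch tensors are assumed to have already been brought to a common 16×16 spatial size, and the variable names are illustrative:

```python
from tensorflow.keras import layers

# unet_bottleneck: 16x16x1024 features from the deepest U-Net encoder block (assumed tensor)
# cnn_features:    16x16 feature map from the pre-trained CNN branch, e.g. DenseNet121 (assumed)
# vit_features:    16x16 feature map returned by the ViT branch (assumed)
fused = layers.Concatenate(axis=-1)([unet_bottleneck, cnn_features, vit_features])
fused = layers.Conv2D(1024, 3, padding="same", activation="relu")(fused)  # mix local and global cues
# `fused` then feeds the HVU-ED decoder for segmentation, or is flattened for the HVU-E classifier
```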
For classification, the encoder-generated fused features are fed into the HVU-E classification module, which supports either a fully connected neural network or traditional machine learning classifiers such as Support Vector Machines (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression, and AdaBoost. These machine learning approaches were utilized to assess the robustness of the extracted features and to establish performance baselines. Machine learning classifiers offer several advantages, including lower computational complexity, faster training times, and better generalization in scenarios with limited data. Moreover, their interpretability is beneficial in clinical applications where transparency and trust are critical. By leveraging a shared feature representation for both segmentation and classification, the framework enables joint learning, enhances computational efficiency, and improves generalization across tasks.
[See PDF for image]
Fig. 10
Architecture of vision transformer.
[See PDF for image]
Fig. 11
Architecture of the proposed HVU-ED segmenter.
The fusion of CNN and ViT features is performed at the bottleneck of the U-Net encoder-decoder structure. This fusion ensures that both local and global representations are retained before the upsampling path, thereby enriching the semantic content during feature reconstruction. As a result, the decoder is able to generate more accurate and detailed segmentation outputs, especially in cases where tumor boundaries are diffuse or irregular.
For the classification task, the same fused features extracted from the encoder are forwarded to the HVU-E classification branch. This branch includes either a fully connected neural network layer or traditional machine learning classifiers such as Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression, and AdaBoost. This shared feature space across segmentation and classification facilitates efficient multi-task learning and enhances overall model performance.
HVU-ED segmenter
The preprocessed BraTS images, with a size of 256×256, were used as input to the HVU-ED segmenter architecture, as shown in Fig. 11. The UNet encoder consists of several 3×3 convolutional layers with ReLU activation. Max pooling with a stride of two downsamples the features to reduce the spatial dimensions, and after each downsampling the number of channels is doubled to compensate for the reduced spatial resolution. The model summary in Table 12 lists the input size, filter size, number of filters, activation function, and output size of each layer. The feature maps from the U-Net encoder, the transfer learning models, and the ViT are concatenated at the bottleneck of the HVU-ED architecture, all sharing a common spatial size of 16×16. The concatenated feature map is then passed to the subsequent layers of the HVU-ED decoder path for further processing and segmentation. The decoder works in the opposite direction to the encoder, upsampling the features to restore spatial resolution and reconstructing the segmented output from the comprehensive feature representation obtained from the concatenated encoder outputs.
The HVU-ED segmenter model was trained using the Adam optimizer with a learning rate of 0.001 for 50 epochs. A batch size of 1 was chosen to ensure good generalization across the training and testing images. The parameters used for training the segmenter model are listed in Table 13. The segmented images produced by this architecture are shown in Fig. 12. Performance evaluation of the proposed segmentation architecture is based on the Dice score, accuracy, precision, recall, sensitivity, and specificity metrics.
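The training configuration in Table 13 corresponds to a few lines of Keras code. This is a sketch under the assumption that the compiled HVU-ED model and tf.data pipelines (`train_dataset`, `val_dataset`, yielding image-mask pairs with batch size 1) already exist:

```python
import tensorflow as tf

# model: the HVU-ED segmenter built earlier (assumed Keras Model)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"])            # Dice and the other metrics are computed at evaluation time

history = model.fit(
    train_dataset,                   # assumed tf.data.Dataset of (image, mask) pairs, batch size 1
    validation_data=val_dataset,
    epochs=50)
```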
Table 12. Summary of the HVU-ED segmenter model.
Blocks | Layers | Input Size | Filter Size | No. of Filters | Activation Function | Output Size |
---|---|---|---|---|---|---|
Input | - | 256×256×1 | - | - | - | 256×256×1 |
Encoder Block-1 | Conv1 | 256×256×1 | 3×3 | 64 | ReLU | 256×256×64 |
| Conv2 | 256×256×64 | 3×3 | 64 | ReLU | 256×256×64 |
| MaxPooling | 256×256×64 | 2×2 | - | - | 128×128×64 |
Encoder Block-2 | Conv1 | 128×128×64 | 3×3 | 128 | ReLU | 128×128×128 |
| Conv2 | 128×128×128 | 3×3 | 128 | ReLU | 128×128×128 |
| MaxPooling | 128×128×128 | 2×2 | - | - | 64×64×128 |
Encoder Block-3 | Conv1 | 64×64×128 | 3×3 | 256 | ReLU | 64×64×256 |
| Conv2 | 64×64×256 | 3×3 | 256 | ReLU | 64×64×256 |
| MaxPooling | 64×64×256 | 2×2 | - | - | 32×32×256 |
Encoder Block-4 | Conv1 | 32×32×256 | 3×3 | 512 | ReLU | 32×32×512 |
| Conv2 | 32×32×512 | 3×3 | 512 | ReLU | 32×32×512 |
| MaxPooling | 32×32×512 | 2×2 | - | - | 16×16×512 |
Encoder Block-5 | Conv1 | 16×16×512 | 3×3 | 1024 | ReLU | 16×16×1024 |
| Conv2 | 16×16×1024 | 3×3 | 1024 | ReLU | 16×16×1024 |
Bottleneck Block | ViT | 256×256×3 | - | - | GeLU | 16×16×3 |
| Hybrid Model | 256×256×3 | - | - | ReLU | 16×16×3 |
| UNet | 16×16×1024 | - | - | ReLU | 16×16×1024 |
| Concatenate | 16×16×1024 | - | - | - | 32×32×1024 |
Decoder Block-1 | Conv1 | 32×32×1024 | 3×3 | 512 | ReLU | 32×32×512 |
| Conv2 | 32×32×512 | 3×3 | 512 | ReLU | 32×32×512 |
| Concatenate | 32×32×512 | - | - | - | 64×64×512 |
Decoder Block-2 | Conv1 | 64×64×512 | 3×3 | 256 | ReLU | 64×64×256 |
| Conv2 | 64×64×256 | 3×3 | 256 | ReLU | 64×64×256 |
| Concatenate | 64×64×256 | - | - | - | 128×128×256 |
Decoder Block-3 | Conv1 | 128×128×512 | 3×3 | 128 | ReLU | 128×128×128 |
| Conv2 | 128×128×128 | 3×3 | 128 | ReLU | 128×128×128 |
| Concatenate | 128×128×128 | - | - | - | 256×256×128 |
Decoder Block-4 | Conv1 | 256×256×128 | 3×3 | 64 | ReLU | 256×256×64 |
| Conv2 | 256×256×64 | 3×3 | 64 | ReLU | 256×256×64 |
Output | Conv3 | 256×256×64 | 1×1 | - | Softmax | 256×256×1 |
[See PDF for image]
Fig. 12
The segmented image from the four segmenters ResVU-ED, VggVU-ED, XceptionVU-ED and DenseVU-ED are shown in first, second, third and fourth row respectively.
Table 13. Training parameters for the HVU-ED segmenter.
Parameters | Value |
---|---|
Input Size | 256×256 |
Convolution Kernel size | 3×3 |
Max pool size | 2×2 |
Stride | 2 |
Learning rate | 0.001 |
No. of epochs | 50 |
Batch size | 1 |
Optimizer | Adam |
Activation Function | Softmax |
Loss | Categorical cross-entropy |
Table 14. Training parameters for the HVU-E classifier.
Parameters | Value |
---|---|
Input Size | 256×256×3 |
Convolution Kernel size | 3×3 |
Max pool size | 2×2 |
Stride | 2 |
Learning rate | 0.001 |
No. of epochs | 50 |
Batch size | 32 |
Optimizer | Adam |
Activation Function | Softmax |
Loss | Categorical cross-entropy |
Table 15. Summary of the HVU-E classifier model.
Blocks | Layers | Input Size | Filter Size | No. of Filters | Activation Function | Output Size |
---|---|---|---|---|---|---|
Input | - | 256×256×1 | - | - | - | 256×256×1 |
Encoder Block-1 | Conv1 | 256×256×1 | 3×3 | 64 | ReLU | 256×256×64 |
| Conv2 | 256×256×64 | 3×3 | 64 | ReLU | 256×256×64 |
| MaxPooling | 256×256×64 | 2×2 | - | - | 128×128×64 |
Encoder Block-2 | Conv1 | 128×128×64 | 3×3 | 128 | ReLU | 128×128×128 |
| Conv2 | 128×128×128 | 3×3 | 128 | ReLU | 128×128×128 |
| MaxPooling | 128×128×128 | 2×2 | - | - | 64×64×128 |
Encoder Block-3 | Conv1 | 64×64×128 | 3×3 | 256 | ReLU | 64×64×256 |
| Conv2 | 64×64×256 | 3×3 | 256 | ReLU | 64×64×256 |
| MaxPooling | 64×64×256 | 2×2 | - | - | 32×32×256 |
Encoder Block-4 | Conv1 | 32×32×256 | 3×3 | 512 | ReLU | 32×32×512 |
| Conv2 | 32×32×512 | 3×3 | 512 | ReLU | 32×32×512 |
| MaxPooling | 32×32×512 | 2×2 | - | - | 16×16×512 |
Encoder Block-5 | Conv1 | 16×16×512 | 3×3 | 1024 | ReLU | 16×16×1024 |
| Conv2 | 16×16×1024 | 3×3 | 1024 | ReLU | 16×16×1024 |
Bottleneck Block | ViT | 256×256×3 | - | - | ReLU | 16×16×3 |
| Hybrid Model | 256×256×3 | - | - | ReLU | 16×16×3 |
| UNet | 16×16×1024 | - | - | ReLU | 16×16×1024 |
| Concatenate | 16×16×1024 | - | - | - | 32×32×1024 |
Dense Block | Flatten | 16×16×1024 | - | - | - | 262144×1 |
| Dense-1 | 262144×1 | - | 128 | ReLU | 128×1 |
| Dense-2 | 128×1 | - | 64 | ReLU | 64×1 |
Output | Dense-3 | 64×1 | - | - | Softmax | 3 |
[See PDF for image]
Fig. 13
The proposed HVU-E classifier architecture.
HVU-E classifier
The HVU-ED segmenter architecture is designed for image segmentation, but it can also be repurposed for classification with some modifications: the decoder is replaced by flattened dense layers, and a classification layer with a softmax activation function is appended. Figshare images were used as input for the HVU-E classifier, as shown in Fig. 13. The preprocessed images were fed into the hybrid ViT and UNet models. The U-Net encoder gathers structural and local features at different levels of abstraction, which are crucial for distinguishing between classes. In addition to the features from the UNet, the bottleneck layer captures global and complex features from the ViT and the transfer learning models. For the classification task, the combined features from the three models were flattened and fed into dense layers with ReLU activation. The model parameters, such as input size, filter size, number of filters, activation function, and output size, are listed in Table 15. This classification model was fine-tuned using the Adam optimizer with a learning rate of 0.001 for 50 epochs, employing the softmax activation function. A batch size of 32 was used for the training and validation datasets, along with the categorical cross-entropy loss function. Table 14 lists the training parameters of this model. The brain tumor classification results were evaluated using the accuracy, F1-score, precision, recall, sensitivity, and specificity metrics. Similarly, the flattened features from the HVU-E classifier were used as input to machine learning algorithms such as SVM, RF, DT, LR, and AdaBoost to classify brain tumor images as glioma, meningioma, or pituitary.
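A minimal sketch of the dense classification head described above, attached to the fused bottleneck features, is shown below; `fused_features` and `encoder_inputs` are assumed placeholders for the encoder output tensor and the model input, not names from the paper:

```python
from tensorflow.keras import layers, models

# fused_features: 16x16x1024 fused bottleneck tensor from the HVU encoder (assumed)
x = layers.Flatten()(fused_features)                  # 16*16*1024 -> 262144 values
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(3, activation="softmax")(x)    # glioma, meningioma, pituitary

classifier = models.Model(inputs=encoder_inputs, outputs=outputs)   # encoder_inputs: assumed input tensor
classifier.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```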
Conclusion
This study introduced HVU-ED for segmentation and HVU-E for classification, two novel hybrid models that combine the strengths of vision transformers, pre-trained encoders, and the U-Net architecture for brain tumor analysis. This hybrid strategy improves overall segmentation performance by combining global and local feature extraction mechanisms. The HVU-ED segmenter achieved a segmentation accuracy of 98.91%, with Dice scores of 0.902 (enhancing tumor), 0.954 (tumor core), and 0.966 (whole tumor). Building on this, the HVU-E classifier demonstrated strong generalization, achieving a classification accuracy of 99.18% with a dense output layer and 92.21% using an SVM. Additionally, explainable AI (XAI) techniques were employed to validate and visualize the model's decision-making process, reinforcing its clinical interpretability. Although many related architectures have been reported in the literature, the improved performance of the proposed model demonstrates its merit. The proposed model can be adapted to a wide range of medical image segmentation and classification tasks in the future, and this versatility allows the network to be fine-tuned efficiently for various medical imaging applications.
Limitations and future work
The proposed HVU-ED and HVU-E models demonstrated strong performance but were each evaluated on a single dataset (BraTS2020 for segmentation and Figshare for classification), which may limit their generalizability in broader clinical applications. Future work will focus on extending validation across diverse datasets, integrating clinical metadata, and improving model efficiency for deployment in real-time healthcare environments.
Acknowledgements
This research project is funded by the Core Research Grant (CRG/2022/008050) of the Department of Science & Technology (SERB). The authors gratefully thank the funding agency for its support of this research work.
Author contributions
M.R conceptualized the study and prepared the original draft, while K.N provided key guidance in refining the research idea. K.R contributed through manuscript revision and editing, and Dr. N.R oversaw the research process, offering essential direction throughout.
Data availability
The segmentation and classification tasks in this study were conducted using the BraTS2020 and Figshare brain tumor datasets. The datasets are publicly available and can be accessed from the following sources: BraTS2020: https://www.kaggle.com/datasets/awsaf49/brats20-dataset-training-validation. Figshare: https://figshare.com/articles/dataset/brain_tumor_dataset/1512427.
Code availability
The source code used for the design and implementation of the HVU-ED and HVU-E models is available at: https://doi.org/10.5281/zenodo.15771335.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Biratu, ES; Schwenker, F; Ayano, YM; Debelee, TG. A Survey of Brain Tumor Segmentation and Classification Algorithms. J. Imaging; 2021; 7, 179. [DOI: https://dx.doi.org/10.3390/jimaging7090179] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34564105][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8465364]
2. National Cancer Institute. Brain and other nervous system cancer. https://seer.cancer.gov/statfacts/html/brain.html (2023). Accessed: 2023-02-21.
3. Jayade, S., Ingole, D. T. & Ingole, M. D. Review of Brain Tumor Detection Concept using MRI Images. In 2019 International Conference on Innovative Trends and Advances in Engineering and Technology (ICITAET), 206–209, https://doi.org/10.1109/ICITAET47105.2019.9170144 (2019).
4. Louis, DN et al. The 2021 WHO Classification of Tumors of the Central Nervous System: a summary. Neuro Oncol.; 2021; 23, pp. 1231-1251. [DOI: https://dx.doi.org/10.1093/neuonc/noab106] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34185076][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8328013]
5. Agrawal, P; Katal, N; Hooda, N. Segmentation and classification of brain tumor using 3D-UNet deep neural networks. Int. J. Cogn. Comput. Eng.; 2022; 3, pp. 199-210. [DOI: https://dx.doi.org/10.1016/j.ijcce.2022.11.001]
6. Asiri, A. A. et al. Block-wise neural network for brain tumor identification in magnetic resonance images. Comput. Mater. Contin.73 (2022).
7. Lundervold, AS; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Zeitschrift Fur Medizinische Physik; 2019; 29, pp. 102-127. [DOI: https://dx.doi.org/10.1016/j.zemedi.2018.11.002] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30553609]
8. Haq, E. U., Jianjun, H., Li, K., Ulhaq, H. & Zhang, T. An MRI-based deep learning approach for efficient classification of brain tumors. J. Ambient Intell. Humaniz. Comput.14, https://doi.org/10.1007/s12652-021-03535-9 (2021).
9. Asiri, AA et al. A novel inherited modeling structure of automatic brain tumor segmentation from mri. Comput. Mater. Contin.; 2022; 73, pp. 3983-4002.
10. Tataei Sarshar, N. et al. Glioma brain tumor segmentation in four mri modalities using a convolutional neural network and based on a transfer learning method. In Brazilian Technology Symposium, 386–402 (Springer, 2021).
11. Liu, Z et al. Deep learning based brain tumor segmentation: a survey. Complex Intell. Systems.; 2023; 9, pp. 1001-1026. [DOI: https://dx.doi.org/10.1007/s40747-022-00815-5]
12. Long, J., Shelhamer, E. & Darrell, T. Fully Convolutional Networks for Semantic Segmentation, https://doi.org/10.48550/arXiv.1411.4038 (2015).
13. Kaifi, R. A Review of Recent Advances in Brain Tumor Diagnosis Based on AI-Based Classification. Diagnostics; 2023; 13, 3007. [DOI: https://dx.doi.org/10.3390/diagnostics13183007] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37761373][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10527911]
14. Islam, MM et al. Transfer learning architectures with fine-tuning for brain tumor classification using magnetic resonance imaging. Healthc. Anal.; 2023; 4, 100270. [DOI: https://dx.doi.org/10.1016/j.health.2023.100270]
15. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). ArXiv:1505.04597 [cs].
16. Siddique, N., Paheding, S., Elkin, C. P. & Devabhaktuni, V. U-Net and Its Variants for Medical Image Segmentation: A Review of Theory and Applications. IEEE Access9, 82031–82057, https://doi.org/10.1109/ACCESS.2021.3086020 (2021). Conference Name: IEEE Access.
17. Mehta, R. & Arbel, T. 3d u-net for brain tumour segmentation. In International MICCAI Brainlesion Workshop, 254–266 (Springer, 2018).
18. Cicek, O., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Ourselin, S., Joskowicz, L., Sabuncu, M. R., Unal, G. & Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, 424–432, https://doi.org/10.1007/978-3-319-46723-8_49 (Springer International Publishing, Cham, 2016).
19. Gitonga, M. M. Multiclass MRI Brain Tumor Segmentation using 3D Attention-based U-Net,https://doi.org/10.48550/arXiv.2305.06203 (2023). ArXiv:2305.06203 [cs, eess].
20. Asiri, AA et al. Brain Tumor Detection and Classification Using Fine-Tuned CNN with ResNet50 and U-Net Model: A Study on TCGA-LGG and TCIA Dataset for MRI Applications. Life; 2023; 13, 1449. [DOI: https://dx.doi.org/10.3390/life13071449] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37511824][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10381218]
21. Shedbalkar, J. & Prabhushetty, K. Deep transfer learning model for brain tumor segmentation and classification using UNet and chopped VGGNet. Indonesian J. Electr. Eng. Comput. Sci.33, 1405–1415, https://doi.org/10.11591/ijeecs.v33.i3.pp1405-1415 (2024).
22. Pravitasari, A. et al. UNet-VGG16 with transfer learning for MRI-based brain tumor segmentation. TELKOMNIKA Telecommunication Comput. Electron. Control.18, 1310, https://doi.org/10.12928/telkomnika.v18i3.14753 (2020).
23. Bensalah, H., Njeh, I., Slima, M., Farhat, N. & Mhiri, C. Vision transformers (ViT) and deep convolutional neural network (D-CNN)-based models for MRI brain primary tumors images multi-classification supported by explainable artificial intelligence (XAI). The Vis. Comput. 1–20, https://doi.org/10.1007/s00371-024-03524-x (2024).
24. Kolarik, M., Burget, R., Uher, V. & Povoda, L. Superresolution of MRI brain images using unbalanced 3D Dense-U-Net network. In 2019 42nd International Conference on Telecommunications and Signal Processing (TSP), 643–646, https://doi.org/10.1109/TSP.2019.8768829 (2019).
25. Khushi, HMT; Masood, T; Jaffar, A; Akram, S. A novel approach to classify brain tumor with an effective transfer learning based deep learning model. Braz. Arch. Biol. Technol.; 2024; 67, [DOI: https://dx.doi.org/10.1590/1678-4324-2024231137] e24231137.
26. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale, https://doi.org/10.48550/arXiv.2010.11929 (2021). ArXiv:2010.11929 [cs].
27. Chen, J. et al. TransUNet: Transformers make strong encoders for medical image segmentation, https://doi.org/10.48550/arXiv.2102.04306 (2021). ArXiv:2102.04306 [cs].
28. Wang, W. et al. TransBTS: Multimodal brain tumor segmentation using transformer, https://doi.org/10.48550/arXiv.2103.04430 (2021). ArXiv:2103.04430 [cs].
29. Hatamizadeh, A. et al. UNETR: Transformers for 3D medical image segmentation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1748–1758, https://doi.org/10.1109/WACV51458.2022.00181 (IEEE, Waikoloa, HI, USA, 2022).
30. Cao, H. et al. Swin-Unet: Unet-like pure transformer for medical image segmentation, https://doi.org/10.48550/arXiv.2105.05537 (2021).
31. Aloraini, M et al. Combining the transformer and convolution for effective brain tumor classification using mri images. Appl. Sci.; 2023; 13, 3680. [DOI: https://dx.doi.org/10.3390/app13063680]
32. Khushi, HMT; Masood, T; Jaffar, A; Rashid, M; Akram, S. Improved multiclass brain tumor detection via customized pretrained efficientnetb7 model. IEEE Access; 2023; 11, pp. 117210-117230. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3325883]
33. Renugadevi, M et al. Machine learning empowered brain tumor segmentation and grading model for lifetime prediction. IEEE Access; 2023; 11, pp. 120868-120880. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3326841]
34. Nawaz, A. et al. Vgg-unet for brain tumor segmentation and ensemble model for survival prediction. In 2021 International Conference on Robotics and Automation in Industry (ICRAI), 1–6 (IEEE, 2021).
35. Tie, J; Peng, H; Zhou, J. Mri brain tumor segmentation using 3d u-net with dense encoder blocks and residual decoder blocks. Comput. Model. Eng. Sci.; 2021; 128, pp. 427-445.
36. Soh, WK; Yuen, HY; Rajapakse, JC. HUT: Hybrid UNet transformer for brain lesion and tumour segmentation. Heliyon; 2023; 9, e22412. [DOI: https://dx.doi.org/10.1016/j.heliyon.2023.e22412] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38046150][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10686892]
37. Cai, Y et al. Swin Unet3D: a three-dimensional medical image segmentation network combining vision transformer and convolution. BMC Med. Informatics Decis. Mak.; 2023; 23, 33. [DOI: https://dx.doi.org/10.1186/s12911-023-02129-z]
38. Hatamizadeh, A. et al. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images, https://doi.org/10.48550/arXiv.2201.01266 (2022). ArXiv:2201.01266 [cs, eess].
39. Cinar, N; Ozcan, A; Kaya, M. A hybrid DenseNet121-UNet model for brain tumor segmentation from MR Images. Biomed. Signal Process. Control.; 2022; 76, 103647. [DOI: https://dx.doi.org/10.1016/j.bspc.2022.103647]
40. Selvaraju, R. R. et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J.Comput. Vis.128, 336–359, https://doi.org/10.1007/s11263-019-01228-7 (2019). Number: 2 Publisher: Springer.
41. Asiri, AA et al. Exploring the Power of Deep Learning: Fine-Tuned Vision Transformer for Accurate and Efficient Brain Tumor Detection in MRI Scans. Diagnostics; 2023; 13, 2094. [DOI: https://dx.doi.org/10.3390/diagnostics13122094] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37370989][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10297056]
42. Asiri, AA et al. Advancing Brain Tumor Classification through Fine-Tuned Vision Transformers: A Comparative Study of Pre-Trained Models. Sensors; 2023; 23, 7913. [DOI: https://dx.doi.org/10.3390/s23187913] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37765970][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10535333]
43. Wang, J; Lu, S-Y; Wang, S-H; Zhang, Y-D. RanMerFormer: Randomized vision transformer with token merging for brain tumor classification. Neurocomputing; 2024; 573, 127216. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.127216]
44. Tummala, S; Kadry, S; Bukhari, SAC; Rauf, HT. Classification of Brain Tumor from Magnetic Resonance Imaging Using Vision Transformers Ensembling. Curr. Oncol.; 2022; 29, pp. 7498-7511. [DOI: https://dx.doi.org/10.3390/curroncol29100590] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36290867][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9600395]
45. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, https://doi.org/10.48550/arXiv.1602.04938 (2016). ArXiv:1602.04938 [cs, stat].
46. Lundberg, S. & Lee, S.-I. A unified approach to interpreting model predictions, https://doi.org/10.48550/arXiv.1705.07874 (2017). ArXiv:1705.07874 [cs, stat].
47. Menze, BH et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging.; 2015; 34, pp. 1993-2024. [DOI: https://dx.doi.org/10.1109/TMI.2014.2377694] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25494501]
48. Jun, C. Brain tumor dataset. Figshare https://doi.org/10.6084/m9.figshare.1512427 (2017).
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
This paper focuses on designing and developing novel architectures, termed the Hybrid Vision UNet-Encoder Decoder (HVU-ED) segmenter and the Hybrid Vision UNet-Encoder (HVU-E) classifier, for brain tumor segmentation and classification, respectively. The proposed model integrates the powerful feature extraction capabilities of hybrid backbones such as ResNet50, VGG16, DenseNet121, and Xception with a Vision Transformer (ViT). The extracted hybrid features are fused with UNet features at the bottleneck and passed to the HVU-ED decoder path for the segmentation task. In HVU-E, the same features are fed as input to the classification layer and to machine learning algorithms such as SVM, RF, DT, Logistic Regression, and AdaBoost. The proposed DenseVU-ED model obtained the highest segmentation accuracy of 98.91% on the BraTS2020 dataset, with the highest Dice scores of 0.902 for the enhanced tumor, 0.954 for the core tumor, and 0.966 for the whole tumor. The DenseVU-E classifier achieved the highest accuracy of 99.18% with neural network classification and 92.21% with SVM on the Figshare dataset. Grad-CAM, SHAP, and LIME techniques provide model interpretability, highlighting the models' focus on significant brain areas and the transparency of their decision-making. The proposed models outperform existing methods in segmentation and classification tasks.
Details
1 SASTRA Deemed University, School of Electrical and Electronics Engineering, Thanjavur, India (GRID:grid.412423.2) (ISNI:0000 0001 0369 3226)
2 SASTRA Deemed University, School of Computing, Thanjavur, India (GRID:grid.412423.2) (ISNI:0000 0001 0369 3226)