Introduction
According to the World Health Organization, heart disease remains one of the leading causes of death worldwide [1]. Accurate and effective early diagnosis is therefore crucial for improving heart disease treatment outcomes and reducing mortality rates. In clinical settings, clinicians assess ventricular function by obtaining rapid and precise cardiac images. These images support the calculation of essential clinical indicators, including ventricular volume, ejection fraction (EF), and myocardial mass [2].
Echocardiography has become a widely used method for acquiring cardiac information due to its ability to generate spatiotemporal data in the form of videos. This approach enables clinicians to visualize dynamic changes within the heart and measure functional parameters such as EF, which reflect hemodynamic status and myocardial contractility. Notably, a lower EF often indicates more severe ventricular remodeling and diminished myocardial contractile function. However, poor-quality echocardiographic images can result in inaccurate EF measurements, delaying appropriate treatment and negatively affecting patient outcomes. Given these challenges, there is a growing need for automated and accurate assessment of left ventricular ejection fraction. Traditional echocardiographic analysis demands considerable time and expertise from cardiologists. In response, developers have introduced automated cardiac analysis algorithms to improve efficiency and accuracy. Early segmentation methods for the left ventricle focused on mathematical models, which achieved only limited success and were often validated solely on private datasets [3].
With the emergence of deep learning, however, echocardiographic analysis has advanced significantly, especially as these new methods can be trained and evaluated on larger public datasets. The successful application of convolutional neural networks (CNNs) to large datasets such as ImageNet and ADE20K has made deep learning the mainstream solution for left ventricular segmentation in echocardiography. Furthermore, the public release of large-scale echocardiographic datasets such as EchoNet-Dynamic [4] has enabled more effective training and evaluation of these models.
Recently, Vision Transformer (ViT) models [5] have demonstrated strong potential in computer vision and echocardiographic segmentation tasks. For example, Deng et al. [6] introduced a TransBridge network—composed of two CNNs connected by Transformer blocks—that enhances the model’s capacity to perceive multi-scale structures in ultrasound images through a bridging module. Additionally, Zeng et al. [7] proposed the MAEF-Net model, which incorporates multiple attention mechanisms to help the network focus on key image regions, especially those associated with the left ventricle, thus improving segmentation accuracy.
Building on these advances, Liao et al. [8] explored the strengths of Transformer-based models for segmentation. They proposed two architectures: one combining Swin Transformer with K-Net, and another using SegFormer. Evaluations on the EchoNet-Dynamic dataset showed that these models achieved a Dice coefficient of 92.92%, even on challenging samples.
Despite these advances, Transformer-based models often require substantial computational resources, have slower inference speeds, and demand high-end hardware, which can restrict their use in some institutions and enterprises. Consequently, researchers now face the challenge of improving left ventricular segmentation accuracy while simultaneously reducing model complexity and hardware requirements to enable broader practical adoption.
Contribution to research
To solve this problem, this paper proposes an improved nested UNet model. The model incorporates a deeper feature extraction module (NestedUNet block) to enhance feature representation and enable multi-scale feature capture. At the same time, the UNet network is tailored to achieve a lightweight design, thereby reducing computational and storage overhead. Additionally, to reduce parameter redundancy, improve computational efficiency, and enhance local information, this paper incorporates the lightweight attention mechanism SimAM [9] and integrates it with CBAM. We evaluate our model on the EchoNet-Dynamic dataset for segmenting left ventricular structures during end-diastolic and end-systolic phases. The model achieves a Dice coefficient of 93.16% while maintaining a compact parameter size.
Novelties of this study include:
* An improved nested UNet structure is proposed to enhance feature expression and capture multi-scale features by introducing a deeper feature extraction module (NestedUNet block).
* The UNet network has been trimmed and designed to be lightweight to reduce computational and storage overhead.
* CBAM is fused with SimAM, a lightweight attention mechanism, to reduce parameter redundancy, improve computational efficiency, and enhance local information.
* Binary cross-entropy loss and Dice loss are combined to solve the problem of vanishing gradients or gradient explosions.
These innovations strike a balance between segmentation accuracy and computational efficiency, offering a robust solution for echocardiographic analysis.
Related research
In this section, some of the related work will be discussed. First, a brief review of previous ventricular segmentation methods, with a focus on CNN-based methods, is provided. Second, the recently popular Vision Transformer (ViT) and the basic deep learning networks related to the research in this paper are introduced.
Non-deep learning methods
Early non-deep learning approaches to left ventricular segmentation primarily targeted the identification and delineation of the left ventricular boundary. Methods such as active contours [10] achieved relatively effective segmentation in ultrasound images, but they relied on specific data formats and lacked scalability. Barbosa et al. [11] proposed a global anatomical affine optical flow and local recursive block matching technique: affine optical flow captures the overall ventricular motion, while recursive block matching refines feature tracking in complex scenarios. Bernard et al. [12] compared nine segmentation methods on a relatively fair basis by evaluating them on the same dataset (45 RT3DE videos) with multiple metrics, including segmentation precision, computational time, and robustness. The experiments showed satisfactory results, demonstrating the competitiveness of the method of Barbosa et al. Nonetheless, these traditional models still produce results that often deviate from cardiologist assessments, and they struggle to maintain robustness across diverse segmentation tasks on larger datasets.
Deep learning methods
The widespread application of deep learning technology has driven the latest advances in medical imaging. Convolutional neural network-based cardiac segmentation [13–15], three-dimensional convolutional networks (3D CNNs) [16], multi-fusion networks [17,18], and variants of residual networks have become the most commonly used methods for ventricular segmentation tasks.
For example, Baumgartner et al. [19] discussed the application of 2D and 3D deep learning techniques to cardiac magnetic resonance imaging (MRI) segmentation and compared the two approaches when processing cardiac MRI images. Their research clearly highlights the differences between network architectures in capturing local features and global contextual information, offering guidance for subsequent UNet-related studies. Patravali et al. [20] proposed a cardiac MRI segmentation method combining 2D and 3D fully convolutional neural networks (FCNs), using 2D networks to capture local features and 3D networks to capture volumetric context, thereby improving segmentation accuracy. These studies reveal the important influence of network architecture on segmentation results.
In recent years, various deep learning network architectures have shown distinct strengths in medical image segmentation tasks. Among these, the fully convolutional M-Net architecture proposed by Jang et al. [21] has demonstrated effective segmentation of the left ventricle, right ventricle, and myocardium, highlighting its considerable potential for clinical application. This finding contrasts with the study by Luo et al. [22], which used the UNet architecture to improve cardiac MRI segmentation by introducing a context extraction module and a segmentation module, incorporating information from previous segmentation labels. These improvements ensure effective feature learning through the design of skip connections, reducing feature loss and gradient dispersion. Huang et al. [23] further optimized segmentation on the basis of UNet by using full-scale skip connections to combine low-level details with high-level semantics of feature maps at different scales. The DoubleU-Net proposed by Jha et al. [24] further improved performance on various medical image segmentation datasets by stacking two UNets. Moreover, to address the challenges posed by model computational complexity, Liao et al. [25] proposed LightM-UNet, whose lightweight design reduces computational and storage overhead while maintaining high-precision segmentation, making it more suitable for deployment in resource-constrained environments.
These findings collectively indicate that, despite the success of a large number of works based on other architectures, UNet and its variants are still evolving and showing great potential for medical image segmentation, especially in response to parameter changes and optimizing the use of computing resources.
In addition to CNN-based approaches, recent advances in computer vision have also leveraged Transformer architectures. The success of Vision Transformer (ViT) [26] has driven the development of deep learning approaches that introduce attention mechanisms to handle a variety of tasks. After extensive pre-training, ViT achieves classification accuracy on the ImageNet dataset that rivals ResNet. However, as the complexity of these models increases, so do their computing resource requirements.
To address this challenge, Liu et al. [27] proposed in 2021 a hierarchical Transformer that computes self-attention within shifted windows. This shifted-window scheme confines computation to non-overlapping local windows while still allowing cross-window connections, increasing efficiency while maintaining high accuracy. Cao et al. [28] implemented the Swin Transformer module in a UNet-like architecture and developed a multi-organ medical image segmentation framework for MR images, validating its effectiveness in segmentation tasks.
Although these emerging Transformer models excel in accuracy, their application in clinical settings faces challenges. Hatamizadeh et al. [29] noted that they often have high computational complexity and resource requirements, demanding substantial CPU or GPU resources during training and inference. This not only increases deployment costs but also hinders real-time diagnosis and treatment in hospital environments, where timely feedback is essential: long inference times can delay system responses and affect physicians' decision-making, particularly in real-time image analysis and auxiliary diagnosis systems. As a result, even if Transformer models perform well in theory, their feasibility in practical applications still faces significant challenges.
In summary, there is a clear need to balance segmentation accuracy with computational efficiency for clinical deployment. Motivated by these issues and inspired by [23–25], this study adopts an improved UNet as the model architecture for the left ventricular segmentation task in echocardiography. The aim is to address the computational complexity and inference-time requirements of practical applications, especially real-time image analysis and auxiliary diagnosis systems, thus improving physicians' efficiency and reducing patients' treatment time.
Heart segmentation algorithm
This section presents a theoretical overview of the proposed encoder-decoder-based architecture. A comprehensive architectural design is provided, and the main building blocks and their functions are described.
An overview of the model structure
In this study, we construct a fusion model that combines an improved UNet architecture with the SCBAM attention module for segmentation of the left ventricular end-diastolic volume (EDV) and end-systolic volume (ESV) in echocardiography, using a video dataset. Fig 1 illustrates the overall structure and workflow, providing a clear overview of the main network components: DA-UNet++, NestedUNet, the VGG module, and SCBAM. Each of these modules plays a specific role in the network.
[Figure omitted. See PDF.]
The network structure combines NestedUNet, the VGG module, and SCBAM. Its core is DA-UNet++, which is based on an encoder-decoder structure.
The network is based on an encoder-decoder structure: DA-UNet++ processes images by encoding (compressing information) and decoding (recovering information) features layer by layer, and NestedUNet further enhances this capability. The nested structure processes feature maps at different resolutions through pooling and upsampling operations, ensuring that features at different scales are used effectively, and passes encoder features directly to the decoder through skip connections, helping to retain high-resolution details. The network thus achieves efficient multi-scale feature extraction and high-resolution reconstruction. This is essentially the encoding and decoding process of UNet: the encoder extracts features layer by layer, and the decoder recovers the spatial resolution layer by layer.
Feature extraction is further enhanced through the VGG (deep convolutional neural network) module, which stacks convolutional and batch normalization layers, with SCBAM added to further strengthen the feature representation. Each convolution operation keeps the spatial dimensions of the input image unchanged, while the number of output channels changes according to the network configuration; the convolutions extract edge and feature information from the image, and batch normalization helps stabilize the training process.
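As a concrete illustration, the following is a minimal PyTorch sketch of such a VGG-style block. The helper name vgg_block and the channel arguments are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU.
    padding=1 keeps the spatial dimensions unchanged, as described above;
    only the channel count changes."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```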
The SCBAM module introduces attention mechanisms to focus on the most informative features. It combines channel attention, an energy function, and spatial attention: channel attention is applied first, then SimAM [9] measures the importance of each neuron, and finally spatial attention is applied, enhancing the feature representation across multiple dimensions.
DA-UNet++ network structure
The DA-UNet++ model introduces optimizations in many aspects on the basis of UNet++ to further improve performance on medical image segmentation tasks [30] while significantly reducing computational and storage overhead. UNet++ improves on the encoder-decoder structure of traditional UNet by introducing dense connections and multi-scale feature fusion, thereby improving segmentation performance. Specifically, UNet++ employs dense skip connections to transfer information not only between encoder and decoder, but also between feature maps of the same resolution. The multi-scale feature fusion strategy in the decoder combines features from different levels to capture image details and global information more accurately, improving segmentation accuracy. To further optimize the model, structural pruning was applied to UNet++: after the introduction of deep supervision, each sub-network (L1–L4) [31] can output segmentation results independently, making it possible to prune unnecessary parts at test time, greatly reducing network size and computing and storage overhead. The feature fusion operation in the first layer can be represented as:
(1) $x^{0,j} = g\!\left(\left[x^{0,0}, x^{0,1}, \ldots, x^{0,j-1}, \mathcal{U}\!\left(x^{1,j-1}\right)\right]\right)$
where $g$ denotes the feature fusion operation (channel-wise concatenation followed by convolution), $x^{0,0}$ is the feature map passed from the encoder, and $\mathcal{U}(\cdot)$ denotes upsampling.
On this basis, we propose an improved nested UNet structure (NestedUNet). Through its nested design, the network captures more expressive features and enhances segmentation performance. First, a 1 × 1 convolution is applied to the input to adjust the number of channels and preserve the original input information. Then, the spatial dimensions of the input are halved by a max pooling layer, and the feature map is fed into the first VGG module for two convolution operations. Next, the spatial dimensions of the feature map are doubled by transposed-convolution upsampling, the upsampled feature map is concatenated with the previously saved map along the channel dimension, and two further convolution operations are performed in the second VGG module. At the same time, the SCBAM module is introduced for optimization. NestedUNet captures richer contextual information by fusing feature maps at different scales through multi-level pooling and upsampling operations. This multi-scale feature fusion strategy significantly improves the model's performance in segmentation tasks, especially in object edge recognition and complex detail handling. Meanwhile, the architecture extracts hierarchical features through convolutional layers at different depths, capturing local spatial information while preserving global spatial information. These features are extracted layer by layer, from low-level details (such as edges and textures) to high-level semantic information (such as objects and segmented regions), reflecting different levels of abstraction and further enhancing segmentation accuracy. In addition, the VGG module enables the model to extract and retain more low-level edge information, reducing ambiguous areas in the segmentation result and further improving accuracy.
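To make the nested block's data flow concrete, here is a minimal PyTorch sketch. It reuses the vgg_block helper sketched earlier and treats SCBAM (detailed in the next section) as a pluggable stage; the class name, channel sizes, and structure are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class NestedBlock(nn.Module):
    """Sketch of the nested block described above: a 1x1 convolution preserves
    the input, max pooling halves the spatial size, a VGG block convolves twice,
    a transposed convolution doubles the size again, and the upsampled features
    are concatenated with the preserved input before a second VGG block and the
    attention stage."""
    def __init__(self, in_ch, out_ch, attention=None):
        super().__init__()
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)        # preserve input, adjust channels
        self.pool = nn.MaxPool2d(2)                                # halve spatial dimensions
        self.vgg1 = vgg_block(in_ch, out_ch)                       # first two-convolution stage
        self.up = nn.ConvTranspose2d(out_ch, out_ch, 2, stride=2)  # double spatial size
        self.vgg2 = vgg_block(out_ch * 2, out_ch)                  # fuse concatenated features
        self.attn = attention if attention is not None else nn.Identity()  # SCBAM slot

    def forward(self, x):
        saved = self.skip(x)              # preserved 1x1-convolution branch
        y = self.vgg1(self.pool(x))       # downsample, then convolve twice
        y = self.up(y)                    # transposed-convolution upsampling
        y = torch.cat([saved, y], dim=1)  # concatenate along the channel dimension
        return self.attn(self.vgg2(y))    # second VGG stage, then attention
```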
In summary, a more efficient DA-UNet++ model is obtained by optimizing the structure of UNet and adopting a lightweight design. By cropping redundant parts, introducing a deeper feature extraction module, strengthening multi-scale feature fusion, and adding VGG modules at key locations, the optimized model performs well in medical image segmentation tasks: it not only improves segmentation accuracy but also significantly reduces computing and storage overhead, making deployment in real medical environments feasible.
Attention mechanism SCBAM
SCBAM combines the advantages of CBAM and SimAM to reduce parameter redundancy, improve computational efficiency, and enhance local information without increasing the model size, as illustrated in Fig 2. Specifically, the Channel Attention (CA) module enhances overall performance by highlighting important channel features in the feature map. The SimAM module further refines and smooths local features, while the Spatial Attention (SA) module strengthens the spatial information in the feature map.
[Figure omitted. See PDF.]
SCBAM fuses the SimAM and CBAM attention modules to reduce parameter redundancy, improve computational efficiency, and enhance local information, all without increasing the model size.
CBAM enhances feature representation by extracting both channel and spatial attention features. The channel attention mechanism focuses on the most important channels, improving the representation of key feature channels. The spatial attention mechanism, on the other hand, targets specific regions of the image, thereby enhancing spatial features. The channel attention mechanism helps the model concentrate on the most relevant features for the task, while spatial attention emphasizes critical areas of the image.
SimAM, in contrast, emphasizes the local information of each pixel in the feature map. It highlights important features while reducing noise interference through a parameter-free energy function calculation. This localized focus eliminates the need for global computations, thereby improving computational efficiency. SimAM adjusts the attention of each feature map by calculating its variance, enabling the model to focus adaptively on areas with higher variance, typically associated with edges or transitions between different structures. These mechanisms help the model focus on more meaningful information at various levels, improving segmentation accuracy.
By combining these attention mechanisms, the model can more effectively focus on key regions and important channels in the image, thereby improving segmentation accuracy. To prevent the SCBAM attention mechanism from significantly reducing the model’s inference speed, this study employs the following optimization methods:
* Pruned Nested UNet Structure: By reducing the number of convolutional layers and pruning redundant convolutional kernels and neurons, the computational overhead is decreased, the model size is reduced, and inference speed is enhanced.
* Adjusting the Kernel Size of the Attention Mechanism: When using the spatial attention module of CBAM, smaller convolutional kernels are employed to reduce the computational burden while maintaining effective attention mechanisms.
CBAM
The Convolutional Block Attention Module (CBAM) is an attention mechanism designed for convolutional neural networks. It combines channel attention and spatial attention to enhance the representational power of feature maps. CBAM improves the performance of the model by adaptively weighting the input feature map to highlight the important features and suppress irrelevant ones. CBAM consists of two sub-modules: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), as shown in Fig 3.
[Figure omitted. See PDF.]
CBAM (Convolutional Block Attention Module) enhances the representation ability of feature maps by combining channel attention and spatial attention.
CBAM is similar to how our brains focus attention on important parts when observing something. Imagine you are looking at a busy street filled with people, vehicles, and buildings. If you are searching for a friend, your brain naturally concentrates on the crowd, ignoring irrelevant elements like parked cars or buildings. This is exactly what CBAM does in neural networks—it helps the model focus on “important” features, such as distinguishing people from vehicles in an image.
CAM captures the relationship between channels through global average pooling and global maximum pooling, generates channel attention, and applies it to the input feature map $F$. The Channel Attention Module is computed in the following steps:
Global average pooling and global maximum pooling
(2) $F^{c}_{\mathrm{avg}} = \dfrac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i,j)$
(3) $F^{c}_{\mathrm{max}} = \max_{i,j} F(i,j)$
where $H$ and $W$ represent the height and width of the feature map.
Shared network: The two descriptors are passed through a shared multilayer perceptron (MLP) to generate a channel attention map $M_c$.
(4) $M_c(F) = \sigma\!\left(\mathrm{MLP}\!\left(F^{c}_{\mathrm{avg}}\right) + \mathrm{MLP}\!\left(F^{c}_{\mathrm{max}}\right)\right)$
where $\sigma$ is the Sigmoid activation function and the MLP consists of two fully connected layers: the first has output dimension $C/r$ ($r$ is the reduction ratio) and the second has output dimension $C$.
Attention Weighting: Multiply the input feature map by the channel attention map.
(5) $F' = M_c(F) \odot F$
where $\odot$ denotes element-wise multiplication.
The SAM module captures the relationship between spatial locations by applying global average pooling and global maximum pooling along the channel dimension to generate a spatial attention map. Given the input feature map $F'$, the calculation steps of the Spatial Attention Module are as follows:
Global average pooling and global maximum pooling along the channel dimension yield two descriptors, $F^{s}_{\mathrm{avg}}$ and $F^{s}_{\mathrm{max}}$:
(6) $F^{s}_{\mathrm{avg}} = \mathrm{AvgPool}_{\mathrm{ch}}(F')$
(7) $F^{s}_{\mathrm{max}} = \mathrm{MaxPool}_{\mathrm{ch}}(F')$
Concatenation and convolution: the two descriptors are concatenated along the channel dimension and passed through a convolutional layer to generate a spatial attention map.
(8) $M_s(F') = \sigma\!\left(\mathrm{Conv}\!\left(\left[F^{s}_{\mathrm{avg}};\, F^{s}_{\mathrm{max}}\right]\right)\right)$
where $\sigma$ is the Sigmoid activation function, $[\cdot\,;\cdot]$ represents concatenation along the channel dimension, and $\mathrm{Conv}$ is a convolutional layer ($7 \times 7$ in the original CBAM; a smaller kernel is used here, as noted above).
Attention weighting: Multiply the input feature map by the spatial attention map.
(9) $F'' = M_s(F') \odot F'$
The overall architecture of CBAM consists of the two modules mentioned above, which generate the final weighted feature map by first performing channel attention and then spatial attention.
SimAM
The SimAM (Simple, Parameter-Free Attention Module) [9] draws its design inspiration from neuroscience theory. It aims to enhance the representational capacity of convolutional neural networks while maintaining a lightweight structure without introducing additional parameters. SimAM addresses two major limitations of existing attention modules: first, their attention weights are typically restricted to either the channel or spatial dimension; second, they often require extra parameters, reducing flexibility and increasing model complexity.
Imagine being in a classroom where a teacher is explaining a topic. Some students (features) are more focused because they have studied more and understood the material more deeply. SimAM is like a teacher who can recognize which students are more attentive and concentrate more attention on those students, thereby enhancing the overall learning effectiveness. It gives more attention to students who have a deeper understanding (important features) while ignoring those who are less engaged (irrelevant features).
The SimAM module calculates attention weights by optimizing an energy function. The specific steps are as follows:
Definition of the energy function: Inspired by the concept of spatial inhibition in neuroscience, SimAM defines an energy function that quantifies the importance of each neuron.
(10) $e_t^{*} = \dfrac{4\left(\hat{\sigma}^{2} + \lambda\right)}{\left(t - \hat{\mu}\right)^{2} + 2\hat{\sigma}^{2} + 2\lambda}$
where $t$ is the target neuron, $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the mean and variance of the channel, respectively, and $\lambda$ is a hyperparameter.
Computation of attention weights: The importance weights of each neuron are calculated by solving the closed-form solution of the energy function.
(11) $\hat{\mu} = \dfrac{1}{M}\sum_{i=1}^{M} x_i, \qquad \hat{\sigma}^{2} = \dfrac{1}{M}\sum_{i=1}^{M}\left(x_i - \hat{\mu}\right)^{2}$
where $\sum_{i=1}^{M}\left(x_i - \hat{\mu}\right)^{2}$ is the sum of squared deviations of the feature map and $\hat{\sigma}^{2}$ is the channel variance; a lower energy $e_t^{*}$ marks a more important neuron, so each neuron's importance weight is taken as $1/e_t^{*}$.
Reweighting of feature maps: Apply the attention weights to reweight the input feature maps to obtain the output feature maps.
(12) $\tilde{X} = \mathrm{sigmoid}\!\left(\dfrac{1}{E}\right) \odot X$
where $\mathrm{sigmoid}$ is the Sigmoid function and $E$ groups all $e_t^{*}$ across the channel and spatial dimensions.
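Tying equations (2) through (12) together, the following is a minimal PyTorch sketch of an SCBAM-style module that applies channel attention, then SimAM, then spatial attention, in the order described earlier. The class names, the reduction ratio r = 16, λ = 1e-4, and the 3 × 3 spatial kernel (the paper states only that a smaller kernel than CBAM's default is used) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Eqs (2)-(5): avg- and max-pooled descriptors through a shared MLP, sigmoid, reweight.
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),  # first FC layer: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # second FC layer: C/r -> C
        )

    def forward(self, x):
        b, c = x.shape[:2]
        avg = self.mlp(x.mean(dim=(2, 3)))            # global average pooling, Eq (2)
        mx = self.mlp(x.amax(dim=(2, 3)))             # global max pooling, Eq (3)
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel attention map, Eq (4)
        return x * w                                  # element-wise reweighting, Eq (5)

class SimAM(nn.Module):
    # Eqs (10)-(12): parameter-free attention from the closed-form energy function.
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam  # the hyperparameter lambda

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2  # squared deviations from the mean
        v = d.sum(dim=(2, 3), keepdim=True) / n          # channel variance, Eq (11)
        e_inv = d / (4 * (v + self.lam)) + 0.5           # inverse-energy importance, Eqs (10)-(11)
        return x * torch.sigmoid(e_inv)                  # reweight the feature map, Eq (12)

class SpatialAttention(nn.Module):
    # Eqs (6)-(9): channel-wise avg/max maps, concatenation, convolution, sigmoid, reweight.
    def __init__(self, kernel_size=3):  # a small kernel, per the optimization noted above
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                      # Eq (6)
        mx = x.amax(dim=1, keepdim=True)                       # Eq (7)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], 1)))  # Eq (8)
        return x * w                                           # Eq (9)

class SCBAM(nn.Module):
    # Channel attention -> SimAM -> spatial attention, as described above.
    def __init__(self, channels):
        super().__init__()
        self.stages = nn.Sequential(ChannelAttention(channels), SimAM(), SpatialAttention())

    def forward(self, x):
        return self.stages(x)
```

A module built this way can be dropped into the NestedBlock sketch above, e.g. NestedBlock(32, 64, attention=SCBAM(64)).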
Loss function
The loss function in this paper consists of two types of loss functions: Dice Loss and Binary Cross-Entropy Loss.
Dice Loss is a loss function based on the Dice Similarity Coefficient (DSC). DSC measures the similarity between two sets and is commonly used in image segmentation to evaluate the degree of overlap between the predicted segmentation result and the ground truth. The formula is as follows:
(13) $\mathrm{DSC}(A, B) = \dfrac{2\,|A \cap B|}{|A| + |B|}$
where $A$ and $B$ are two sets; in image segmentation, $A$ is the predicted segmentation result and $B$ is the ground truth, and their elements correspond to the predicted and true values of each pixel.
Dice Loss is defined as:
(14) $L_{\mathrm{Dice}} = 1 - \mathrm{DSC}(A, B)$
Binary Cross-Entropy (BCE) Loss quantifies the difference between two probability distributions, and is particularly suitable for binary classification problems. In the image segmentation task, BCE Loss evaluates the difference between the predicted and true values of each pixel. The formula is as follows:
(15) $L_{\mathrm{BCE}} = -\dfrac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + \left(1 - y_i\right)\log\left(1 - p_i\right)\right]$
where $N$ is the total number of samples, $y_i$ is the true label of the $i$-th sample, and $p_i$ is the predicted probability of the $i$-th sample.
The loss function used in this paper combines Dice Loss and BCE Loss to optimize both simultaneously and balance their impact on the final segmentation result. The formula for the combined loss function is as follows:
(16) $L = \alpha\, L_{\mathrm{Dice}} + \left(1 - \alpha\right) L_{\mathrm{BCE}}$
where $\alpha$ is the weight factor controlling the contributions of Dice Loss and BCE Loss to the final loss.
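A minimal PyTorch sketch of the combined loss follows, assuming sigmoid-probability predictions and binary ground-truth masks; the smoothing constants and function names are illustrative. Setting alpha = 0.8 reflects the ablation result reported below, under the assumption that the weight applies to the Dice term.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Eq (14): 1 - DSC, computed on flattened probability maps.
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    # Eq (15): binary cross-entropy averaged over all pixels.
    pred = pred.clamp(eps, 1 - eps)  # guard against log(0)
    return -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

def combined_loss(pred, target, alpha=0.8):
    # Eq (16): weighted sum of Dice Loss and BCE Loss.
    return alpha * dice_loss(pred, target) + (1 - alpha) * bce_loss(pred, target)
```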
Experiments and analysis
This section describes the dataset used in the experiments. Then, to verify the effectiveness of each module and component, several sets of ablation experiments were conducted. Finally, the evaluation metrics used are briefly described. All experiments were run in a PyTorch environment on an NVIDIA GeForce RTX 4090 Ti GPU. The learning rate was set to 1e-4, the batch size to 2, and the number of epochs to 50.
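For orientation, a minimal training-loop sketch under the stated hyperparameters is shown below. The Adam optimizer is an assumption (the paper specifies only the learning rate, batch size, and epoch count), and model, train_loader, and combined_loss stand in for the DA-UNet++ network, an EchoNet-Dynamic data loader, and the combined loss sketched in the previous section.

```python
import torch

def train(model, train_loader, device="cuda"):
    # Hyperparameters from the paper: lr = 1e-4, batch size 2, 50 epochs.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is assumed
    for epoch in range(50):
        for frames, masks in train_loader:        # batches of size 2
            frames, masks = frames.to(device), masks.to(device)
            preds = torch.sigmoid(model(frames))  # per-pixel probabilities
            loss = combined_loss(preds, masks)    # Dice + BCE, Eq (16)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```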
Data set
The dataset used in this paper, EchoNet-Dynamic, comes from the Echocardiography Laboratory and the Center for Artificial Intelligence in Medical Imaging (AIMI) at Stanford University. A standard full resting echocardiogram study consists of 50–100 videos and still images that visualize the heart from different angles and positions using different acquisition techniques (2D images, tissue Doppler images, color Doppler images, etc.). In this dataset, an apical 4-chamber 2D grayscale video is extracted from each study, and each video represents a unique individual. Representative frames of the EchoNet-Dynamic dataset, drawn from five independent videos and totaling 11 frames after removing ECG data, text labels, and ultrasound acquisition information, are shown in Fig 4. The dataset contains 10,036 echocardiogram videos from 10,036 individuals who underwent echocardiography between 2006 and 2018.
[Figure omitted. See PDF.]
After removing ECG data, text, and collected information, 11 frames of images were extracted from 5 videos, each representing an independent individual.
The apical 4-chamber view video was identified by extracting the Digital Imaging and Communications in Medicine (DICOM) file associated with the ventricular volume measurements used to calculate ejection fraction in the apical view. Each study was linked to clinical measurements and calculations obtained by a registered sonographer and validated by a level 3 echocardiographer in a standard clinical workflow. The left ventricular ejection fraction is a core indicator of cardiac function: it is used to diagnose cardiomyopathy, evaluate eligibility for certain chemotherapy treatments, and determine indications for medical devices. It is also significantly associated with mortality in many disease states, with a lower ejection fraction correlating with a worse prognosis [32]. The ejection fraction is calculated as (EDV − ESV)/EDV, as shown in Fig 5. Therefore, this study proposes the DA-UNet++ model to segment ESV and EDV more accurately from video data and calculate the left ventricular ejection fraction, providing clinicians with more accurate assistance.
[Figure omitted. See PDF.]
The calculation formula for ejection fraction is (EDV – ESV)/ EDV.
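The EF computation itself is a one-line formula; the following small sketch applies it, taking EDV and ESV as already-estimated volumes (in practice these are derived from the segmented end-diastolic and end-systolic frames).

```python
def ejection_fraction(edv: float, esv: float) -> float:
    """EF = (EDV - ESV) / EDV, expressed as a percentage."""
    if edv <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv - esv) / edv

# Example: EDV = 120 mL and ESV = 50 mL give an EF of about 58.3%.
```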
Ablation experiments
To evaluate the proposed method and verify its performance under different settings, a series of ablation experiments was conducted. These experiments cover variations in attention mechanisms, cropping optimization at different scales, different loss functions, and the necessity of each module in the overall framework.
Different kinds of attention mechanisms
This study evaluates the influence of the SCBAM component by testing how well it helps the model focus on important features and locations, thereby improving its performance in image segmentation and detection tasks. The effects of different attention mechanisms are also compared: the segmentation results of SE, CBAM, ECA, and SCBAM are shown in Table 1. SCBAM achieves the best segmentation performance, as it most effectively focuses on important features and locations.
[Figure omitted. See PDF.]
Cropping optimization at different scales
To evaluate the impact of UNet++ at different scales, this study tests how reducing the network scale can decrease computing and storage overhead while maintaining accuracy. As shown in Table 2, pruning at L3 significantly outperforms the unpruned UNet++ (UNet-l4), and the resulting network is substantially smaller than the traditional UNet++ network, enabling accurate segmentation of the video data with the L3-pruned model.
[Figure omitted. See PDF.]
Different loss functions
To establish a reasonable loss function, grid search was used to perform ablation experiments on the loss function parameters, as shown in Table 3. The results show that the best performance is achieved by combining Dice Loss and BCE Loss, so this combination is used as the loss function in this study. It was verified that the effect is optimal when the weight combining Dice Loss and BCE Loss is set to 0.8, as shown in equation (16).
[Figure omitted. See PDF.]
The role of each module in the overall framework
Based on UNet++, SCBAM modules, cropping optimization, and NestedUNet nesting modules were added to evaluate the necessity of each component, as shown in Table 4. With the addition of each module, the segmentation results improve, and the complete method achieved the best DSC and IoU results, demonstrating the effectiveness of each module in enhancing segmentation performance.
[Figure omitted. See PDF.]
Evaluation indicators
The evaluation indicators used in this study are DSC (Dice Similarity Coefficient) and IoU (Intersection over Union), both commonly used in image segmentation tasks as they effectively and intuitively measure segmentation quality. Each indicator has its own unique advantages and applicability in evaluating segmentation results.
DSC is more concerned with the balance of overlapping areas, and its formula is:
(17) $\mathrm{DSC} = \dfrac{2\,|X \cap Y|}{|X| + |Y|}$
where $X$ is the set of predicted segmentation results, $Y$ is the set of ground-truth results, and $|X \cap Y|$ is the number of elements in the intersection of the predicted and true results.
The IoU more rigorously evaluates the degree of matching between the predicted and real regions, and its formula is:
(18) $\mathrm{IoU} = \dfrac{|X \cap Y|}{|X \cup Y|}$
where $|X \cup Y|$ is the number of elements in the union of the predicted and true results.
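Both metrics can be computed directly from binary masks; a minimal sketch follows, assuming 0/1 torch tensors and using a small smoothing term to guard against empty masks.

```python
import torch

def dsc(pred, target, eps=1e-6):
    # Eq (17): 2|X ∩ Y| / (|X| + |Y|) over binary masks.
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-6):
    # Eq (18): |X ∩ Y| / |X ∪ Y| over binary masks.
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return (inter + eps) / (union + eps)
```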
Experimental results
To evaluate the DA-UNet++ architecture, its performance was compared with that of seven other models, and both quantitative and qualitative analyses of the results were performed.
Quantitative results
To evaluate the segmentation performance of DA-UNet++ on the EchoNet-Dynamic dataset, comparative experiments were conducted using Deeplabv3, Deeplabv3+, UNet++, UNet-l2, ResNet, FCN, and PSPNet. The Dice similarity coefficient and IoU were used as evaluation metrics to measure segmentation accuracy, and the number of floating-point operations (FLOPs) and the parameter count (Params) were used to assess computational complexity. The experimental results of DA-UNet++ on the EchoNet-Dynamic dataset are shown in Table 5.
[Figure omitted. See PDF.]
As shown in Table 5, the DA-UNet++ model achieved the best performance in both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) on the EchoNet-Dynamic dataset. Specifically, the DSC reached 0.9316 (95% CI: 0.9303–0.9331), and the IoU was 0.4655 (95% CI: 0.4645–0.4664). Although the improvement in DSC over the original Deeplab model was 1.05%, further analysis using confidence intervals demonstrated the statistical significance and robustness of these results. In addition, the DA-UNet++ model’s DSC was 1.03% higher than that of the original Deeplab model, while its parameter count was only 15% of Deeplab’s, indicating significant efficiency gains.
Comparative analysis of inference speed and computational efficiency was also conducted. On an NVIDIA RTX 4090 GPU, the original Deeplab model achieved an inference speed of 6.48 images per second, with a computational cost of 34.76 GFLOPs and 9.16 million parameters. In contrast, the improved DA-UNet++ model increased the inference speed to 7.95 images per second, reduced computational cost to 25.74 GFLOPs, and decreased the parameter count to 5.96 million. These results demonstrate that the proposed model not only substantially reduces computational complexity but also significantly improves inference speed, making it better suited for practical deployment.
To visually illustrate the advantages of DA-UNet++ over other models, Fig 6 compares the models in terms of parameter quantity (x-axis) and segmentation performance (DSC, y-axis) on the EchoNet-Dynamic dataset. As shown, DA-UNet++ achieves the best segmentation performance while consuming fewer computing resources than competing models. Notably, although the UNet-l2 model has the fewest parameters, its DSC is only 0.07% higher than that of the traditional model, further highlighting the superior balance of efficiency and performance achieved by DA-UNet++.
[Figure omitted. See PDF.]
The comparison between DA-UNet++ and other models in terms of parameter count and segmentation performance.
Qualitative results
To demonstrate the effect of this method on segmenting left ventricles of variable shape and size, this study used six different models to segment the images, and images of two individuals were selected for display, as shown in Figs 7 and 8. Figs 7b and 8b show the EDV and ESV heart structure outlines annotated by the original model, respectively, and Figs 7g and 8g show the EDV and ESV outlines annotated using the proposed method. Compared with previous methods, the DSC of the proposed method reached 93.16%, indicating more accurate cardiac segmentation. The visualization results show that the proposed method produces more complete left ventricular segmentations, effectively handling large changes in left ventricular scale and improving segmentation accuracy.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Conclusion
To address the problems of variable ventricular shape and size and blurred tissue boundaries, an improved fusion model, DA-UNet++, was proposed for left ventricular segmentation. The algorithm optimizes the UNet module, integrates the NestedUNet network, extracts rich global and local features, and reduces computational and storage overhead. In addition, the model integrates the SCBAM attention mechanism, which recalibrates features while increasing information extraction ability and encoding multi-scale feature information, allowing the network to attend to useful features; this enhances feature extraction from ventricular video data and improves ventricular segmentation performance. The segmentation performance of the proposed method was verified on the EchoNet-Dynamic dataset: compared with the original DeepLabv3 model for medical image segmentation, the DSC improved by 1.03%, and the model size is only 15% of the original. This work addresses the limited practical applicability of medical auxiliary systems caused by high computing resource demands and large model sizes in clinical settings, and therefore has clinical application value.
While the EchoNet-Dynamic dataset provides a robust foundation for model training and evaluation, generalization to other medical datasets remains a key concern. The Pruned Nested UNet with SCBAM demonstrates versatility, making it suitable for various medical image segmentation tasks requiring fine feature extraction and attention mechanisms. However, challenges remain when applying the model to new datasets. Many medical datasets, especially in specialized fields like echocardiography, are limited in size, resolution, or annotation quality. Additionally, models trained on one type of imaging, such as cardiac ultrasound, may not transfer well to others like X-rays or MRIs due to differences in contrast, noise, and resolution. Furthermore, while attention-based models excel in performance, their interpretability still needs improvement, a key requirement for clinical applications.
Future work will focus on the following aspects to further enhance the clinical value and applicability of the model: (1) Evaluating the model’s generalization performance on other imaging modalities such as MRI and CT to verify its cross-modality adaptability; (2) Exploring deployment and optimization on edge devices (e.g., portable ultrasound equipment) to achieve real-time and efficient segmentation; (3) Improving the interpretability of the model to enhance its credibility and applicability in clinical practice. Through these research directions, the DA-UNet++ model is expected to expand its application scope and practical value in a variety of medical imaging scenarios.
Supporting information
S1 File. The main code of this article.
https://doi.org/10.1371/journal.pone.0325794.s001
(ZIP)
S2 File. Dataset.
https://doi.org/10.1371/journal.pone.0325794.s002
(ZIP)
References
1. Tsao CW, Aday AW, Almarzooq ZI, Alonso A, Beaton AZ, Bittencourt MS, et al. Heart disease and stroke statistics—2022 update: a report from the American Heart Association. Circulation. 2022;145(8):e153–639.
2. Li FY, Li W, Gao X, Xiao B. A novel framework with weighted decision map based on convolutional neural network for cardiac MR segmentation. IEEE J Biomed Health Inform. 2022;26(5):2228–39. pmid:34851840
3. Petitjean C, Dacher J-N. A review of segmentation methods in short axis cardiac MR images. Med Image Anal. 2011;15(2):169–84. pmid:21216179
4. Ouyang D, He B, Ghorbani A, Yuan N, Ebinger J, Langlotz CP, et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature. 2020;580(7802):252–6. pmid:32269341
5. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
6. Deng K, Meng Y, Gao D, Bridge J, Shen Y, Lip G, et al. TransBridge: a lightweight transformer for left ventricle segmentation in echocardiography. In: Simplifying Medical Ultrasound: Second International Workshop, ASMUS 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France. 2021. p. 63–72.
7. Zeng Y, Tsui PH, Pang K, Bin G, Li J, Lv K, et al. MAEF-Net: multi-attention efficient feature fusion network for left ventricular segmentation and quantitative analysis in two-dimensional echocardiography. Ultrasonics. 2023;127:106855.
8. Liao M, Lian Y, Yao Y, Chen L, Gao F, Xu L, et al. Left ventricle segmentation in echocardiography with transformer. Diagnostics (Basel). 2023;13(14):2365. pmid:37510109
9. Yang L, Zhang RY, Li L, Xie X. SimAM: a simple, parameter-free attention module for convolutional neural networks. In: International Conference on Machine Learning. 2021. p. 11863–74.
10. Chen Y, Tagare HD, Thiruvenkadam S, Huang F, Wilson D, Gopinath KS, et al. Using prior shapes in geometric active contours in a variational framework. Int J Comput Vis. 2002;50:315–28.
11. Barbosa D, Friboulet D, D'hooge J, Bernard O. Fast tracking of the left ventricle using global anatomical affine optical flow and local recursive block matching. MIDAS J. 2014;10:17–24.
12. Bernard O, Bosch JG, Heyde B, Alessandrini M, Barbosa D, Camarasu-Pop S, et al. Standardized evaluation system for left ventricular segmentation algorithms in 3D echocardiography. IEEE Trans Med Imaging. 2015;35(4):967–77.
13. Payer C, Štern D, Bischof H, Urschler M. Multi-label whole heart segmentation using CNNs and anatomical label configurations. In: International Workshop on Statistical Atlases and Computational Models of the Heart. Cham: Springer; 2017. p. 190–8.
14. Xu Z, Wu Z, Feng J. CFUN: combining faster R-CNN and U-Net network for efficient whole heart segmentation. arXiv preprint. 2018.
15. Tong Q, Ning M, Si W, Liao X, Qin J. 3D deeply-supervised U-Net based whole heart segmentation. In: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: 8th International Workshop, STACOM 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada. Springer; 2018. p. 224–32.
16. Yang X, Bian C, Yu L, Ni D, Heng PA. 3D convolutional networks for fully automatic fine-grained whole heart partition. In: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: 8th International Workshop, STACOM 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada. Springer; 2018. p. 181–9.
17. Ye C, Wang W, Zhang S, Wang K. Multi-depth fusion network for whole-heart CT image segmentation. IEEE Access. 2019;7:23421–9.
18. Mortazi A, Burt J, Bagci U. Multi-planar deep segmentation networks for cardiac substructures from MRI and CT. In: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: 8th International Workshop, STACOM 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada. Springer; 2018. p. 199–206.
19. Baumgartner CF, Koch LM, Pollefeys M, Konukoglu E. An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation. In: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: 8th International Workshop, STACOM 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada. Springer; 2018. p. 111–9.
20. Patravali J, Jain S, Chilamkurthy S. 2D-3D fully convolutional neural networks for cardiac MR segmentation. In: International Workshop on Statistical Atlases and Computational Models of the Heart. Springer; 2017. p. 130–9.
21. Jang Y, Hong Y, Ha S, Kim S, Chang HJ. Automatic segmentation of LV and RV in cardiac MRI. In: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: 8th International Workshop, STACOM 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada. 2018. p. 161–9.
22. Luo C, Shi C, Li X, Gao D. Cardiac MR segmentation based on sequence propagation by deep learning. PLoS One. 2020;15(4):e0230415. pmid:32271777
23. Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, et al. UNet 3+: a full-scale connected UNet for medical image segmentation. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. p. 1055–9.
24. Jha D, Riegler MA, Johansen D, Halvorsen P, Johansen HD. DoubleU-Net: a deep convolutional neural network for medical image segmentation. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS). 2020. p. 558–64.
25. Liao W, Zhu Y, Wang X, Pan C, Wang Y, Ma L. LightM-UNet: Mamba assists in lightweight UNet for medical image segmentation. arXiv preprint. 2024.
26. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
27. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 10012–22.
28. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision. 2022. p. 205–18.
29. Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, et al. UNETR: transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022. p. 574–84.
30. Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: a nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain. Springer; 2018. p. 3–11.
31. Zhou Z, Siddiquee MM, Tajbakhsh N, Liang J. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 2019;39(6):1856–67.
32. Shah KS, Xu H, Matsouaka RA, Bhatt DL, Heidenreich PA, Hernandez AF, et al. Heart failure with preserved, borderline, and reduced ejection fraction: 5-year outcomes. J Am Coll Cardiol. 2017;70(20):2476–86. pmid:29141781
Citation: Cao K, Zhao M, Geng M, Zheng S, Jung H (2025) Left ventricular segmentation method based on optimized UNet and improved CBAM: ESV and EDV tracking study. PLoS One 20(6): e0325794. https://doi.org/10.1371/journal.pone.0325794
About the Authors:
Kerang Cao
Roles: Writing – original draft, Writing – review & editing
Affiliations: College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China; Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang, China
Miao Zhao
Roles: Writing – original draft, Writing – review & editing
Affiliations: College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China; Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang, China
Minghui Geng
Roles: Software
Affiliations: College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China; Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang, China
Shuai Zheng
Roles: Visualization
Affiliations: College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China; Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang, China
Hoekyung Jung
Roles: Writing – review & editing
E-mail: [email protected]
Affiliation: Computer Engineering Department, Paichai University, Daejeon, Korea
ORCID: https://orcid.org/0000-0002-7607-1126
© 2025 Cao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
This paper introduces an optimized nested UNet model for automated left ventricular segmentation in cardiac function assessment. We utilize the EchoNet-Dynamic dataset, which contains both video data and expert annotations. Unlike conventional methods such as DeepLabv3, which struggle with large model sizes and imprecise segmentation, our proposed model introduces a deeper feature extraction module to effectively capture multi-scale features and reduce computational overhead. By integrating the CBAM (Convolutional Block Attention Module) attention mechanism with the lightweight SimAM (Simple Attention Module), we enhance feature selectivity and minimize redundancy. To further stabilize training and address gradient issues, we combine binary cross-entropy and Dice loss functions. Experimental results reveal that our model significantly outperforms existing methods, achieving a 1.05% increase in the Dice coefficient and reducing model size to 15% of the original. These improvements not only enhance the accuracy of cardiac function assessments but also provide a more efficient solution for automated diagnosis in clinical practice.