1. Introduction
The accurate segmentation of medical images, such as computed tomography and magnetic resonance imaging, is pivotal for clinical applications ranging from disease diagnosis to treatment planning [1,2,3]. Such segmentation can assist with the detection of regions of interest (ROIs) in medical images and the assessment of the morphological characteristics of the regions. A large number of image segmentation methods have been proposed in the last decade [4,5], and most of them are based on deep learning technology [6]. These deep learning-based methods have achieved remarkable segmentation performance in fully supervised settings, but their performance heavily relies on large-scale labeled image data. Manual pixel-level labeling, however, is labor-intensive and time-consuming, especially for medical images where expert knowledge is required. This greatly restricts the applications of these deep learning-based segmentation methods [7,8,9]. To alleviate the scarcity of labeled data, different deep learning technologies have been proposed to take advantage of unlabeled image data, such as self-supervised learning [10,11], semi-supervised learning [12,13,14], and weakly supervised learning [15,16]. Among these technologies, semi-supervised learning (SSL) has emerged as a promising paradigm, leveraging both labeled and unlabeled data to enhance model generalization.
Many SSL methods have been proposed to fully integrate a small amount of labeled data and a large amount of unlabeled data for accurate image segmentation. For example, Tarvainen et al. [14] proposed the classical mean teacher (MT) method, which uses an exponential moving average (EMA) scheme to align a student model and a teacher model that are enforced to produce consistent predictions for unlabeled images. Mei et al. [17] improved the MT by introducing a unique model-level residual perturbation and an exponential Dice (eDice) loss. Yu et al. [18] proposed an uncertainty-aware mean teacher (UA-MT) method that uses entropy uncertainty maps to filter out unreliable boundary predictions made by the teacher model. Adiga et al. [19] improved the UA-MT by using a pre-trained denoising auto-encoder (DAE) to generate uncertainty maps and reduce computational overhead. Li et al. [20] developed a multi-task deep learning network and introduced an adversarial loss between the predicted signed distance maps (SDMs) of labeled and unlabeled data. Luo et al. [21] proposed a dual-task consistency semi-supervised method by explicitly establishing task-level regularization. Shi et al. [22] utilized different decoders to generate certain and uncertain object regions and helped a student network learn from them with different network weights. These semi-supervised segmentation methods have the potential to handle various medical images and find promising applications, but they may suffer from relatively large segmentation errors, especially in object boundary regions. This is probably due to the fact that (1) the EMA can lead to a tight coupling between the network weights of the student and teacher models, making the two models produce very similar predictions for unlabeled images and thus suppressing the ability of the student model to learn from the predictions of the teacher model, and (2) the boundary regions of target objects are not effectively processed by the student and teacher models or by existing uncertainty strategies in these semi-supervised methods, thus leading to relatively large segmentation errors.
In this paper, we developed a novel semi-supervised learning method (called PE-MT) for accurate image segmentation based on the UA-MT by introducing a perturbation-enhanced EMA (pEMA) and a residual-guided uncertainty map (RUM) to overcome the drawbacks of the traditional EMA and entropy uncertainty map (EUM). The pEMA was used to provide proper network weights for both the student and teacher models and to alleviate the coupling effect between them via the modulus operator, while the RUM was used to highlight unreliable predictions in the boundary regions of target objects, leveraging a unique quantitative uncertainty formula, and to force the student model to focus on the remaining regions. With these two components, our developed method is expected to handle medical images of varying modalities and to obtain promising segmentation performance, as compared to the UA-MT and several other semi-supervised methods.
2. Method
2.1. Scheme Overview
Figure 1 shows the developed semi-supervised segmentation method, which introduces the pEMA and RUM to improve the learning potential of the teacher and student models in the available UA-MT. The two models share the same network backbone (e.g., U-Net or V-Net), but their network weights are updated through distinct mechanisms. Specifically, the teacher's weights are obtained from the student's weights at different training steps through the pEMA, which not only enables the teacher model to capture the information learned by the student but also reduces the coupling between the teacher and student models. With the obtained weights, the teacher model can generate a prediction for each unlabeled image. These predictions are then thresholded by the RUM to filter out unreliable regions and used as pseudo-labels for unlabeled images. With these pseudo-labels, the student model can extract a large number of discriminative features from a small number of labeled images and a large number of unlabeled images for segmentation, leveraging the supervised and unsupervised losses. Minimizing these two losses enables the student and teacher models to achieve very similar segmentation performance.
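To make the overall training flow concrete, the following sketch outlines one training iteration of a generic mean-teacher scheme of this kind; the backbone, noise magnitude, and EMA decay are illustrative assumptions, and the paper's pEMA update and RUM-based filtering (described in Sections 2.3 and 2.4) are deliberately not reproduced here.

```python
# Sketch of one mean-teacher training iteration (illustration only; the pEMA update
# and RUM-based uncertainty filtering described later are not reproduced here).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Teacher weights follow an exponential moving average of the student weights."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def train_step(student, teacher, optimizer, x_l, y_l, x_u, lam, noise_std=0.1):
    # Supervised loss on the small labeled batch.
    sup_loss = F.cross_entropy(student(x_l), y_l)

    # Consistency loss on the unlabeled batch: student vs. noise-perturbed teacher.
    with torch.no_grad():
        teacher_prob = torch.softmax(teacher(x_u + noise_std * torch.randn_like(x_u)), dim=1)
    student_prob = torch.softmax(student(x_u), dim=1)
    cons_loss = F.mse_loss(student_prob, teacher_prob)

    loss = sup_loss + lam * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher trails the student after every step
    return loss.item()
```

In practice, the teacher is initialized as a copy of the student (e.g., via `copy.deepcopy`) and excluded from the optimizer, so it is updated only through the EMA call.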
2.2. Semi-Supervised Segmentation
To minimize the supervised and unsupervised losses, we trained the developed semi-supervised method on a training set consisting of $N_L$ labeled images and $N_U$ unlabeled images. The labeled and unlabeled images can be represented by $\mathcal{D}_L=\{(x_i,y_i)\}_{i=1}^{N_L}$ and $\mathcal{D}_U=\{x_i\}_{i=N_L+1}^{N_L+N_U}$, respectively, where $x_i$ and $y_i$ denote the involved image and its label (i.e., ground truth) with specific dimensions of height $H$, width $W$, and depth $D$. With the images and labels, the total loss function for our developed method can be defined as follows:

$\mathcal{L}_{total}=\mathcal{L}_{sup}+\lambda\,\mathcal{L}_{unsup}$ (1)

$\mathcal{L}_{sup}=\frac{1}{N_L}\sum_{i=1}^{N_L}\left[\mathcal{L}_{CE}\big(f(x_i;\theta,\xi),\,y_i\big)+\mathcal{L}_{Dice}\big(f(x_i;\theta,\xi),\,y_i\big)\right]$ (2)

$\mathcal{L}_{unsup}=\frac{1}{N_U}\sum_{i=N_L+1}^{N_L+N_U}\frac{1}{N}\sum_{v\in\Omega}\big\|f_v(x_i;\theta,\xi)-f_v(x_i;\theta',\xi')\big\|^{2}$ (3)

where $\theta$ and $\theta'$ denote the network weights of the student and teacher models, and $\xi$ and $\xi'$ denote small random noises added to labeled and unlabeled images, respectively. $f(x_i;\theta,\xi)$ and $f(x_i;\theta',\xi')$ indicate the predictions of image $x_i$ obtained by the student and teacher models under the small random noises $\xi$ and $\xi'$, respectively, and $f_v(\cdot)$ denotes the prediction at pixel $v$. $\mathcal{L}_{sup}$ is the supervised loss, comprising the cross entropy (CE) and Dice loss functions [23], and $N$ denotes the number of pixels in the image domain $\Omega$. $\mathcal{L}_{unsup}$ is the unsupervised loss, used to assess the consistency between the student and teacher predictions based on the pixel-wise mean-squared error (MSE). $\lambda$ is a scalar factor used to keep the balance between $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$ and is often set to the Gaussian ramp-up function $\lambda(t)=\lambda_{max}\,e^{-5(1-t/t_{max})^{2}}$ according to previous studies [14,17], where $t$ and $t_{max}$ denote the current and maximum iteration number during network training, respectively. For simplification, the predictions of image $x_i$ obtained by the student and teacher models are represented by $f^{s}_{i}$ and $f^{t}_{i}$, respectively.
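As a hedged illustration of Equations (1)–(3), the snippet below sketches a CE-plus-Dice supervised loss and the Gaussian ramp-up weighting used in previous mean-teacher studies [14,17]; the two-channel (background/foreground) output layout and the maximum weight `lam_max` are assumptions rather than values taken from this paper.

```python
# Hedged sketch of the supervised loss (CE + Dice) and the Gaussian ramp-up weight;
# the two-channel output layout and lam_max = 0.1 are assumptions.
import math
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-5):
    """Soft Dice loss for a single foreground channel (probs and target in [0, 1])."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def supervised_loss(logits, target):
    """Cross entropy plus Dice, in the spirit of Equation (2)."""
    ce = F.cross_entropy(logits, target)
    fg_prob = torch.softmax(logits, dim=1)[:, 1]   # foreground probability map
    dice = dice_loss(fg_prob, target.float())
    return ce + dice

def consistency_weight(step, max_step, lam_max=0.1):
    """Gaussian ramp-up for the unsupervised weight lambda(t)."""
    t = min(step, max_step) / max_step
    return lam_max * math.exp(-5.0 * (1.0 - t) ** 2)
```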
2.3. The pEMA

The pEMA was derived from the EMA and used to provide a small weight perturbation for the student model so that it could obtain better accuracy and generalization capability in image segmentation. The EMA and pEMA can be separately given by
$\theta'_t=\alpha\,\theta'_{t-1}+(1-\alpha)\,\theta_t$ (4)

(5)

where $\theta'_t$ denotes the network weight of the teacher model obtained by the EMA based on the student's weight $\theta_t$ at the training step $t$, $\alpha$ and the perturbation coefficient in Equation (5) are two different scalar factors, and the perturbation term is formed with an element-wise modulus operator. Based on the two formulas, it can be seen that in the original EMA, the student's weight was obtained on a small number of labeled images and the teacher's weight was merely derived from the student's weights at different training steps. This calculation scheme made the teacher's weight very similar to the student's weight, thus limiting the efficient utilization of unlabeled images. Conversely, in the pEMA, the student model is updated based not only on labeled data at the current training step but also on a given residual perturbation between the student and teacher weights via the modulus operator. This can, to some extent, make the two models have different network weights and thus alleviate the coupling effect between them. On the other hand, the residual perturbation was closely associated with both the student and teacher weights and adaptively changed as the network was trained, which gave the pEMA the potential to improve the segmentation performance of the two models.
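The following sketch contrasts the standard EMA update of Equation (4) with a purely schematic residual perturbation of the student weights; the perturbation form and the factor `beta` are assumptions for illustration only and do not reproduce the exact pEMA formula in Equation (5).

```python
# Standard EMA update (Equation (4)) plus a schematic weight perturbation; the
# perturbation form and beta are assumptions and do NOT reproduce Equation (5).
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """theta'_t = alpha * theta'_(t-1) + (1 - alpha) * theta_t."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def perturb_student(student, teacher, beta=0.001):
    """Schematic residual-driven perturbation of the student weights (illustrative only)."""
    for s_p, t_p in zip(student.parameters(), teacher.parameters()):
        s_p.add_(beta * (s_p - t_p))  # small nudge driven by the student-teacher residual
```

The key design point is that the perturbation depends on both sets of weights and therefore changes adaptively during training, which is the property the pEMA exploits to loosen the student-teacher coupling.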
2.4. The RUM

The RUM was constructed based on multiple forward passes [24] of the teacher model under random image-level perturbations (e.g., dropout and noise) to indicate its prediction reliability for desirable objects depicted in unlabeled images. It can be given by
$\mu_c=\frac{1}{K}\sum_{k=1}^{K}f^{(k)}_{c}\big(x_i;\theta',\xi'\big)$ (6)

(7)

where $f^{(k)}_{c}(x_i;\theta',\xi')$ denotes the $k$-th forward pass of the teacher model for class $c$ in unlabeled image $x_i$, and $K$ and $C$ are the total numbers of forward passes and classes, respectively. A scalar coefficient in Equation (7) is used to adjust the mean prediction probability $\mu_c$ in the RUM. With this unique quantitative formula, our uncertainty map had a better capability to locate image regions with high uncertainty (especially boundary regions of target objects) and to highlight the prediction unreliability of these regions, as compared to the original entropy uncertainty map (EUM) in the UA-MT, which was widely used in previous studies and is defined as follows:

$U_{EUM}=-\sum_{c=1}^{C}\mu_c\log\mu_c$ (8)
Figure 2 illustrates the differences between the RUM and EUM based on the prediction probability of a pixel for a segmentation task with two classes (i.e., one class for the desirable object region and one for the background). According to Equations (7) and (8), the RUM and EUM have similar quantization curves and reach their corresponding maximum uncertainty value at a probability of 0.5, since a probability of 0.5 is often used to decide whether a pixel belongs to the object region or not in deep learning. However, the RUM has a larger maximum at a probability of 0.5 and its curve has a steeper slope, suggesting that our RUM can more quickly and accurately locate uncertain prediction regions and then exclude these regions from the unsupervised loss.
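For reference, the snippet below shows how an uncertainty map can be estimated from multiple stochastic forward passes of the teacher model, here using the entropy formulation of Equation (8) [18,24]; the number of passes and the noise level are assumptions, and the paper's RUM formula in Equation (7) is not reproduced.

```python
# Entropy uncertainty map (Equation (8)) from K stochastic forward passes of the
# teacher; K, the noise level, and the use of train-mode dropout are assumptions.
import torch

@torch.no_grad()
def entropy_uncertainty(teacher, x_u, n_passes=8, noise_std=0.1, eps=1e-6):
    """Return the mean class probabilities and the per-pixel entropy uncertainty."""
    teacher.train()                     # keep dropout active for Monte Carlo sampling
    probs = []
    for _ in range(n_passes):
        noisy = x_u + noise_std * torch.randn_like(x_u)
        probs.append(torch.softmax(teacher(noisy), dim=1))
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)   # average over the passes
    entropy = -(mean_prob * torch.log(mean_prob + eps)).sum(dim=1)
    teacher.eval()
    return mean_prob, entropy
```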
Based on the introduced RUM, we can enhance the consistency between the predictions of the student and teacher models for unlabeled images by filtering out image regions with high uncertainty in the unsupervised loss:
$\mathcal{L}_{unsup}=\frac{\sum_{v\in\Omega}\mathbb{1}\big(U_v<H\big)\,\big\|f^{s}_{v}-f^{t}_{v}\big\|^{2}}{\sum_{v\in\Omega}\mathbb{1}\big(U_v<H\big)}$ (9)

where $f^{s}_{v}$ and $f^{t}_{v}$ denote the student and teacher predictions at pixel $v$, $U_v$ is the uncertainty of pixel $v$ given by the RUM, $H$ is a given uncertainty threshold (assigned as in the UA-MT [18]), and $\mathbb{1}(\cdot)$ is an indicator function.
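A minimal sketch of the uncertainty-masked consistency loss in Equation (9) is given below, assuming the uncertainty map and threshold are supplied by the caller (the UA-MT uses a ramped threshold [18]); the tensor shapes and the small stabilizing constant are assumptions.

```python
# Minimal sketch of the uncertainty-masked consistency loss of Equation (9).
import torch

def masked_consistency_loss(student_prob, teacher_prob, uncertainty, threshold, eps=1e-6):
    """Pixel-wise MSE restricted to regions whose uncertainty is below the threshold."""
    mask = (uncertainty < threshold).float().unsqueeze(1)  # 1 where the prediction is reliable
    sq_err = (student_prob - teacher_prob) ** 2
    return (mask * sq_err).sum() / (mask.sum() * student_prob.shape[1] + eps)
```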
3. Experiments and Results

3.1. Dataset and Evaluation Metrics
In this study, we used the Left Atrial (LA) Segmentation Challenge (LASC) dataset [25] and the Automated Cardiac Diagnosis Challenge (ACDC) dataset [26] to validate the developed method. The LASC dataset consists of 100 3D gadolinium-enhanced MRI scans (GE-MRIs) and their corresponding segmentation labels, both of which have an isotropic resolution of 0.625 × 0.625 × 0.625 mm³. These GE-MRIs were normalized to zero mean and unit variance and divided into 80 scans for network training and 20 scans for performance validation, following previous studies [18]. The ACDC dataset contains end-diastolic and end-systolic short-axis cardiac cine-MRI scans of 100 patients and their corresponding segmentation masks for three tissue regions, namely the left ventricle (LV), myocardium (Myo), and right ventricle (RV). These data were divided into 70 and 30 patients' scans for network training and validation, respectively. Because of the large spacing between short-axis slices and the possible inter-slice shift caused by respiratory motion, we used U-Net to segment each slice separately, as recommended by previous studies [27]. Figure 3 illustrates the images and their corresponding labels from the LASC and ACDC datasets.
We used the available V-Net [8,18] and U-Net [7] as backbone networks for LA and cardiac segmentation, respectively, and assessed their performance [28] leveraging the Dice similarity coefficient (DSC), Jaccard coefficient (JAC), 95% Hausdorff distance (HD), and average surface distance (ASD), all of which are available in the MedPy library:

$DSC(P,G)=\frac{2\,|P\cap G|}{|P|+|G|}$ (10)

$JAC(P,G)=\frac{|P\cap G|}{|P\cup G|}$ (11)

$HD(P,G)=\max\Big\{h\big(S(P),S(G)\big),\,h\big(S(G),S(P)\big)\Big\},\quad h(A,B)=\max_{a\in A}\min_{b\in B}d(a,b)$ (12)

$ASD(P,G)=\frac{1}{|S(P)|+|S(G)|}\Big(\sum_{p\in S(P)}\min_{g\in S(G)}d(p,g)+\sum_{g\in S(G)}\min_{p\in S(P)}d(p,g)\Big)$ (13)

where $P$ and $G$ denote the prediction of a given image and its corresponding label, respectively. $S(\cdot)$ is the set of surface voxels/pixels in an image, $d(p,g)$ is the distance from point $p$ to point $g$, and $h(\cdot,\cdot)$ is the directed HD from one surface set to the other (the 95% HD reported in this study uses the 95th percentile of these directed distances instead of the maximum). The DSC and JAC metrics range from 0 to 1, where higher values denote better segmentation accuracy. Conversely, the HD and ASD are distance-based metrics (measured in voxels/pixels) bounded below by 0, where lower values correspond to smaller segmentation errors.
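The four metrics can be computed directly with the MedPy library; the snippet below is a usage sketch in which the binary prediction/label arrays and the voxel spacing are placeholders.

```python
# Usage sketch for the evaluation metrics via MedPy; arrays and spacing are placeholders.
import numpy as np
from medpy.metric.binary import dc, jc, hd95, asd

def evaluate_case(pred, label, spacing=(0.625, 0.625, 0.625)):
    """Compute DSC, JAC, 95% HD, and ASD for one binary prediction/label pair."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    return {
        "DSC": dc(pred, label),
        "JAC": jc(pred, label),
        "HD95": hd95(pred, label, voxelspacing=spacing),
        "ASD": asd(pred, label, voxelspacing=spacing),
    }
```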
3.2. Implementation Details

We implemented the developed method via PyTorch (version 1.9.1) on a platform with an NVIDIA GeForce RTX 2080 SUPER GPU for the two different segmentation tasks, based on publicly released code for the involved semi-supervised methods.
3.3. Segmentation of the LASC Dataset
Table 1 presents the results of the involved semi-supervised methods based on the V-Net backbone and the validation set of the LASC dataset for LA segmentation. It can be seen from the results that (1) our developed method obtained an average DSC of 0.8341 and 0.8729 when trained on 5% and 10% labeled data, respectively, outperforming the MT (0.7916 and 0.8631), UA-MT (0.8080 and 0.8648), SASSNet (0.8137 and 0.8623), and DTC (0.8067 and 0.8679) based on the same backbone and experimental dataset; this shows the advantage of the developed method over the other four semi-supervised methods. (2) All the involved methods performed better than the V-Net (0.5043 and 0.7610) trained solely on the involved labeled images in a fully supervised manner, suggesting the importance of unlabeled images in the semi-supervised learning framework. (3) These semi-supervised methods showed improved segmentation performance when trained on more labeled images and gradually approached the performance of the V-Net trained on all the labeled images in a fully supervised manner. Figure 4 illustrates the segmentation results of the involved methods for four different images from the LASC dataset.
3.4. Segmentation of the ACDC Dataset
Table 2 shows the results of our developed method based on the U-Net backbone and the validation set for segmenting the RV, Myo, and LV regions from the ACDC dataset in the first experiment. As demonstrated by the results, our developed method showed improved performance in the semi-supervised framework when trained on more labeled images and could compete with the U-Net trained in a fully supervised framework. Specifically, our developed method had an average DSC of 0.4166, 0.5635, and 0.6864 for the RV, Myo, and LV, respectively, when trained on 5% labeled data, and 0.6199, 0.7932, and 0.8482 for the three regions when trained on 10% labeled data. It was superior to the U-Net for all three regions when using only 5% labeled data and achieved comparable average performance when using 10% labeled data, as shown in Table 2.
Table 3 summarizes the average segmentation results of the involved semi-supervised methods for three different experiments based on the U-Net backbone and ACDC dataset. As shown by the results, these semi-supervised methods had an improved segmentation performance on the validation set of the ACDC dataset when using more labeled images for network training, and they gradually approached the fully supervised results of U-Net trained on all the labeled images. However, they had very different capabilities in extracting three object regions from the ACDC dataset. Specifically, our developed method had an average DSC of 0.5555 and 0.7538 for three different regions (i.e., the LV, Myo, and RV) when trained on 5% and 10% labeled data, respectively. It was superior to the MT (0.5457 and 0.7483) and UA-MT (0.5383 and 0.7385) but inferior to the DTC (0.5601 and 0.7842) and SASSNet (0.5897 and 0.8108) under the same experiment conditions. Figure 5 illustrates the segmentation results of the involved methods for four different images from the ACDC dataset.
3.5. Ablation Study
3.5.1. Effect of the pEMA and RUM
Table 4 summarizes the impact of the pEMA and RUM on the performance of the UA-MT for the two segmentation tasks, obtained by using the two components to replace the EMA and EUM (note that the UA-MT can be viewed as a combination of the EMA, EUM, and the student and teacher models, while the PE-MT is the variant of the UA-MT created by introducing the pEMA and RUM). It can be seen that the UA-MT achieved consistently improved average performance for the different object regions in the LASC and ACDC datasets when the pEMA and RUM replaced their corresponding original versions (i.e., the EMA and EUM). This suggested the effectiveness of our introduced pEMA and RUM, as compared to the EMA and EUM. Figure 6 shows the difference between the RUM and EUM in semi-supervised image segmentation. It can be seen that our introduced RUM can effectively identify and highlight unreliable prediction regions and suppress the adverse impact of background information far away from desirable objects, while the EUM flagged many irrelevant background regions, especially those close to object boundary regions.
3.5.2. The Parameters in the RUM and pEMA
Table 5 and Table 6 separately show the impact of the scalar coefficient in the RUM and the perturbation-related scalar factor in the pEMA on the performance of the developed method in two specific segmentation experiments based on the LASC and ACDC datasets. As demonstrated by these results, our developed method achieved better overall performance when setting the RUM coefficient to 2 for both the LASC and ACDC datasets. Fixing this coefficient at 2, our developed method obtained higher accuracy when setting the pEMA factor to 0.001 for the same segmentation tasks, as shown in Table 6.
4. Discussion
In this paper, we proposed a novel semi-supervised learning method (PE-MT) based on the UA-MT and validated it by extracting multiple cardiac regions from the public LASC and ACDC datasets. The experimental results showed that our developed method can effectively extract desirable object regions by leveraging two available network backbones (i.e., V-Net and U-Net), and it obtained promising segmentation accuracy, owing to the introduction of the pEMA and RUM, when trained on 5% (10%) labeled images and 95% (90%) unlabeled ones from the training sets. It was superior to the MT and UA-MT and could compete with the SASSNet and DTC when trained on the same number of labeled and unlabeled images from the LASC and ACDC datasets. Moreover, our method tended to have increased segmentation accuracy when trained on more labeled and unlabeled images and was able to rapidly process an unseen image at the inference stage (in around 1 s).
Our developed method was derived from the UA-MT and was superior to it under the same experimental conditions. This was mainly attributed to the introduction of the RUM and pEMA. The RUM had a reasonable capability to accurately identify regions with high uncertainty in the prediction maps of unlabeled images obtained by the teacher model. By eliminating these high-uncertainty prediction regions, the student and teacher models were able to emphasize reliable prediction regions in the calculation of the unsupervised loss and thus improve their prediction accuracy and consistency for unlabeled images. This largely enhanced the segmentation potential of the two models and excluded the impact of irrelevant information on the final performance. Moreover, the performance can be further enhanced by the introduced pEMA, since it was able not only to provide proper network weights for the teacher model but also to increase the learning flexibility of the student model by adding a small weight perturbation to suppress the coupling effect between the two models. This learning flexibility can, to some extent, facilitate the detection of various object features and improve the utilization of label information.
Despite its promising performance, our developed method was inferior to the SASSNet and DTC when extracting three different object regions from the ACDC dataset. This may be due to the fact that (1) our developed method merely employed the V-Net and U-Net to segment desirable objects and did not involve additional network branches or auxiliary learning tasks in image segmentation. In contrast, both the SASSNet and DTC used multiple network branches to simultaneously extract desirable objects and their corresponding signed distance maps in a mutually collaborative manner. This can enhance the learning procedure of specific neural networks due to the introduction of extra network parameters and auxiliary processing tasks and hence improve image segmentation accuracy. (2) The V-Net and U-Net had limited learning capability and relatively few network parameters.
Finally, there were some limitations to this study. First, our developed method was validated only with plain network backbones (i.e., V-Net and U-Net), which have relatively limited learning capability compared with other deep learning architectures such as Transformers and multi-layer perceptrons (MLPs). This can largely limit its segmentation performance and clinical application potential. Second, only a few data augmentation schemes (e.g., rotation and flipping) were used in the segmentation experiments, potentially reducing the accuracy of our developed method when segmenting medical images of varying modalities. Third, both the LASC and ACDC datasets contain a very small number of images and were further split into training and testing sets. This may have prevented our developed method from capturing the various convolutional features associated with target objects, and thus it underwent a rapid performance degradation when the number of labeled images in the training set was reduced. Last but not least, our developed method was not validated for dynamic image segmentation [29], which aims to process multiple different images at multiple instances of time or in multiple videos [30,31]. This incomplete performance validation not only limits the potential applications of the developed algorithm but also restricts its wider adoption. Despite these limitations, our model achieved promising segmentation performance on two public image datasets and surpassed the UA-MT under the same experimental configuration.
5. Conclusions
We developed a novel semi-supervised learning method (termed PE-MT) for accurate image segmentation based on a small amount of labeled data and a large amount of unlabeled data. Its novelty lies in the introduction of the pEMA and RUM and their integration with the available UA-MT. The pEMA extended the original EMA and added an adaptive weight perturbation to the student model in order to enhance its learning flexibility and effectiveness, while the RUM alleviated the drawbacks of the EUM in the UA-MT via a quantitative uncertainty formula and was used to filter out prediction regions with high uncertainty. Extensive segmentation experiments on the public LASC and ACDC datasets demonstrated that the developed method was able to effectively extract desirable objects when trained on a small number of labeled images and a large number of unlabeled images and outperformed the MT and UA-MT under the same experimental configuration.
Conceptualization, Q.Y. and L.W.; methodology, L.W.; software, W.W. (Wenquan Wang) and Z.L.; validation, X.Z., G.J. and Y.W.; formal analysis, W.W. (Wenquan Wang) and Z.L.; investigation, G.J., B.T. and S.Y.; resources, M.H. and X.X.; data curation, W.W. (Wencan Wu) and Q.Y.; writing—original draft preparation, W.W. (Wenquan Wang) and Z.L.; writing—review and editing, Q.Y. and L.W.; visualization, G.J. and B.T.; supervision, L.W. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
The data presented in this study are available upon request from the corresponding authors.
We would like to thank the anonymous reviewers for their helpful remarks that improved this paper.
The authors declare no conflicts of interest.
Figure 1 Overview of the developed semi-supervised segmentation method based on the UA-MT by introducing the unique pEMA and RUM schemes.
Figure 2 Differences between the RUM and EUM based on the prediction probability of a voxel/pixel for all the classes.
Figure 3 Illustration of images and labels in the LASC (top row) and ACDC (bottom row) datasets, respectively, where LA, Myo, LV, and RV denote the left atrium, myocardium, and left and right ventricles, respectively.
Figure 4 Segmentation results of four given images obtained by the V-Net, MT, UA-MT, SASSNet, DTC, and PE-MT, respectively, which were trained on 10% (in the first two columns) and 5% (in the last two columns) of labeled images from the LASC dataset. The red lines represent the object boundaries of the LA labels, and the yellow arrows indicate the poor segmentation.
Figure 5 Segmentation results of four different images obtained by the U-Net, MT, UA-MT, SASSNet, DTC, and PE-MT, respectively, using 10% (in the first two columns) and 5% (in the last two columns) labeled images from the ACDC dataset. The red lines represent the ground-truth boundaries, and the yellow arrows indicate the poor segmentation.
Figure 6 From top to bottom, the labels of three given images and their corresponding uncertainty maps obtained by the RUM and EUM are shown in each row, respectively, where the red circles highlight irrelevant background regions.
The LA segmentation results on the validation set in terms of the average DSC, JAC, HD and ASD, leveraging the involved methods, which were trained on different proportions of labeled data and unlabeled images from the training set of the LASC dataset.
Method | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|
V-Net | 80 | 0 | 0.9178 | 0.8485 | 4.7179 | 1.5867 |
V-Net | 4 | 0 | 0.5043 | 0.3972 | 36.3690 | 11.0264 |
MT | 4 | 76 | 0.7916 | 0.6631 | 24.8149 | 7.0991 |
UA-MT | 4 | 76 | 0.8080 | 0.6868 | 21.7672 | 6.5760 |
SASSNet | 4 | 76 | 0.8137 | 0.6924 | 27.8814 | 8.0149 |
DTC | 4 | 76 | 0.8067 | 0.6856 | 26.6678 | 7.5836 |
PE-MT | 4 | 76 | 0.8341 | 0.7225 | 18.9836 | 5.0198 |
V-Net | 8 | 0 | 0.7610 | 0.6527 | 26.9073 | 4.8357 |
MT | 8 | 72 | 0.8631 | 0.7612 | 17.9738 | 4.5731 |
UA-MT | 8 | 72 | 0.8648 | 0.7638 | 16.7100 | 4.3400 |
SASSNet | 8 | 72 | 0.8623 | 0.7612 | 13.1187 | 3.7583 |
DTC | 8 | 72 | 0.8679 | 0.7692 | 11.6410 | 3.3986 |
PE-MT | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202 |
The cardiac segmentation results on the validation set in terms of the average DSC, JAC, HD, and ASD, leveraging the developed method and U-Net in the first experiment, which were trained on different proportions (i.e., 5% and 10%) of labeled data and unlabeled images from the training set of the ACDC dataset.
Region | Method | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|---|
RV | U-Net | 3 | 0 | 0.3930 | 0.2836 | 63.1196 | 30.3970 |
RV | PE-MT | 3 | 67 | 0.4166 | 0.2998 | 62.2174 | 26.3911 |
RV | U-Net | 7 | 0 | 0.6323 | 0.5096 | 24.0267 | 8.4186 |
RV | PE-MT | 7 | 63 | 0.6199 | 0.4994 | 18.4767 | 6.1613 |
Myo | U-Net | 3 | 0 | 0.5145 | 0.3983 | 20.1485 | 6.9656 |
Myo | PE-MT | 3 | 67 | 0.5635 | 0.4432 | 18.5294 | 7.0502 |
Myo | U-Net | 7 | 0 | 0.7943 | 0.6704 | 8.6746 | 2.2788 |
Myo | PE-MT | 7 | 63 | 0.7932 | 0.6675 | 9.7917 | 2.9752 |
LV | U-Net | 3 | 0 | 0.5607 | 0.4430 | 56.9506 | 21.5382 |
LV | PE-MT | 3 | 67 | 0.6864 | 0.5819 | 38.3050 | 13.7716 |
LV | U-Net | 7 | 0 | 0.8403 | 0.7427 | 29.9437 | 8.5729 |
LV | PE-MT | 7 | 63 | 0.8482 | 0.7511 | 34.2763 | 9.3469 |
The cardiac segmentation results on the validation set in terms of the average DSC, JAC, HD, and ASD, leveraging the involved semi-supervised methods and U-Net, which were trained on different proportions of labeled data and unlabeled images from the training set of the ACDC dataset for three experiments.
Method | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|
U-Net | 70 | 0 | 0.8807 | 0.7936 | 6.4722 | 1.8963 |
U-Net | 3 | 0 | 0.4894 | 0.3750 | 46.7396 | 19.6336 |
MT | 3 | 67 | 0.5457 | 0.4333 | 43.9185 | 17.3452 |
UA-MT | 3 | 67 | 0.5383 | 0.4272 | 41.3736 | 16.0410 |
SASSNet | 3 | 67 | 0.5897 | 0.4752 | 23.3788 | 8.5670 |
DTC | 3 | 67 | 0.5601 | 0.4511 | 26.4061 | 11.1162 |
PE-MT | 3 | 67 | 0.5555 | 0.4416 | 39.6839 | 15.7376 |
U-Net | 7 | 0 | 0.7556 | 0.6409 | 20.8817 | 6.4234 |
MT | 7 | 63 | 0.7483 | 0.6340 | 20.2368 | 5.6540 |
UA-MT | 7 | 63 | 0.7385 | 0.6199 | 21.0633 | 5.9992 |
SASSNet | 7 | 63 | 0.8108 | 0.7074 | 12.3803 | 3.6314 |
DTC | 7 | 63 | 0.7842 | 0.6842 | 10.1061 | 3.0190 |
PE-MT | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611 |
Performance of the UA-MT trained on 10% labeled data and 90% unlabeled data from the training set in the LASC and ACDC datasets by using the pEMA and RUM to replace the EMA and EUM, respectively.
Dataset | Method | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|---|
LASC | UA-MT | 8 | 72 | 0.8648 | 0.7638 | 16.7100 | 4.3400 |
LASC | UA-MT + RUM | 8 | 72 | 0.8724 | 0.7753 | 14.4020 | 3.7612 |
LASC | UA-MT + RUM + pEMA | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202 |
ACDC | UA-MT | 7 | 63 | 0.7385 | 0.6199 | 21.0633 | 5.9992 |
ACDC | UA-MT + RUM | 7 | 63 | 0.7429 | 0.6237 | 25.2195 | 7.3287 |
ACDC | UA-MT + RUM + pEMA | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611 |
Performance of the PE-MT when setting different values for the scalar coefficient in the RUM, based on 10% labeled data from the LASC and ACDC training sets.
Dataset | Coefficient Value | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|---|
LASC | 1 | 8 | 72 | 0.8615 | 0.7586 | 16.4457 | 3.9698 |
LASC | 2 | 8 | 72 | 0.8724 | 0.7753 | 14.4020 | 3.7612 |
LASC | 3 | 8 | 72 | 0.8631 | 0.7623 | 14.7983 | 3.7027 |
ACDC | 1 | 7 | 63 | 0.7229 | 0.6109 | 21.0683 | 6.5155 |
ACDC | 2 | 7 | 63 | 0.7429 | 0.6237 | 25.2195 | 7.3287 |
ACDC | 3 | 7 | 63 | 0.7297 | 0.6142 | 25.6428 | 7.4772 |
Performance of the PE-MT when setting different values for the perturbation-related scalar factor in the pEMA, based on 10% labeled data from the LASC and ACDC training sets.
Dataset | Factor Value | Labeled | Unlabeled | DSC | JAC | HD | ASD |
---|---|---|---|---|---|---|---|
LASC | 0.005 | 8 | 72 | 0.7440 | 0.6084 | 21.5900 | 5.4993 |
LASC | 0.001 | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202 |
LASC | 0.0005 | 8 | 72 | 0.8590 | 0.7550 | 17.7567 | 4.6438 |
LASC | 0.0001 | 8 | 72 | 0.8630 | 0.7616 | 18.6198 | 4.5289 |
ACDC | 0.005 | 7 | 63 | 0.7026 | 0.5746 | 31.8241 | 11.4408 |
ACDC | 0.001 | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611 |
ACDC | 0.0005 | 7 | 63 | 0.7248 | 0.6077 | 22.9209 | 6.6283 |
ACDC | 0.0001 | 7 | 63 | 0.7449 | 0.6246 | 25.7077 | 7.4602 |
1. Wang, Y.; Zhou, Y.; Shen, W.; Park, S.; Fishman, E.; Yuille, A. Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. Med. Image Anal.; 2019; 55, pp. 88-102. [DOI: https://dx.doi.org/10.1016/j.media.2019.04.005] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31035060]
2. Luo, X.; Wang, G.; Song, T.; Zhang, J.; Zhang, S. MIDeepSeg: Minimally interactive segmentation of unseen objects from medical images using deep learning. Med. Image Anal.; 2021; 72, 102102. [DOI: https://dx.doi.org/10.1016/j.media.2021.102102] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34118654]
3. Wang, G.; Zuluaga, M.; Li, W.; Rosalind, P.; Patel, P.; Michael, A.; Tom, D.; Divid, A.; Jan, D.; Sebastien, O. DeepIGeoS: A deep interactive geodesic framework for medical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.; 2018; 41, pp. 1559-1572. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2840695] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29993532]
4. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44, pp. 3523-3542. [DOI: https://dx.doi.org/10.1109/TPAMI.2021.3059968] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33596172]
5. Jiao, R.; Zhang, Y.; Ding, L.; Xue, B.; Zhang, J.; Cai, R.; Jin, C. Learning with limited annotations: A survey on deep semi-supervised learning for medical image segmentation. Comput. Biol. Med.; 2024; 169, 107840. [DOI: https://dx.doi.org/10.1016/j.compbiomed.2023.107840] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38157773]
6. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw.; 2015; 61, pp. 85-117. [DOI: https://dx.doi.org/10.1016/j.neunet.2014.09.003] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25462637]
7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Munich, Germany, 5–9 October 2015; pp. 234-241. [DOI: https://dx.doi.org/10.1007/978-3-319-24574-4_28]
8. Milletari, F.; Navab, N.; Ahmadi, S. V-net: Fully convolutional neural networks for volumetric medical image segmentation. Proceedings of the International Conference on 3D Vision (3DV); Stanford, CA, USA, 25–28 October 2016; pp. 565-571. [DOI: https://dx.doi.org/10.1109/3DV.2016.79]
9. Dong, B.; Wang, W.; Fan, D.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv; 2021; arXiv: 2108.06932[DOI: https://dx.doi.org/10.26599/AIR.2023.9150015]
10. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning; Online, 13–18 July 2020; pp. 1597-1607. [DOI: https://dx.doi.org/10.48550/arXiv.2002.05709]
11. Grill, J.; Strub, F.; Altche, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.; Guo, Z.; Azar, M. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 21271-21284. [DOI: https://dx.doi.org/10.48550/arXiv.2006.07733]
12. Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. arXiv; 2016; [DOI: https://dx.doi.org/10.48550/arXiv.1610.02242] arXiv: 1610.02242
13. Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. St++: Make self-training work better for semi-supervised semantic segmentation. Proceedings of the Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 4268-4277. [DOI: https://dx.doi.org/10.48550/arXiv.2106.05095]
14. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst.; 2017; 30, pp. 1195-1204. [DOI: https://dx.doi.org/10.48550/arXiv.1703.01780]
15. Saleh, F.; Aliakbarian, M.; Salzmann, M.; Petersson, L.; Gould, S.; Alvarez, J. Built-in foreground/background prior for weakly-supervised semantic segmentation. Proceedings of the ECCV; Amsterdam, The Netherlands, 11–14 October 2016; pp. 413-432. [DOI: https://dx.doi.org/10.1007/978-3-319-46484-8_25]
16. Yang, R.; Song, L.; Ge, Y.; Li, X. BoxSnake: Polygonal Instance Segmentation with Box Supervision. Proceedings of the International Conference on Computer Vision (ICCV); Paris, France, 2–3 October 2023; pp. 766-776. [DOI: https://dx.doi.org/10.48550/arXiv.2303.11630]
17. Mei, C.; Yang, X.; Zhou, M.; Zhang, S.; Chen, H.; Yang, X.; Wang, L. Semi-supervised image segmentation using a residual-driven mean teacher and an exponential Dice loss. Artif. Intell. Med.; 2024; 148, 102757. [DOI: https://dx.doi.org/10.1016/j.artmed.2023.102757] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38325920]
18. Yu, L.; Wang, S.; Li, X.; Fu, C.; Heng, P. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. Proceedings of the Medical Image Computing and Computer Assisted Intervention; Shenzhen, China, 13–17 October 2019; pp. 605-613. [DOI: https://dx.doi.org/10.1007/978-3-030-32245-8_67]
19. Adiga, S.; Dolz, J.; Lombaert, H. Leveraging labeling representations in uncertainty-based semi-supervised segmentation. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention; Singapore, 18–22 September 2022; pp. 265-275. [DOI: https://dx.doi.org/10.1007/978-3-031-16452-1_26]
20. Li, S.; Zhang, C.; He, X. Shape-aware semi-supervised 3D semantic segmentation for medical images. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention; Lima, Peru, 4–8 October 2020; pp. 552-561. [DOI: https://dx.doi.org/10.1007/978-3-030-59710-8_54]
21. Luo, X.; Chen, J.; Song, T.; Chen, Y.; Zhang, S. Semi-supervised medical image segmentation through dual-task consistency. Proceedings of the AAAI Conference on Artificial Intelligence; Online, 2–9 February 2021; Volume 35, pp. 8801-8809. [DOI: https://dx.doi.org/10.48550/arXiv.2009.04448]
22. Shi, Y.; Zhang, J.; Ling, T.; Lu, J.; Zheng, Y.; Yu, Q.; Gao, Y. Inconsistency-aware uncertainty estimation for semi-supervised medical image segmentation. IEEE Trans. Med. Imaging; 2021; 41, pp. 608-620. [DOI: https://dx.doi.org/10.1109/TMI.2021.3117888] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34606452]
23. Zheng, Y.; Tian, B.; Yu, S.; Yang, X.; Yu, Q.; Zhou, J.; Jiang, G.; Zheng, Q.; Pu, J.; Wang, L. Adaptive boundary-enhanced Dice loss for image segmentation. Biomed. Signal Process. Control; 2025; 106, 107741. [DOI: https://dx.doi.org/10.1016/j.bspc.2025.107741] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40061446]
24. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision?. Adv. Neural Inf. Process. Syst.; 2017; 30, pp. 5580-5590. [DOI: https://dx.doi.org/10.48550/arXiv.1703.04977]
25. Xiong, Z.; Xia, Q.; Hu, Z.; Huang, N.; Zhao, J. A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging. Med. Image Anal.; 2021; 67, 101832. [DOI: https://dx.doi.org/10.1016/j.media.2020.101832] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33166776]
26. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?. IEEE Trans. Med. Imaging; 2018; 37, pp. 2514-2525. [DOI: https://dx.doi.org/10.1109/TMI.2018.2837502] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29994302]
27. Bai, W.; Oktay, O.; Sinclair, M.; Suzuki, H.; Rajchl, M.; Tarroni, G.; Glocker, B.; King, A.; Matthews, P.; Rueckert, D. Semi-supervised learning for network-based cardiac MR image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention; Quebec City, QC, Canada, 10–14 September 2017; pp. 253-260. [DOI: https://dx.doi.org/10.1007/978-3-319-66185-8_29]
28. Taha, A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging; 2015; 15, 29. [DOI: https://dx.doi.org/10.1186/s12880-015-0068-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26263899]
29. Meyer, P.; Cherstvy, A.; Seckler, H.; Hering, R.; Blaum, N.; Jeltsch, F.; Metzler, R. Directedeness, correlations, and daily cycles in springbok motion: From data via stochastic models to movement prediction. Phys. Rev. Res.; 2023; 5, 043129. [DOI: https://dx.doi.org/10.1103/PhysRevResearch.5.043129]
30. Zheng, Q.; Li, Z.; Zhang, J.; Mei, C.; Li, G.; Wang, L. Automated segmentation of palpebral fissures from eye videography using a texture fusion neural network. Biomed. Signal Process. Control; 2023; 85, 104820. [DOI: https://dx.doi.org/10.1016/j.bspc.2023.104820]
31. Zheng, Q.; Zhang, X.; Zhang, J.; Bai, F.; Huang, S.; Pu, J.; Chen, W.; Wang, L. A texture-aware U-Net for identifying incomplete blinking from eye videography. Biomed. Signal Process. Control; 2022; 75, 103630. [DOI: https://dx.doi.org/10.1016/j.bspc.2022.103630] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36127930]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The accurate segmentation of medical images is of great importance in many clinical applications and is generally achieved by training deep learning networks on a large number of labeled images. However, it is very hard to obtain enough labeled images. In this paper, we develop a novel semi-supervised segmentation method (called PE-MT) based on the uncertainty-aware mean teacher (UA-MT) framework by introducing a perturbation-enhanced exponential moving average (pEMA) and a residual-guided uncertainty map (RUM) to enhance the performance of the student and teacher models. The former is used to alleviate the coupling effect between the student and teacher models in the UA-MT by adding different weight perturbations to them, and the latter can accurately locate image regions with high uncertainty via a unique quantitative formula and then highlight these regions effectively in image segmentation. We evaluated the developed method by extracting four different cardiac regions from the public LASC and ACDC datasets. The experimental results showed that our developed method achieved an average Dice similarity coefficient (DSC) of 0.6252 and 0.7836 for the four object regions when trained on 5% and 10% labeled images, respectively. It outperformed the UA-MT and could compete with several existing semi-supervised learning methods (e.g., SASSNet and DTC).
1 Wenzhou Third Clinical Institute Affiliated to Wenzhou Medical University, The Third Affiliated Hospital of Shanghai University, Wenzhou People’s Hospital, Wenzhou 325041, China; [email protected] (W.W.); [email protected] (M.H.); [email protected] (X.X.)
2 Ningbo Key Laboratory of Medical Research on Blinding Eye Diseases, Ningbo Eye Institute, Ningbo Eye Hospital, Wenzhou Medical University, Ningbo 315040, China; [email protected]
3 The Business School, The University of Sydney, Sydney 2006, Australia; [email protected]
4 National Engineering Research Center of Ophthalmology and Optometry, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China; [email protected] (G.J.); [email protected] (S.Y.); [email protected] (B.T.); [email protected] (W.W.)
5 School of Biomedical Engineering and Imaging Sciences, King’s College London, London WC2R 2LS, UK; [email protected]
6 National Engineering Research Center of Ophthalmology and Optometry, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China; [email protected] (G.J.); [email protected] (S.Y.); [email protected] (B.T.); [email protected] (W.W.), National Clinical Research Center for Ocular Diseases, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China