1. Introduction
The core goal of speech enhancement is to suppress background noise and improve the quality and intelligibility of noisy speech signals. Traditional speech enhancement methods are typically divided into time–frequency (T-F) domain filtering methods and statistical model-based methods. Wiener filtering and the spectral minimum mean square error (MMSE) estimator proposed by Ephraim and Malah [1] are pioneering works in statistical modeling and laid the foundation for this area. Recent advances in generalized kernel methods, including the works of Soares et al. [2] and May et al. [3], have also brought new progress, further expanding the toolkit for speech enhancement and recognition tasks. T-F domain filtering methods include spectral subtraction [4,5,6,7], subspace methods [8,9], Wiener filtering [10,11,12,13,14], and minimum mean square error estimation [15,16]. These methods usually assume that the noise is stationary and achieve noise reduction by attenuating the noise spectrum. However, real-world noise is typically non-stationary and lacks the structured patterns that synthetically generated noise may still exhibit. Another category of methods is based on statistical models, such as hidden Markov models [17] and Gaussian mixture models [18], which rely on specific acoustic assumptions and probabilistic model structures. In real scenarios, however, noise is highly variable, making such acoustic assumptions difficult to satisfy and thereby limiting denoising performance.
Unlike traditional methods, deep learning techniques [19,20,21,22] can automatically learn complex patterns from large-scale datasets and thus achieve better speech enhancement performance. Deep learning models are generally categorized into discriminative models (D) and generative models (G). The core idea of a discriminative model is to learn a mapping from noisy speech to clean speech, typically using time-domain or frequency-domain methods [23,24,25,26,27], complex spectral mapping [28], or direct waveform processing in the time domain [29,30,31]. In contrast, generative models learn the underlying statistical properties of clean speech, enabling them to perform well even when the training and test data differ. Typical generative approaches include generative adversarial networks [32,33,34] and variational autoencoders [35,36].
Diffusion models have been studied in many areas of speech processing. Zhang et al. [37] proposed a method for restoring degraded speech using an improved diffusion model, modifying the DiffWave architecture to better recover the original speech signal. Serrà et al. [38] proposed a generative model that combines score-based diffusion with a multi-resolution conditioning network enhanced by a mixture density network, capable of handling 55 different types of distortion simultaneously. Richter et al. [39] proposed an audiovisual speech enhancement system that leverages a score-based diffusion model informed by visual cues and uses audiovisual embeddings derived from a self-supervised learning model fine-tuned for lip reading; their system improves speech quality and reduces generation artifacts such as phonetic confusions. Yang et al. [40] proposed a unified speech enhancement and editing model using conditional diffusion models to handle various tasks in a generative manner. Building on these developments, this paper adopts a score-based diffusion model, which consists of a forward process that gradually corrupts clean speech with noise and a reverse process that starts from the noisy input and iteratively estimates the original signal. However, diffusion-based methods may struggle to generalize to unfamiliar conditions. To address this problem, we integrate an efficient multi-scale attention (EMA) mechanism into the score-based diffusion model; this approach effectively captures multi-level information and utilizes a U-Net structure for speech restoration.
The main contributions of this study are as follows: We focus on the score-based diffusion model and the multi-resolution U-Net model and modify the U-Net structure to better handle speech enhancement tasks. We introduce an EMA mechanism that captures both multi-scale contextual information and local details, thereby improving speech enhancement performance. To verify the effectiveness of the proposed method, we conducted experiments on the VB-DMD and TIMIT-TUT datasets; the results show that our SGM-EMA method performs well on the speech enhancement task. We also conducted ablation studies to evaluate the contributions of the predictor choice in the PC sampler and of the EMA mechanism to overall model performance.
2. Methodology
2.1. Diffusion Process Based on SDE
In speech enhancement tasks, clean speech is defined as the original speech signal, which includes only the speaker’s voice and excludes any external interference sources. Noisy speech refers to a speech signal that is mixed with other sound sources in addition to the speaker’s voice. The diffusion process consists of two stages: the forward process and the reverse process, as shown in Figure 1.
(1). Forward process: The forward process is implemented by gradually adding Gaussian noise to clean speech. However, as the model is specifically designed for Gaussian noise, it may not generalize well to non-Gaussian or more complex real-world noise conditions. Following the approach of Song et al. [41], we designed a random diffusion process {x_t}, t ∈ [0, T], which is the solution to the following linear stochastic differential equation (SDE):

dx_t = γ (y − x_t) dt + g(t) dw (1)

where x_t represents the current speech state, t is a continuous time step variable describing the progress of the process, y is the noisy or reverberant speech, w is the standard Wiener process, x_0 represents the clean speech, and the process converges toward a Gaussian distribution centered on the noisy speech y. We incorporated the noisy speech into the SDE by modifying the drift coefficient to γ(y − x_t), where γ is a constant, called the stiffness, which controls the transition from x_0 to y. The diffusion coefficient g(t) controls the amount of Gaussian white noise injected at each time step and is defined as follows:

g(t) = σ_min (σ_max / σ_min)^t √(2 ln(σ_max / σ_min)) (2)
where σ_min and σ_max are parameters that define the noise schedule of the Wiener process.

(2). Reverse process: The reverse process gradually removes noise from the noisy signal and estimates a signal close to the original clean speech. According to Song et al. [41] and Anderson [42], the SDE in Equation (1) has an associated reverse SDE. Denoising is achieved by solving the following differential equation backward in time from t = T to t = 0:

dx_t = [γ (y − x_t) − g(t)² ∇_{x_t} log p_t(x_t | y)] dt + g(t) dw̄ (3)

where w̄ is a standard Wiener process evolving backward in time, and the score function ∇_{x_t} log p_t(x_t | y) is a term approximated by a DNN, called the score model. We denote the score model as s_θ(x_t, y, t), and by substituting it into the reverse SDE in Equation (3), we obtain the so-called plug-in reverse SDE:

dx_t = [γ (y − x_t) − g(t)² s_θ(x_t, y, t)] dt + g(t) dw̄ (4)
Sampling is initialized as follows:

x_T ∼ N_C(x_T; y, σ(T)² I) (5)

where N_C(x_T; y, σ(T)² I) is the severely corrupted data distribution centered on the noisy speech y, I is the identity matrix, and N_C denotes the circularly symmetric complex normal distribution. Once the score model has been trained, the reverse SDE defined in Equation (4) can be used to iteratively estimate clean speech using the predictor–corrector sampling algorithm proposed by Song et al. [41].

(3). Training objective: The objective function for training the score model is derived as follows. Based on Equation (1), the forward diffusion process defines a Gaussian process, allowing the mean and variance of the process state to be determined when the initial conditions are known [43]. Consequently, this allows for direct sampling of x_t at any time step, given x_0 and y, by using the perturbation kernel:

p_{0t}(x_t | x_0, y) = N_C(x_t; μ(x_0, y, t), σ(t)² I) (6)

where N_C is a circularly symmetric complex normal distribution and I is the identity matrix. We use Equations (5.50) and (5.53) in Särkkä and Solin [43] to determine the mean,

μ(x_0, y, t) = e^{−γt} x_0 + (1 − e^{−γt}) y (7)

and the variance of the closed-form solution,

σ(t)² = σ_min² [(σ_max/σ_min)^{2t} − e^{−2γt}] ln(σ_max/σ_min) / (γ + ln(σ_max/σ_min)) (8)
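To make the closed-form statistics above concrete, the following Python sketch computes g(t), μ(x_0, y, t), and σ(t) as given in Equations (2), (7), and (8). The function names and the schedule constants SIGMA_MIN, SIGMA_MAX, and GAMMA are illustrative assumptions and are not claimed to match the exact configuration used in our experiments.

```python
import torch

# Illustrative schedule constants (assumed values, not the exact experimental setting).
SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5

def diffusion_coeff(t):
    """g(t) of Equation (2): amount of noise injected per unit time at diffusion time t."""
    ratio = SIGMA_MAX / SIGMA_MIN
    return SIGMA_MIN * ratio ** t * (2.0 * torch.log(torch.tensor(ratio))) ** 0.5

def perturbation_stats(x0, y, t):
    """Mean (Eq. (7)) and standard deviation (Eq. (8)) of the kernel p_0t(x_t | x_0, y).

    x0, y: clean and noisy speech tensors; t: diffusion time tensor, broadcastable to x0.
    """
    ratio = torch.tensor(SIGMA_MAX / SIGMA_MIN)
    decay = torch.exp(-GAMMA * t)                        # e^{-gamma * t}
    mean = decay * x0 + (1.0 - decay) * y                # interpolation from clean toward noisy speech
    var = (SIGMA_MIN ** 2
           * (ratio ** (2.0 * t) - torch.exp(-2.0 * GAMMA * t))
           * torch.log(ratio)
           / (GAMMA + torch.log(ratio)))
    return mean, var.sqrt()
```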
Vincent [44] showed that fitting the score model to the score of the perturbation kernel (denoising score matching) is equivalent to explicit score matching [45] under mild regularity conditions, so the perturbation kernel provides a tractable training target. Accordingly, x_t can be efficiently sampled as

x_t = μ(x_0, y, t) + σ(t) z (9)
where z ∼ N_C(0, I). Using the score-matching principle [44], we write the Gaussian perturbation kernel of Equation (6) explicitly and take its logarithmic gradient:

p_{0t}(x_t | x_0, y) ∝ exp(−‖x_t − μ(x_0, y, t)‖² / σ(t)²) (10)

log p_{0t}(x_t | x_0, y) = −‖x_t − μ(x_0, y, t)‖² / σ(t)² + C (11)

∇_{x_t} log p_{0t}(x_t | x_0, y) = −(x_t − μ(x_0, y, t)) / σ(t)² (12)

Substituting Equation (9) into Equation (12), we obtain

∇_{x_t} log p_{0t}(x_t | x_0, y) = −(μ(x_0, y, t) + σ(t) z − μ(x_0, y, t)) / σ(t)² (13)

Therefore, Equation (12) can be rewritten as

∇_{x_t} log p_{0t}(x_t | x_0, y) = −σ(t) z / σ(t)² (14)

Simplifying the expression yields

∇_{x_t} log p_{0t}(x_t | x_0, y) = −z / σ(t) (15)
After x_t is input into the score model (see Equation (4)), the final loss is the unweighted L2 loss between the model output s_θ(x_t, y, t) and the score of the perturbation kernel derived in Equation (15). The overall training objective is given by Equation (16):

θ* = arg min_θ E_{t, (x_0, y), z, x_t | (x_0, y)} [ ‖ s_θ(x_t, y, t) + z / σ(t) ‖₂² ] (16)
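As a concrete illustration of the training objective in Equation (16), the sketch below performs one denoising score-matching loss evaluation. It reuses perturbation_stats() from the earlier sketch, treats the spectrogram tensors as real-valued stand-ins for the complex representation, and assumes a placeholder network score_model(x_t, y, t); none of these names refer to the authors' actual implementation.

```python
import torch

def dsm_loss(score_model, x0, y, eps=1e-5):
    """One evaluation of the denoising score-matching loss of Equation (16)."""
    batch = x0.shape[0]
    # Sample one diffusion time per example, kept away from t = 0 for numerical stability.
    t = torch.rand(batch, device=x0.device) * (1.0 - eps) + eps
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))         # broadcastable over the feature dimensions
    mean, sigma = perturbation_stats(x0, y, t_b)       # Eq. (7) and Eq. (8)
    z = torch.randn_like(x0)                           # z ~ N(0, I)
    x_t = mean + sigma * z                             # Eq. (9)
    score = score_model(x_t, y, t)                     # s_theta(x_t, y, t)
    target = -z / sigma                                # score of the perturbation kernel, Eq. (15)
    return ((score - target) ** 2).flatten(1).sum(dim=1).mean()
```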
2.2. Numerical SDE Solver
In the speech enhancement task, the accuracy and stability of the numerical SDE solver directly affect the performance of the model. This section elaborates on the solution method for numerical SDEs.
When dealing with the numerical solution of stochastic differential equations, researchers have developed a variety of approximation methods based on discrete time steps. In general, the time interval [0, T] is divided into N equal parts, each of length Δt = T/N. On this grid, the original continuous formulation is transformed into a discrete sequence {x_N, x_{N−1}, …, x_0} to be solved numerically. Among the many single-step methods, the Euler–Maruyama method is widely used: in each iteration, it takes the state at the previous time step and combines the drift and Brownian-motion terms to infer the state at the current time step, as sketched below. The method is computationally straightforward, but the choice of step size has a direct impact on the accuracy of the result.
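The following minimal sketch shows a single reverse-time Euler–Maruyama update of the plug-in reverse SDE in Equation (4). It assumes the helpers diffusion_coeff() and GAMMA from the earlier sketch and a placeholder score_model; step-size handling is deliberately simplified.

```python
import torch

def euler_maruyama_step(x, y, t, dt, score_model):
    """One reverse-time Euler-Maruyama step of Equation (4).

    x, y : current state and noisy-speech conditioner
    t    : current diffusion time (scalar tensor); dt : positive step size
    """
    g = diffusion_coeff(t)
    drift = GAMMA * (y - x) - g ** 2 * score_model(x, y, t)
    z = torch.randn_like(x)
    # Integrate backward in time: move from t toward t - dt.
    return x - drift * dt + g * (dt ** 0.5) * z
```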
In this study, we used the predictor–corrector (PC) sampler proposed by Song et al. [41]. The core idea of this sampler is to combine the numerical solution of the reverse SDE with score-based refinement methods, such as annealed Langevin dynamics [46]. Algorithm 1 [41] outlines the principle of the PC sampler, which consists of two components: the predictor and the corrector. The predictor can use various single-step methods, and its primary function is to approximate the reverse SDE through iterative computation. After each predictor iteration, the current state of the process is further refined by the corrector. The corrector is essentially a stochastic gradient ascent step based on Markov chain Monte Carlo sampling: in each iteration, a small adjustment is made along the gradient direction of the estimated score function, and a small amount of noise is injected to keep the sampling points near the target distribution, improving generation quality and stability. The solver was implemented in Python 3.12.3.
In this study, we explore two different numerical methods for the predictor: the Euler–Maruyama method, a classic single-step method for approximating SDEs, and the reverse diffusion method, which offers higher accuracy in solving the reverse SDE. Through iterative updates, the predictor guides the sampling trajectory toward cleaner speech representations. Our ablation studies demonstrate that the choice of method in the predictor impacts speech enhancement quality.
Algorithm 1: Predictor–Corrector (PC) sampling (after Song et al. [41]; see the original reference for the full pseudocode).
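For illustration, the sketch below combines a predictor step on Equation (4) with one annealed-Langevin corrector step per iteration, in the spirit of Algorithm 1. The initialization follows Equation (5); the step-size rule, the number of steps, and the helper names (score_model, diffusion_coeff, perturbation_stats, GAMMA) are assumptions carried over from the earlier sketches, not the authors' exact implementation.

```python
import torch

def pc_sample(score_model, y, n_steps=30, snr=0.5, t_eps=0.03):
    """Predictor-corrector sampling sketch for the plug-in reverse SDE."""
    _, sigma_T = perturbation_stats(y, y, torch.tensor(1.0))
    x = y + sigma_T * torch.randn_like(y)              # initialization, Eq. (5)
    ts = torch.linspace(1.0, t_eps, n_steps)
    dt = (1.0 - t_eps) / (n_steps - 1)
    for t in ts:
        g = diffusion_coeff(t)
        # Predictor: one Euler-Maruyama step of Eq. (4); the reverse diffusion predictor
        # of Song et al. [41] replaces this with the discretized reverse transition.
        score = score_model(x, y, t)
        x = x - (GAMMA * (y - x) - g ** 2 * score) * dt + g * (dt ** 0.5) * torch.randn_like(x)
        # Corrector: one annealed Langevin MCMC step along the estimated score.
        score = score_model(x, y, t)
        noise = torch.randn_like(x)
        step = 2.0 * (snr * noise.norm() / score.norm()) ** 2
        x = x + step * score + (2.0 * step) ** 0.5 * noise
    return x
```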
2.3. EMA Mechanism
The basic principle of the traditional attention mechanism is to assign a weight to each element of the input sequence and perform a weighted summation to extract the key features. It has been shown [47,48] that capturing multi-level contextual information is crucial for improving the performance of speech enhancement models based on deep neural networks. The EMA (efficient multi-scale attention) mechanism [49] dynamically generates attention weights through multi-scale parallel sub-networks and cross-spatial learning; despite the shared acronym, it is unrelated to the exponential moving average of model weights used for sampling. To compute the attention weights, EMA incorporates three parallel pathways, with two in the 1 × 1 branch and one in the 3 × 3 branch, as shown in Figure 2 [49].
In the EMA mechanism, the global spatial information from the 1 × 1 convolution branch is processed by applying Softmax activation to generate channel-wise attention weights, while the 3 × 3 convolution branch output is reshaped to match the dimensions of the 1 × 1 branch. The features from these two branches are then combined through matrix multiplication, leading to the formation of the first spatial attention map. At the same time, a 2D average pooling operation is performed on the 3 × 3 branch, and the resulting features are passed through a Softmax function to produce another set of attention weights. Similarly, the 1 × 1 branch output is reshaped, and matrix multiplication between the two branches yields the second spatial attention map. To refine the spatial attention distribution, the EMA module aggregates the feature outputs guided by the attention maps from both branches. This design allows the model to exploit complementary spatial features learned in parallel, enhancing its ability to capture richer contextual information. In this way, when the model performs attention calculations, it not only considers the current input features but also effectively integrates contextual information from previous layers.
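To make the data flow of the two branches concrete, the following simplified sketch follows the description above: directional pooling and a shared 1 × 1 convolution in one branch, a 3 × 3 convolution in the other, and two cross-branch spatial attention maps that are summed and used to reweight the grouped features. Group count, normalization choices, and layer sizes are illustrative assumptions; the original module is defined in Ouyang et al. [49].

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified sketch of the cross-spatial EMA module (channels must be divisible by groups)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // groups
        self.gn = nn.GroupNorm(c, c)                      # per-group feature normalization
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))     # pool along the width axis
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))     # pool along the height axis
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.agg_pool = nn.AdaptiveAvgPool2d(1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, ch, h, w = x.shape
        g = self.groups
        xg = x.reshape(b * g, ch // g, h, w)
        # 1x1 branch: directional pooling -> shared 1x1 conv -> gated, normalized features.
        x_h = self.pool_h(xg)                             # (b*g, c, h, 1)
        x_w = self.pool_w(xg).permute(0, 1, 3, 2)         # (b*g, c, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(xg * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale context.
        x2 = self.conv3x3(xg)
        # Cross-spatial learning: each branch's pooled, softmaxed channel vector
        # attends over the other branch's spatial map.
        a1 = self.softmax(self.agg_pool(x1).reshape(b * g, -1, 1).permute(0, 2, 1))
        m1 = torch.matmul(a1, x2.reshape(b * g, ch // g, h * w))
        a2 = self.softmax(self.agg_pool(x2).reshape(b * g, -1, 1).permute(0, 2, 1))
        m2 = torch.matmul(a2, x1.reshape(b * g, ch // g, h * w))
        weights = (m1 + m2).reshape(b * g, 1, h, w).sigmoid()
        return (xg * weights).reshape(b, ch, h, w)
```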
It is particularly worth mentioning that the EMA mechanism adopts a parallel substructure design, which effectively avoids excessive sequential processing and eliminates the need for increased network depth. This reduces redundant sequence processing steps and decreases instability during training, thereby improving the overall performance of the model. In addition, the parallelization of convolution operations further strengthens the expressive capacity of the structure. By combining the parallel computing methods of 3 × 3 convolution and 1 × 1 convolution, the model can simultaneously capture local short-term dependencies and global long-term dependencies, thereby merging more contextual information into the intermediate feature map. The EMA module will be further evaluated through ablation studies in Section 4.5.
2.4. Multi-Resolution U-Net Network
In recent years, significant progress has been made in enhancing acoustic quality through deep learning methods, with the U-Net architecture emerging as a particularly effective model [50,51]. As illustrated in Figure 3, the U-Net consists of three main components: an encoder, a decoder, and skip connections. Specifically, the encoder progressively downsamples the input audio to extract high-level acoustic representations, while the decoder performs upsampling to reconstruct enhanced audio with dimensions matching the original input. The skip connections directly transfer feature maps from the encoder to the corresponding decoder layers, enabling the integration of low-level and high-level information to facilitate more accurate detail restoration.
Some U-Net-based speech enhancement models rely on ordinary convolution operations in the encoder and decoder, which may overlook speech contextual information and result in the loss of detailed features. To address these problems, we incorporated the EMA mechanism into the U-Net-based network. This approach focuses on the detailed speech features, extracts local features at different scales, and captures more contextual information, thereby enabling more effective speech enhancement.
3. Network Model Structure
The SGM-EMA model is built on the Noise Conditional Score Network++ (NCSN++) architecture [41] and uses a multi-resolution U-Net-based structure. Previous studies [52] have shown that this type of structure has strong capabilities in tasks such as generation and segmentation. In Figure 4, we show feature maps at different resolutions, annotate their spatial sizes and number of channels, and use arrows to illustrate the transformations between feature maps.
The main body of the network consists of symmetrical downsampling and upsampling paths, along with skip connections. The input and output layers use Conv2D layers with 3 × 3 convolution kernels and a stride of 1. The residual block structure derived from the BigGAN architecture [53] is embedded in both the downsampling and upsampling blocks. Each residual block consists of the same Conv2D layer as above, group normalization [54], LeakyReLU activation function, and either upsampling or downsampling layers based on finite impulse response (FIR) filters [55]. Each downsampling path contains two residual blocks, and each upsampling path contains three residual blocks [56]. A 1 × 1 Conv2D layer is employed to facilitate the progressive transformation of feature dimensionality [57]. The network performs symmetrical dimensionality reduction and expansion through feature maps of specific resolutions. The output dimensions of the downsampling paths are (256, 256, 128), (128, 128, 128), (64, 64, 256), (32, 32, 256), (16, 16, 256), (8, 8, 256), and (4, 4, 256). To enhance feature fusion, an EMA mechanism is added to the bottleneck layer and the 16 × 16 and 64 × 64 resolution layers. The overall architecture supports efficient gradient propagation and multi-scale feature aggregation through the combination of residual connections and attention mechanisms, showing significant performance advantages in speech enhancement tasks.
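As an illustration of the block structure described above, the sketch below shows a BigGAN-style residual block with GroupNorm, LeakyReLU, time-embedding conditioning, and optional resampling. Plain average pooling and nearest-neighbor upsampling stand in for the FIR-filter resampling of [55], and the channel and embedding sizes are assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlockSketch(nn.Module):
    """Simplified residual block: GroupNorm -> LeakyReLU -> 3x3 Conv, conditioned on a
    time embedding, with optional up/downsampling applied to both paths."""
    def __init__(self, in_ch, out_ch, temb_dim, resample=None):
        super().__init__()
        self.resample = resample                       # None, "up", or "down"
        self.norm1 = nn.GroupNorm(32, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.temb_proj = nn.Linear(temb_dim, out_ch)   # injects the Fourier time embedding
        self.norm2 = nn.GroupNorm(32, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)        # 1x1 conv to match channel counts

    def _resample(self, x):
        if self.resample == "down":
            return F.avg_pool2d(x, 2)                  # stands in for FIR downsampling [55]
        if self.resample == "up":
            return F.interpolate(x, scale_factor=2, mode="nearest")
        return x

    def forward(self, x, temb):
        h = self._resample(F.leaky_relu(self.norm1(x), 0.2))
        x = self._resample(x)
        h = self.conv1(h)
        h = h + self.temb_proj(temb)[:, :, None, None]
        h = self.conv2(F.leaky_relu(self.norm2(h), 0.2))
        return (h + self.skip(x)) / 2 ** 0.5           # scaled residual sum
```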
To enhance the model’s temporal perception, we used Fourier embeddings [58] to map the time scalar t into an m-dimensional vector of sinusoidal features, encoding temporal information in a representation suitable for neural network processing. This vector was injected into each residual block, as shown in Figure 4, so that the model could jointly consider the interaction between temporal dynamics and spatial features. This temporal encoding mechanism not only allowed the network to model both the current state and the overall temporal trend but also improved speech quality and generalization.
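A minimal sketch of such a time embedding is given below, assuming a Gaussian random-feature (Fourier) variant as used in NCSN++-style networks; the embedding dimension m and the frequency scale are illustrative choices.

```python
import math
import torch
import torch.nn as nn

class FourierTimeEmbedding(nn.Module):
    """Maps a scalar diffusion time t in [0, 1] to an m-dimensional sinusoidal vector."""
    def __init__(self, embed_dim=256, scale=16.0):
        super().__init__()
        # Fixed random frequencies; registered as a buffer so they are not trained.
        self.register_buffer("freqs", torch.randn(embed_dim // 2) * scale)

    def forward(self, t):
        # t: tensor of shape (batch,)
        angles = t[:, None] * self.freqs[None, :] * 2.0 * math.pi
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (batch, embed_dim)
```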
4. Experimental Setup
4.1. Dataset
We used two datasets, VB-DMD and TIMIT-TUT, as detailed in Table 1. The performance of the proposed model was evaluated with a dual-dataset protocol: the model was trained on one dataset (VB-DMD) and tested both on its matched test set and on the other dataset (TIMIT-TUT). This experimental design assesses the model’s adaptability to data with different distribution characteristics and thus reflects its generalization performance more realistically.
(a). VB-DMD: The experiments used the publicly available VB-DMD dataset [59], which is widely used in speech enhancement and denoising research.
(b). TIMIT-TUT: The TIMIT-TUT dataset was created by combining the complete test set of the acoustic–phonetic continuous speech corpus jointly constructed by Texas Instruments (TI), the Massachusetts Institute of Technology (MIT), and the Stanford Research Institute (SRI) [60] with noise signals from the TUT dataset. The TUT noise dataset covers a variety of sound events in daily environments, such as traffic noise, human voices, and animal calls.
4.2. Hyperparameter Configuration
The stiffness γ and the noise-schedule parameters σ_min and σ_max in Equations (2) and (7) were kept fixed across all experiments. The Adam optimizer was used for training with a fixed learning rate and a batch size of 8. An exponential moving average of the model weights with a decay rate of 0.999 was maintained and used for sampling [61], and inference used the PC sampler. The number of reverse steps of the SDE was set to N = 30 [56], and the step size r of the annealed Langevin dynamics in the corrector was set to 0.5.
The U-Net network had 14 layers (7 layers for the encoder and 7 layers for the decoder). The resolution and number of channels of the encoder and decoder were symmetrical. We show the number of channels and resolution of the encoder in Table 2.
The configurations used in our experiments are shown in Table 3.
4.3. Baseline
We compared the proposed method with three discriminative baselines (MetricGAN+ [62], SERGAN [63], and CMGAN [64]) and two generative baselines (RVAE [65] and CDiffuse [66]). Except for SERGAN, all baselines were evaluated using the pre-trained models provided with the publicly available code of the corresponding papers. SERGAN was retrained from scratch on the VB-DMD dataset; all methods were then evaluated on both the VB-DMD and TIMIT-TUT test sets.
4.4. Evaluation Metrics
To evaluate the performance of the proposed method, three objective metrics were adopted: PESQ [67], ESTOI [68], and SI-SDR [69].
(a). PESQ: This metric assesses perceptual speech quality and is commonly used in voice communication and speech enhancement. It produces scores ranging from −0.5 to 4.5, where a higher score indicates better speech quality and a lower score indicates poor speech quality.
(b). ESTOI: This is an objective indicator for evaluating speech intelligibility in speech enhancement systems. ESTOI provides scores between 0 and 1, where values closer to 1 indicate better speech intelligibility, and values closer to 0 indicate poorer speech intelligibility.
(c). SI-SDR: This metric evaluates signal reconstruction accuracy in tasks such as speech enhancement, denoising, and source separation. Values below 10 dB suggest significant distortion and poor reconstruction, scores between 10 and 20 dB indicate acceptable quality with moderate artifacts, and values above 20 dB imply high-fidelity reconstruction with minimal distortion or noise.
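For reference, the SI-SDR metric can be computed as in the minimal sketch below, following its standard definition [69]; this is illustrative code, not necessarily the evaluation implementation used in our experiments.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the optimally scaled target.
    alpha = torch.dot(estimate, reference) / (torch.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * torch.log10(torch.dot(target, target) / (torch.dot(noise, noise) + eps) + eps)
```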
4.5. Ablation Study
To evaluate the effectiveness of the Euler–Maruyama method and the reverse diffusion method in the predictor component of the PC sampler, as well as the contribution of the EMA mechanism to the performance of the speech enhancement model, we conducted a series of ablation studies. By selectively disabling or replacing key components of the predictor and the EMA module, we compared multiple model configurations on the VB-DMD and TIMIT-TUT datasets. The results, presented in Table 4 and Table 5 as means ± standard deviations, demonstrate that the reverse diffusion method was more effective than the Euler–Maruyama method and that the inclusion of the EMA mechanism further improved overall model performance. Specifically, removing the reverse diffusion predictor led to decreases in the SI-SDR, ESTOI, and PESQ scores, highlighting the importance of accurately solving the reverse SDE to generate high-quality enhanced speech. The drop in ESTOI reflects a reduction in speech intelligibility, while decreases in PESQ and SI-SDR indicate losses in perceptual quality and signal fidelity, respectively. Similarly, removing the EMA mechanism also caused performance degradation, confirming its critical role in ensuring model stability and convergence. In addition, the ESTOI scores consistently increased with the inclusion of each component, particularly the reverse diffusion method and EMA. When both were applied together, the highest ESTOI score of 0.91 was achieved, indicating a substantial improvement in intelligibility. These findings confirm the necessity of incorporating both techniques into the proposed framework, as they not only improved signal quality and perceptual clarity but also ensured that the enhanced speech remained intelligible, an essential factor for real-world applications such as voice communication and assistive technologies.
5. Experimental Results and Discussion
5.1. Experimental Results
We used matched training and testing datasets—specifically, we trained the model on the VB-DMD training set and evaluated it on the corresponding test set; the experimental results are presented in Table 6. The average values of all metrics in Table 6 are visualized in Figure 5a. RVAE is an unsupervised speech enhancement method trained solely on clean speech (VB). The results show that although our SGM-EMA model did not outperform CMGAN in terms of the PESQ metric, this was largely due to CMGAN’s specifically designed PESQ optimization module in its discriminator, as well as the use of a multi-domain joint loss function. These design choices effectively suppress distortion, allowing CMGAN to achieve higher PESQ scores. Nevertheless, SGM-EMA still outperformed other models and surpassed the discriminative model SERGAN, demonstrating strong speech restoration capabilities. Compared to other baseline methods, such as RVAE and CDiffuse, SGM-EMA achieved significant improvements in both the PESQ and SI-SDR metrics.
To further validate the statistical reliability of the results, we report the 95% confidence intervals for each evaluation metric (values in parentheses in Table 6). Although the confidence intervals of SGM-EMA for PESQ and ESTOI metrics partially overlapped with those of baseline models, like CMGAN and SERGAN, our model demonstrated systematic superiority: it achieved higher mean values across all metrics while maintaining narrower confidence intervals, highlighting its exceptional speech recovery capability. Particularly for the SI-SDR metric, SGM-EMA not only sustained superior average performance but also exhibited lower variance, demonstrating remarkable stability. These results not only verify the outstanding performance of SGM-EMA in speech enhancement tasks but also statistically confirm its reliability, establishing a solid foundation for further advancements in speech enhancement technology.
We also evaluated model performance under mismatched conditions, in which the models were trained on the VB-DMD dataset and evaluated on the TIMIT-TUT dataset. The TIMIT-TUT data were tested using the pretrained models provided in the official codebases of the respective papers (except for SGM-EMA and SERGAN, as noted in Section 4.3). The experimental results are summarized in Table 7. To provide a more intuitive comparison, the average values of each metric in Table 7 are visualized in Figure 5b. As expected, overall performance declined compared to the matched-condition results in Table 6. This drop is primarily attributed to the fact that the characteristics of the test set were not seen during training: the differences in signal characteristics between the VB-DMD training data and the TIMIT-TUT test set are significant enough to degrade the evaluation scores.
Nevertheless, under the mismatched conditions, our SGM-EMA model outperformed all baseline methods on every evaluation metric. Although its scores on the TIMIT-TUT test set were lower than those on the VB-DMD test set, SGM-EMA still achieved the best overall results, highlighting its robustness to unseen noise characteristics and indicating strong generalization ability.
To further validate the statistical reliability of these results, we also report the 95% confidence intervals for each evaluation metric, as shown in the parentheses in Table 7. SGM-EMA consistently achieved the best average scores across all evaluation metrics under the mismatched conditions. While some confidence intervals partially overlap, the performance trend suggests an improvement over the baseline methods, especially for the SI-SDR and ESTOI metrics.
5.2. Trade-Offs in Model Design
Our model utilizes a deep U-Net with multiple resolution levels, a progressive growth path, and attention modules. While this complexity enables high-quality reconstruction and better modeling of global dependencies, it also leads to increased computational costs and slower inference times. We employed a score-based generative objective to encourage the generation of natural, manifold-aligned speech. This approach typically generalizes better to unseen noise types, as demonstrated by the results under mismatched conditions in Table 7. However, this comes at the cost of waveform fidelity, occasionally leading to the generation of artifacts under extreme noise conditions. The introduction of attention modules at specific resolutions improves global dependency modeling but increases the number of model parameters and inference complexity. The choice of group norm over alternatives like batch norm is due to its robustness to small batch sizes, which is critical for high-resolution spectrogram modeling. However, compared to the more lightweight batch norm, it introduces a small amount of additional computational overhead.
6. Conclusions
We propose a novel speech enhancement method that integrates an EMA mechanism with a U-Net architecture, trained and optimized as a score-based diffusion model. The experimental results show that under matched training and testing conditions, our method achieved better performance than other generative models and outperformed the comparison methods on both the ESTOI and SI-SDR metrics, demonstrating its strong speech enhancement capability. Under mismatched conditions, despite some performance degradation relative to the matched scenario, our approach still outperformed all competing methods across all evaluation metrics, demonstrating remarkable generalization ability. Ablation studies further confirmed the importance of the reverse diffusion predictor and the EMA mechanism, revealing their critical roles in improving both enhancement quality and model stability.
For future research, we plan to explore multimodal information fusion to further enhance speech enhancement performance. For instance, we will investigate incorporating visual information (such as speaker lip movement features) to combine the complementary advantages of audio and visual modalities, enabling the model to capture more comprehensive speech cues and thereby achieving better enhancement results.
Conceptualization, Y.W. and Z.L.; methodology, Z.L.; software, Z.L.; validation, Y.W., Z.L. and H.H.; formal analysis, Y.W.; investigation, Z.L.; writing—original draft preparation, Y.W. and Z.L.; writing—review and editing, Y.W., Z.L. and H.H.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.
The authors declare no conflicts of interest.
The following abbreviations are used in this manuscript:
EMA | Efficient multi-scale attention |
VB-DMD | VoiceBank-DEMAND |
VB | VoiceBank |
DMD | DEMAND |
TIMIT | The DARPA TIMIT Acoustic–Phonetic Continuous Speech Corpus |
TUT | TUT Sound Events 2017 |
PESQ | Perceptual evaluation of speech quality |
ESTOI | Extended short-time objective intelligibility |
SI-SDR | Scale-invariant signal-to-distortion ratio |
T-F | Time–frequency |
D | Discriminant models |
G | Generative models |
SDE | Stochastic differential equation |
PC | Predictor–corrector |
U-Net | Convolutional Networks for Biomedical Image Segmentation |
NCSN++ | Noise Conditional Score Network++ |
Conv2D | Two-Dimensional Convolution |
FIR | Finite impulse response |
TI | Texas Instruments |
MIT | Massachusetts Institute of Technology |
SRI | Stanford Research Institute |
SNR | Signal-to-noise ratio |
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 Speech signal diffusion process.
Figure 2 EMA mechanism module structure diagram.
Figure 3 U-Net basic structure diagram.
Figure 4 Network model structure diagram.
Figure 5 (a) Average values of the evaluation metrics in Table 6 (matched conditions); (b) average values of the evaluation metrics in Table 7 (mismatched conditions).
Datasets used in the experiment.
Dataset Type | SNR | Quantity | Number of Speakers | Sampling Rate |
---|---|---|---|---|
VB-DMD Training Set | 15 dB, 10 dB, 5 dB, 0 dB | 11,572 | 28 | 16 kHz |
VB-DMD Test Set | 17.5 dB, 12.5 dB, 7.5 dB, 2.5 dB | 824 | 2 | 16 kHz |
TIMIT-TUT Test Set | 0–20 dB uniform sampling | 1344 | 168 | 16 kHz |
U-Net encoder layer parameter table.
Number of Encoder Layers | Resolution | Channels | Normalization |
---|---|---|---|
1 | 256 × 256 | 128 | Group normalization |
2 | 128 × 128 | 128 | |
3 | 64 × 64 | 256 | |
4 | 32 × 32 | 256 | |
5 | 16 × 16 | 256 | |
6 | 8 × 8 | 256 | |
7 | 4 × 4 | 256 |
Experimental configuration table.
Name | Specific Configuration |
---|---|
Operating System | Linux-5.15.0-86-generic-x86_64-with-glibc2.35 |
Processor | NVIDIA vGPU-32 GB |
Memory | 32 GB |
OS Bit | 64-bit |
Programming Language | CPython 3.12.3 |
Dataset | VB-DMD and TIMIT-TUT |
Deep Learning Framework | PyTorch Lightning 2.1.4 |
Ablation study results on the VB-DMD dataset. (Model A: Euler–Maruyama method; Model B: reverse diffusion method; Model C: EMA mechanism).
Experiment Number | Model A | Model B | Model C | PESQ (↑) | ESTOI (↑) | SI-SDR (↑) |
---|---|---|---|---|---|---|
1 | × | × | × | 2.12 ± 0.49 | 0.76 ± 0.09 | 12.8 ± 4.1 |
2 | × | × | √ | 2.24 ± 0.51 | 0.80 ± 0.09 | 14.1 ± 4.0 |
3 | √ | × | × | 2.57 ± 0.65 | 0.78 ± 0.10 | 16.3 ± 3.7 |
4 | √ | × | √ | 2.69 ± 0.63 | 0.86 ± 0.10 | 17.3 ± 3.5 |
5 | × | √ | × | 2.72 ± 0.63 | 0.85 ± 0.10 | 16.6 ± 3.1 |
6 | × | √ | √ | 2.89 ± 0.80 | 0.91 ± 0.07 | 19.6 ± 3.5 |
↑: indicates that a higher value corresponds to better performance.
Ablation study results on the TIMIT-TUT dataset. (Model A: Euler–Maruyama method; Model B: reverse diffusion method; Model C: EMA mechanism).
Experiment Number | Model A | Model B | Model C | PESQ (↑) | ESTOI (↑) | SI-SDR (↑) |
---|---|---|---|---|---|---|
1 | × | × | × | 1.74 ± 0.40 | 0.72 ± 0.11 | 11.2 ± 4.1 |
2 | × | × | √ | 1.81 ± 0.44 | 0.75 ± 0.12 | 12.1 ± 4.4 |
3 | √ | × | × | 2.40 ± 0.56 | 0.71 ± 0.10 | 15.1 ± 4.2 |
4 | √ | × | √ | 2.58 ± 0.53 | 0.76 ± 0.09 | 16.3 ± 3.5 |
5 | × | √ | × | 2.63 ± 0.53 | 0.80 ± 0.11 | 15.8 ± 4.1 |
6 | × | √ | √ | 2.79 ± 0.66 | 0.86 ± 0.10 | 17.1 ± 3.5 |
↑: indicates that a higher value corresponds to better performance.
Speech enhancement results under matched evaluation conditions, grouped by model type (D: discriminative; G: generative). Except for SGM-EMA and SERGAN, all methods used the pre-trained models published in their original papers; all were evaluated on the same test set (VB-DMD) over five independent runs. Values are means ± standard deviations, with 95% confidence intervals in parentheses.
Method | Type | Training Set | Test Set | PESQ (↑) | ESTOI (↑) | SI-SDR (↑) |
---|---|---|---|---|---|---|
MetricGAN+ [ | D | VB-DMD | VB-DMD | 3.13 ± 0.55 (±0.68) | 0.83 ± 0.11 (±0.14) | 8.5 ± 3.6 (±4.5) |
SERGAN [ | D | VB-DMD | VB-DMD | 2.62 ± 0.63 (±0.78) | 0.85 ± 0.06 (±0.08) | 17.2 ± 3.2 (±4.0) |
CMGAN [ | D | VB-DMD | VB-DMD | 3.41 ± 0.68 (±0.84) | 0.88 ± 0.10 (±0.12) | 18.4 ± 4.2 (±5.2) |
RVAE [ | G | VB | VB-DMD | 2.48 ± 0.55 (±0.68) | 0.81 ± 0.11 (±0.14) | 17.1 ± 5.0 (±6.2) |
CDiffuse [ | G | VB-DMD | VB-DMD | 2.46 ± 0.51 (±0.63) | 0.79 ± 0.11 (±0.14) | 12.6 ± 5.0 (±6.2) |
SGM-EMA [ours] | G | VB-DMD | VB-DMD | 2.89 ± 0.80 (±0.99) | 0.91 ± 0.07 (±0.09) | 19.6 ± 3.5 (±4.3) |
↑: indicates that a higher value corresponds to better performance.
Speech enhancement results under mismatched conditions, grouped by model type (D: discriminative; G: generative). Except for SGM-EMA and SERGAN, all methods used the pre-trained models published in their original papers and performed five independent runs on the same test set (TIMIT-TUT). Values are means ± standard deviations, with 95% confidence intervals in parentheses.
Method | Type | Training Set | Test Set | PESQ (↑) | ESTOI (↑) | SI-SDR (↑) |
---|---|---|---|---|---|---|
MetricGAN+ [ | D | VB-DMD | TIMIT-TUT | 1.54 ± 0.53 (±0.66) | 0.62 ± 0.12 (±0.15) | 4.1 ± 3.1 (±3.9) |
SERGAN [ | D | VB-DMD | TIMIT-TUT | 2.17 ± 0.58 (±0.72) | 0.75 ± 0.07 (±0.09) | 10.0 ± 2.8 (±3.5) |
CMGAN [ | D | VB-DMD | TIMIT-TUT | 2.65 ± 0.60 (±0.75) | 0.75 ± 0.09 (±0.11) | 16.2 ± 3.6 (±4.5) |
RVAE [ | G | VB | TIMIT-TUT | 2.24 ± 0.49 (±0.61) | 0.72 ± 0.11 (±0.14) | 13.8 ± 4.8 (±6.0) |
CDiffuse [ | G | VB-DMD | TIMIT-TUT | 1.71 ± 0.42 (±0.52) | 0.69 ± 0.11 (±0.14) | 10.6 ± 3.2 (±4.0) |
SGM-EMA [ours] | G | VB-DMD | TIMIT-TUT | 2.79 ± 0.66 (±0.82) | 0.86 ± 0.10 (±0.12) | 17.1 ± 3.5 (±4.4) |
↑: indicates that a higher value corresponds to better performance.
1. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.; 1984; 32, pp. 1109-1121. [DOI: https://dx.doi.org/10.1109/TASSP.1984.1164453]
2. Soares, A.S.P.; Parreira, W.D.; Souza, E.G.; do Nascimento, C.d.D.; Almeida, S.J.M.D. Voice activity detection using generalized exponential kernels for time and frequency domains. IEEE Trans. Circuits Syst. I Regul. Pap.; 2019; 66, pp. 2116-2123. [DOI: https://dx.doi.org/10.1109/TCSI.2019.2895771]
3. May, A.; Garakani, A.B.; Lu, Z.; Guo, D.; Liu, K.; Bellet, A.; Fan, L.; Collins, M.; Hsu, D.; Kingsbury, B.
4. Boll, S. Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Trans. Acoust. Speech Signal Process; 1979; 27, pp. 113-120. [DOI: https://dx.doi.org/10.1109/TASSP.1979.1163209]
5. Yadava, T.; Nagaraja, B.G.; Jayanna, H.S. A spatial procedure to spectral subtraction for speech enhancement. Multimed. Tools Appl.; 2022; 81, pp. 23633-23647.
6. Ioannides, G.; Rallis, V. Real-Time Speech Enhancement Using Spectral Subtraction with Minimum Statistics and Spectral Floor. arXiv; 2023; arXiv: 2302.10313
7. Li, C.; Jiang, T.; Wu, S. Single-channel speech enhancement based on improved frame-iterative spectral subtraction in the modulation domain. China Commun.; 2021; 18, pp. 100-115. [DOI: https://dx.doi.org/10.23919/JCC.2021.09.009]
8. Ephraim, Y.; Van Trees, H.L. A Signal Subspace Approach for Speech Enhancement. IEEE Trans. Speech Audio Process; 1995; 3, pp. 251-266. [DOI: https://dx.doi.org/10.1109/89.397090]
9. Asano, F.; Hayamizu, S.; Yamada, T.; Suzuki, Y.; Sone, T. Speech enhancement based on the subspace method. IEEE Trans. Speech Audio Process.; 2000; 8, pp. 497-507. [DOI: https://dx.doi.org/10.1109/89.861364]
10. Chen, J.; Benesty, J.; Huang, Y.; Doclo, S. New Insights into the Noise Reduction Wiener Filter. IEEE Trans. Audio Speech Lang. Process; 2006; 14, pp. 1218-1234. [DOI: https://dx.doi.org/10.1109/TSA.2005.860851]
11. Abd El-Fattah, M.A.; Dessouky, M.I.; Abbas, A.M.; Diab, S.M.; El-Rabaie, S.M.; Al-Nuaimy, W.; Alshebeili, S.A.; Abd El-Samie, F.E. Speech enhancement with an adaptive Wiener filter. Int. J. Speech Technol.; 2014; 17, pp. 53-64. [DOI: https://dx.doi.org/10.1007/s10772-013-9205-5]
12. Jadda, A.; Prabha, I.S. Speech enhancement via adaptive Wiener filtering and optimized deep learning framework. Int. J. Wavelets Multiresolution Inf. Process.; 2023; 21, 2250032. [DOI: https://dx.doi.org/10.1142/S0219691322500321]
13. Garg, A. Speech enhancement using long short term memory with trained speech features and adaptive wiener filter. Multimed. Tools Appl.; 2023; 82, pp. 3647-3675. [DOI: https://dx.doi.org/10.1007/s11042-022-13302-3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35855772]
14. Jaiswal, R.K.; Yeduri, S.R.; Cenkeramaddi, L.R. Single-channel speech enhancement using implicit Wiener filter for high-quality speech communication. Int. J. Speech Technol.; 2022; 25, pp. 745-758. [DOI: https://dx.doi.org/10.1007/s10772-022-09987-4]
15. Martin, R. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Trans. Speech Audio Process; 2005; 13, pp. 845-856. [DOI: https://dx.doi.org/10.1109/TSA.2005.851927]
16. Wang, Z.; Zhang, T.; Shao, Y.; Ding, B. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement. Appl. Acoust.; 2021; 172, 107647. [DOI: https://dx.doi.org/10.1016/j.apacoust.2020.107647]
17. Ephraim, Y.; Malah, D.; Juang, B.H. On the application of hidden Markov models for enhancing noisy speech. IEEE Trans. Acoust. Speech Signal Process.; 1989; 37, pp. 1846-1856. [DOI: https://dx.doi.org/10.1109/29.45532]
18. Kundu, A.; Chatterjee, S.; Murthy, A.S.; Sreenivas, T.V. GMM based Bayesian approach to speech enhancement in signal/transform domain. Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing; Las Vegas, NV, USA, 31 March–4 April 2008.
19. Ochieng, P. Deep neural network techniques for monaural speech enhancement and separation: State of the art analysis. Artif. Intell. Rev.; 2023; 56, pp. 3651-3703. [DOI: https://dx.doi.org/10.1007/s10462-023-10612-2]
20. Ribas, D.; Miguel, A.; Ortega, A.; Lleida, E. Wiener filter and deep neural networks: A well-balanced pair for speech enhancement. Appl. Sci.; 2022; 12, 9000. [DOI: https://dx.doi.org/10.3390/app12189000]
21. Zhang, W.; Saijo, K.; Wang, Z.-Q.; Watanabe, S.; Qian, Y. Toward universal speech enhancement for diverse input conditions. Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2023); Taipei, Taiwan, 16–20 December 2023.
22. Tan, K.; Wang, D.L. Towards model compression for deep learning based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 29, pp. 1785-1794. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3082282]
23. Jannu, C.; Vanambathina, S.D. Shuffle attention u-net for speech enhancement in time domain. Int. J. Image Graph.; 2024; 24, 2450043. [DOI: https://dx.doi.org/10.1142/S0219467824500438]
24. Wang, K.; He, B.; Zhu, W.P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain. Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Toronto, ON, Canada, 6–11 June 2021; pp. 7098-7102.
25. Zhang, Q.; Song, Q.; Ni, Z.; Nicolson, A.; Li, H. Time-frequency attention for monaural speech enhancement. Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Online, 7–13 May 2022; pp. 7852-7856.
26. Chen, Y.T.; Wu, Z.T.; Hung, J.W. Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform. Appl. Sci.; 2023; 13, 5992. [DOI: https://dx.doi.org/10.3390/app13105992]
27. Tang, C.; Luo, C.; Zhao, Z.; Xie, W.; Zeng, W. Joint Time-Frequency and Time Domain Learning for Speech Enhancement. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence; Montreal, QC, Canada, 19–27 August 2021; pp. 3816-3822.
28. Zhou, L.; Gao, Y.; Wang, Z.; Li, J.; Zhang, W. Complex spectral mapping with attention based convolution recurrent neural network for speech enhancement. arXiv; 2021; arXiv: 2104.05267
29. Pang, J.; Li, H.; Jiang, T.; Li, J.; Zhang, W. A Dual-Channel End-to-End Speech Enhancement Method Using Complex Operations in the Time Domain. Appl. Sci.; 2023; 13, 7698. [DOI: https://dx.doi.org/10.3390/app13137698]
30. Pandey, A.; Wang, D.L. Dense CNN with self-attention for time-domain speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 2, pp. 1270-1279. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3064421]
31. Saleem, N.; Gunawan, T.S.; Dhahbi, S.; Bourouis, S. Time domain speech enhancement with CNN and time-attention transformer. Digit. Signal Process.; 2024; 147, 104408. [DOI: https://dx.doi.org/10.1016/j.dsp.2024.104408]
32. Phan, H.; Le Nguyen, H.; Chén, O.Y.; Koch, P.; Duong, N.Q.K.; McLoughlin, I.; Mertins, A. Self-attention generative adversarial network for speech enhancement. Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Toronto, ON, Canada, 6–11 June 2021; pp. 7103-7107.
33. Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based metric GAN for speech enhancement. arXiv; 2022; arXiv: 2203.15149
34. Hamdan, M.; Punjabi, P. Generative Adversarial Networks for Speech Enhancement. Proceedings of the 2024 7th International Conference on Signal Processing and Information Security (ICSPIS); Dubai, United Arab Emirates, 12–14 November 2024; pp. 1-5.
35. Xiang, Y.; Højvang, J.L.; Rasmussen, M.H.; Christensen, M.G. A two-stage deep representation learning-based speech enhancement method using variational autoencoder and adversarial training. IEEE/ACM Trans. Audio Speech Lang. Process.; 2023; 32, pp. 164-177. [DOI: https://dx.doi.org/10.1109/TASLP.2023.3321975]
36. Halimeh, M.; Kellermann, W. Complex-valued spatial autoencoders for multichannel speech enhancement. Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Online, 7–13 May 2022; pp. 261-265.
37. Zhang, J.; Jayasuriya, S.; Berisha, V. Restoring degraded speech via a modified diffusion model. arXiv; 2021; arXiv: 2104.11347
38. Serrà, J.; Pascual, S.; Pons, J.; Araz, R.O.; Scaini, D. Universal speech enhancement with score-based diffusion. arXiv; 2022; arXiv: 2206.03065
39. Richter, J.; Frintrop, S.; Gerkmann, T. Audio-visual speech enhancement with score-based generative models. Proceedings of the Speech Communication, 15th ITG Conference; Aachen, Germany, 20–22 September 2023; pp. 275-279.
40. Yang, M.; Zhang, C.; Xu, Y.; Wang, H.; Raj, B.; Yu, D. Usee: Unified speech enhancement and editing with conditional diffusion models. Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Seoul, Republic of Korea, 14–19 April 2024; pp. 7125-7129.
41. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv; 2020; arXiv: 2011.13456
42. Anderson, B.D.O. Reverse-time diffusion equation models. Stoch. Process. Their Appl.; 1982; 12, pp. 313-326. [DOI: https://dx.doi.org/10.1016/0304-4149(82)90051-5]
43. Särkkä, S.; Solin, A. Applied Stochastic Differential Equations; Cambridge University Press: Cambridge, UK, 2019.
44. Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput.; 2011; 23, pp. 1661-1674. [DOI: https://dx.doi.org/10.1162/NECO_a_00142]
45. Hyvärinen, A.; Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res.; 2005; 6, pp. 695-709.
46. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv; 2019; arXiv: 1907.05600
47. Lin, J.; van Wijngaarden, A.J.L.; Wang, K.C.; Smith, M.C. Speech enhancement using multi-stage self-attentive temporal convolutional networks. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 29, pp. 3440-3450. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3125143]
48. Xu, X.; Hao, J. Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement. arXiv; 2021; arXiv: 2101.05975
49. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. Proceedings of the ICASSP 2023—023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Rhodes, Greece, 4–10 June 2023; pp. 1-5.
50. Lin, Z.; Chen, X.; Wang, J. MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement. arXiv; 2024; arXiv: 2406.04589
51. Ahmed, S.; Chen, C.W.; Ren, W.; Li, C.-J.; Chu, E.; Chen, J.-C.; Hussain, A.; Wang, H.-M.; Tsao, Y.; Hou, J.-C. Deep complex u-net with conformer for audio-visual speech enhancement. arXiv; 2023; arXiv: 2309.11059
52. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention, Proceedings of the MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18 Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234-241.
53. Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv; 2018; arXiv: 1809.11096
54. Wu, Y.; He, K. Group Normalization. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.
55. Zhang, R. Making convolutional networks shift-invariant again. Proceedings of the International Conference on Machine Learning; Long Beach, CA, USA, 9–15 June 2019; pp. 7324-7334.
56. Richter, J.; Welker, S.; Lemercier, J.M.; Gerkmann, T. Speech enhancement and dereverberation with diffusion-based generative models. IEEE/ACM Trans. Audio Speech Lang. Process.; 2023; 31, pp. 2351-2364. [DOI: https://dx.doi.org/10.1109/TASLP.2023.3285241]
57. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Online, 14–19 June 2020; pp. 8110-8119.
58. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv; 2017; arXiv: 1706.03762
59. Botinhao, C.V.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. Proceedings of the 9th ISCA Speech Synthesis Workshop; Sunnyvale, CA, USA, 13–16 September 2016; pp. 159-165.
60. Zue, V.; Seneff, S.; Glass, J. Speech database development at MIT: TIMIT and beyond. Speech Commun.; 1990; 9, pp. 351-356. [DOI: https://dx.doi.org/10.1016/0167-6393(90)90010-7]
61. Song, Y.; Ermon, S. Improved techniques for training score-based generative models. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 12438-12448.
62. Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. Metricgan+: An improved version of metricgan for speech enhancement. arXiv; 2021; arXiv: 2104.03538
63. Baby, D.; Verhulst, S. Sergan: Speech enhancement using relativistic generative adversarial networks with gradient penalty. Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Brighton, UK, 12–17 May 2019; pp. 106-110.
64. Abdulatif, S.; Cao, R.; Yang, B. Cmgan: Conformer-based metric-gan for monaural speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process.; 2024; 32, pp. 2477-2493. [DOI: https://dx.doi.org/10.1109/TASLP.2024.3393718]
65. Bie, X.; Leglaive, S.; Alameda-Pineda, X.; Girin, L. Unsupervised speech enhancement using dynamical variational autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process.; 2022; 30, pp. 2993-3007. [DOI: https://dx.doi.org/10.1109/TASLP.2022.3207349]
66. Lu, Y.J.; Wang, Z.Q.; Watanabe, S.; Richard, A.; Yu, C.; Tzao, Y. Conditional diffusion probabilistic model for speech enhancement. Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Online, 7–13 May 2022; pp. 7402-7406.
67. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs. Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing; Salt Lake City, UT, USA, 7–11 May 2001; Cat. No. 01CH37221 Volume 2, pp. 749-752.
68. Jensen, J.; Taal, C.H. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process.; 2016; 24, pp. 2009-2022. [DOI: https://dx.doi.org/10.1109/TASLP.2016.2585878]
69. Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR–half-baked or well done?. Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Brighton, UK, 12–17 May 2019; pp. 626-630.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The score-based diffusion model has made significant progress in the field of computer vision, surpassing the performance of generative models, such as variational autoencoders, and has been extended to applications such as speech enhancement and recognition. This paper proposes a U-Net architecture using a score-based diffusion model and an efficient multi-scale attention mechanism (EMA) for the speech enhancement task. The model leverages the symmetric structure of U-Net to extract speech features and captures contextual information and local details across different scales using the EMA mechanism, improving speech quality in noisy environments. We evaluate the method on the VoiceBank-DEMAND (VB-DMD) dataset and the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus–TUT Sound Events 2017 (TIMIT-TUT) dataset. The experimental results show that the proposed model performed well in terms of speech quality perception (PESQ), extended short-time objective intelligibility (ESTOI), and scale-invariant signal-to-distortion ratio (SI-SDR). Especially when processing out-of-dataset noisy speech, the proposed method achieved excellent speech enhancement results compared to other methods, demonstrating the model’s strong generalization capability. We also conducted an ablation study on the SDE solver and the EMA mechanism, and the results show that the reverse diffusion method outperformed the Euler–Maruyama method, and the EMA strategy could improve the model performance. The results demonstrate the effectiveness of these two techniques in our system. Nevertheless, since the model is specifically designed for Gaussian noise, its performance under non-Gaussian or complex noise conditions may be limited.