Abstract
The rapid development of deep learning has led to significant strides in image denoising research and has achieved advanced denoising performance in terms of distortion metrics. However, most denoising models that construct their loss functions from pixel-wise differences produce blurred edges or over-smoothed results that are unsatisfactory to human perception. Our approach to addressing this issue prioritizes visual perceptual quality and efficiently restores high-frequency details that may have been lost during pixel-wise denoising, while preserving the overall structure of the image. We introduce a structure preserved network to generate cost-effective initial predictions, which are subsequently incorporated into a conditional diffusion model as a constraint that closely aligns with the actual images. This allows us to estimate the distribution of clean images more accurately by diffusing from the residuals. We observe that, by maintaining image consistency in the initial prediction, we can use a residual diffusion model with lower complexity and fewer iterations to restore the detailed texture of the smoothed parts, ultimately leading to denoised samples that better match human visual perception. Our method is superior on perceptual metrics such as FID and maintains its performance even at high noise levels, preserving the sharp edges and texture features of the image while reducing computational costs and hardware requirements. This not only achieves the objective of denoising but also results in enhanced subjective visual quality.
Introduction
Image denoising is a low-level vision task of reconstructing a clean image by removing noise from a degraded image. It is not only one of the fundamental problems of image restoration, but also plays an important role in many high-level vision tasks, such as classification, segmentation and detection. Traditional denoising techniques, such as BM3D [1], LSSC [2], NCSR [3] and WNNM [4], rely on prior knowledge of the image structure or noise to recover as much information from the original images as possible. Subsequently, approaches based on discriminative learning were proposed to train image prior models, such as CSF [5] and TNRD [6]. Although these methods have made significant strides in enhancing both computational efficiency and denoising performance, they are constrained by the specified form of the prior and the need to set parameters manually. In recent years, with the development of deep learning techniques, many cutting-edge methods have been proposed that assemble networks from individual modules or building blocks, such as residual learning, dense connections, hierarchical structures, multi-stage frameworks and attention mechanisms, which have contributed significantly to the field of image denoising.
Deep learning-based denoising methods typically use a loss function to quantify the difference between the denoised image and the clean image, and the most prevalent loss function for this purpose is the mean squared error (MSE). However, denoising models relying on this pixel-wise reconstruction loss suffer from over-smoothing or the loss of useful visual detail in the denoised images when dealing with high levels of noise in the input. This is because these models predict clean images by averaging multiple potential samples; in other words, they output the expectation of the underlying clean image conditioned on the noisy input. The conditional expectation indeed reduces the variance of the noise and enhances denoising performance, but it also decreases the variance of the clean image details, inevitably leading to the loss of high-frequency information. In particular, when the image texture details resemble the noise distribution, the network may mistakenly classify these texture details as noise and remove them. In addition, these methods often prioritize the optimization of distortion metrics, such as PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) [39], while neglecting other image quality metrics that better align with human visual perception. Although PSNR is a popular image quality metric, it relies solely on MSE to measure image differences and fails to accurately capture differences in pixel-level details and edges. To obtain a more precise evaluation of denoised image quality, it is essential to incorporate other evaluation metrics in addition to PSNR in order to provide a comprehensive assessment.
Compared to the aforementioned approach, some methods attempt to sample directly from the distribution of clean images. For instance, generative adversarial networks (GANs) [7] have been utilized to facilitate image denoising, where adversarial training is employed to enhance denoising performance by sampling from the posterior distribution [8, 9, 14]. Although this approach mitigates the issue of weighted averaging, it has some limitations. The training process of GANs is more complex than traditional supervised learning, since it requires balancing the realism and the diversity of generated images through the adversarial loss. The adversarial loss often introduces artifacts that are not present in the original clean image and can cause training instability or vanishing gradients. Moreover, GAN-generated images tend to exhibit high diversity, which may not always be desirable in image denoising tasks; instead, consistency and controllability of the denoised images are often preferred.
In this work, we propose a detail-aware denoising model (PRTD), which reconstructs the denoised image with more details while maintaining structural consistency with the original image. PRTD is a two-stage image denoising model consisting of a structure preserved network and a residual diffusion model, as shown in Fig. 1. In the first stage, an end-to-end denoising network is trained with the MSE loss to predict an initial denoised result that preserves the primary structure of the underlying clean image. We do not use state-of-the-art transformer-based denoising models to maintain image consistency, because this would incur expensive computational costs. Instead, we introduce a new MLP-based image denoising network (UmlpNet) as the structure preserved network for rapidly generating initial predictions. In the second stage, we use the initial prediction as a condition for the diffusion probability model [10, 11]. By learning the residual distribution between the clean image and the initial prediction, the residual diffusion model is able to reconstruct the texture details that were lost in the first stage. Because the conditional images are already consistent with the original images, the denoising network has lower complexity than that of a standard diffusion model, and high-quality samples can therefore be generated at a lower computational cost. By integrating the two complementary methods, PRTD is able to effectively capture complex relationships between noisy input images and their corresponding clean images, generating realistic denoised images that retain critical details and structural consistency. Furthermore, we assess the quality of the generated images using the LPIPS [37] and FID [38] metrics, which have been shown to better reflect human perception.
Fig. 1 [Images not available. See PDF.]
Architecture of two-stage denoising model. The structure preserved network predicts an initial denoised result for the residual diffusion model to reconstruct finer texture details
Contributions of this work can be summarized as follows:
We propose a two-stage image denoising method consisting of a structure preserved network and a residual diffusion model (PRTD). It is capable of effectively restoring the high-frequency detail lost due to the pixel-wise reconstruction loss of denoising models, while preserving the main structure of the image. Even when dealing with high levels of noise, it effectively maintains both the structural consistency and the realism of detail in the denoised image. Experiments conducted on multiple datasets demonstrate that our proposed method produces denoised images with clear and realistic textures, leading to a significant improvement in perceptual quality metrics such as LPIPS and FID.
We propose a lightweight MLP-based structure preserved network to quickly obtain initial denoising results. The proposed UmlpNet ensures high-quality denoising results even at high noise levels while reducing the computational cost. UmlpNet provides a robust initial prediction for subsequent detail diffusion with little computational overhead, thereby enhancing the overall denoising performance of the proposed method.
Related work
Deep neural networks for image denoising
In recent years, many end-to-end approaches based on deep neural networks have shown remarkable performance in various image processing tasks. For example, DnCNN [12] used batch normalization and residual learning to speed up training and improve denoising performance, and can handle multiple tasks, including Gaussian denoising with unknown noise levels, super-resolution, and JPEG image deblocking. FFDNet [48] presented a fast and flexible architecture that denoises images with different noise levels using a single network, striking a good balance between inference speed and denoising performance. IRCNN [47] proposed a flexible and effective framework for image denoising tasks by integrating model-based optimization methods with a discriminative CNN denoiser. DRUNet [50] proposed a flexible and powerful deep CNN denoiser that is well suited for plug-and-play image restoration. CycleISP [62] introduced a framework that models the camera imaging pipeline in both the forward and reverse directions to generate more realistic synthetic noise data for training the denoising network, and it has demonstrated excellent performance in removing real image noise. MPRNet [63] introduced an enhanced multi-stage architecture that integrates high-level global features and local details for better performance. SwinIR [51] presented an architecture based on Swin transformers for image restoration, which enables the extraction of deep features to reconstruct high-quality images. Restormer [13] proposed an efficient transformer that achieves superior performance on multiple image restoration tasks through several key designs in its multi-head attention and feed-forward networks.
These end-to-end neural network approaches effectively ensure a consistent image structure and achieve state-of-the-art PSNR by improving network modules and loss functions. However, only distortion metrics are optimized in these approaches, while the crucial aspect of visual perceptual quality is ignored. Once trained with fixed weights, these networks produce deterministic outputs that behave like a weighted average of the many plausible noise-free samples, which causes an unavoidable loss of high-frequency information in the image.
Deep generative model for image restoration
Deep generative models, which are powerful deep learning models that can generate realistic images from noise or other inputs, have been increasingly used to tackle various image restoration tasks in recent years. Many GAN-based models have been proposed to address the challenges of image denoising, such as the lack of paired training data and the loss of details. For example, Guy et al. [14] proposed a novel constraint to the conditional generative adversarial network (CGAN) framework, which alleviates the difficulty of training with high-dimensional distributions and achieves high-quality denoising results with excellent perceptual quality while maintaining an acceptable level of distortion. Chen et al. [15] proposed a GAN-CNN framework for improving the performance of image blind denoising, where GAN was used to solve the key problem of building paired training datasets. In addition, autoregressive models [16, 17], variational autoencoders (VAEs) [18] and normalizing flows (NFs) [19, 20] have demonstrated remarkable image generation performance and have been effectively applied to various image processing tasks.
However, these generative modeling approaches still face several limitations and challenges. The use of adversarial loss in GANs [7] may lead to incomplete features and information loss in the final generated results. Additionally, GANs are often vulnerable to several training issues, such as training instability, gradient disappearance, and mode collapse. Compared to GANs, autoregressive models can avoid information loss but suffer from one-way bias, which can result in a lower quality of the generated samples. VAEs may not be efficient for high-dimensional data, and NFs need more computational resources and training time.
Diffusion-based image restoration
Diffusion probability models (DPMs) [10, 11] have recently shown remarkable outcomes in image generation. Consequently, numerous researchers have proposed employing posterior sampling for image restoration tasks. For instance, Saharia et al. [34] proposed a conditional diffusion model-based image super-resolution approach, which remarkably enhances image quality by using low-resolution images as a condition. Özdenizci et al. [21] introduced a patch-based diffusion modeling approach that allows for the restoration of images affected by various weather conditions (snow or rain, for example), by analyzing features such as brightness and contrast. Whang et al. [22] presented a stochastic refinement framework for image deblurring that improves perceptual quality without sacrificing image details. DPMs perform well in generating realistic images under a given condition, and faster samplers such as DDIM [23] and DPM-Solver [24] have emerged to generate high-quality samples in less time, but diffusion models still face the challenges of long training time and high hardware requirements.
Compared to end-to-end training approaches, diffusion models offer a promising solution to many challenging problems in image restoration by learning and capturing the complex relationships between image features and their underlying distributions. This success can be attributed to their ability to create complex mappings between input and output spaces. By generating samples from the posterior distribution, diffusion models can account for uncertainties in the imaging inverse problem, produce multiple samples and recover detailed textures masked by high-frequency information. As a result, high-quality imaging results with improved efficiency and generalization capabilities are produced.
Preliminaries
Luo et al. [25] have successfully employed the diffusion model to tackle image denoising tasks. The diffusion model is widely recognized as an effective image enhancement technique, especially for its ability to enhance image details. Before briefly describing the basic concepts of DPM, it should be emphasized that this paper adopts a modified formulation based on the continuous noise level of Chen et al. [26], in which the noise added during the forward diffusion process increases linearly with time. This enables us to sample the noise level directly at inference time without additional training of the model. Moreover, it enables us to adjust the number of iterations used in the inference process, thus allowing a trade-off between computation time and output quality.
DPM [10, 11] defines a forward Markovian diffusion process. Given data $y_0 \sim q(y_0)$ sampled from the real data distribution, Gaussian noise is repeatedly added to this sample over $T$ steps, resulting in a series of increasingly noisy samples $y_1, \ldots, y_T$ obtained by noise superposition. Each iteration of the forward diffusion process can be expressed as follows:

$$q(y_t \mid y_{t-1}) = \mathcal{N}\big(y_t;\ \sqrt{\alpha_t}\, y_{t-1},\ (1-\alpha_t)\mathbf{I}\big), \tag{1}$$

where $0 < \alpha_t < 1$ for all $t = 1, \ldots, T$. The noise schedule $\{\alpha_t\}_{t=1}^{T}$ is a hyperparameter that controls the variance of the noise added at each step. The latent variables $y_1, \ldots, y_T$ have the same dimensionality as the original data sample $y_0$. This means that we can skip the intermediate steps and directly obtain $y_t$ at any moment from $y_0$:

$$q(y_t \mid y_0) = \mathcal{N}\big(y_t;\ \sqrt{\gamma_t}\, y_0,\ (1-\gamma_t)\mathbf{I}\big), \tag{2}$$

where $\gamma_t = \prod_{i=1}^{t} \alpha_i$. Furthermore, with algebraic manipulation and completing the square, one can derive the posterior distribution of $y_{t-1}$ given $(y_0, y_t)$ as a Gaussian with mean $\mu$ and variance $\sigma^2$:

$$q(y_{t-1} \mid y_0, y_t) = \mathcal{N}\big(y_{t-1};\ \mu,\ \sigma^2\mathbf{I}\big), \tag{3}$$

where the mean $\mu = \frac{\sqrt{\gamma_{t-1}}\,(1-\alpha_t)}{1-\gamma_t}\, y_0 + \frac{\sqrt{\alpha_t}\,(1-\gamma_{t-1})}{1-\gamma_t}\, y_t$ and the variance $\sigma^2 = \frac{(1-\gamma_{t-1})(1-\alpha_t)}{1-\gamma_t}$. The inverse diffusion process recovers the original image from Gaussian noise. As the noise added at each step of the forward process is small, the inverse process can also be modeled as a Markov chain. This implies that it is possible to sample a partially noisy image at any time step and to define the inference process:

$$p_\theta(y_{0:T}) = p(y_T) \prod_{t=1}^{T} p_\theta(y_{t-1} \mid y_t). \tag{4}$$

Because the transitions $p_\theta(y_{t-1} \mid y_t)$ are learned, the denoising network $f_\theta$ is a key component of the inverse process, as it is designed to estimate the clean image from the partially noisy image $y_t$. Through training $f_\theta$, we can estimate the noise added at each step of the diffusion process. Each iteration takes the following form:

$$y_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(y_t - \frac{1-\alpha_t}{\sqrt{1-\gamma_t}}\, f_\theta(y_t, \gamma_t)\right) + \sqrt{1-\alpha_t}\,\epsilon_t, \tag{5}$$

where $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$. Each iteration of the model takes this form, and at the end of the sampling process the target sample $y_0$ is obtained.

So far, we have described a DPM that is trained to learn the data distribution without any conditions. In this work, we focus on using a conditional diffusion model in the form of $p_\theta(y_{t-1} \mid y_t, x)$, where the conditional information $x$ is added to each diffusion step. The inverse process is then represented as:

$$p_\theta(y_{0:T} \mid x) = p(y_T) \prod_{t=1}^{T} p_\theta(y_{t-1} \mid y_t, x), \tag{6}$$

which differs from Eq. (4) by the additional condition $x$. In addition to a noisy sample $y_t$ and the noise level $\gamma_t$, the denoising model $f_\theta$ also accepts the conditional information $x$ as input and is trained to predict the noise vector $\epsilon$.

We employ $x$ as an auxiliary input to optimize $f_\theta$, following a similar approach to that of [22, 34]. The whole process is adapted by incorporating $x$ as an input, and the ultimate training objective is as follows:

$$\min_\theta\ \mathbb{E}_{x,\, y_0,\, \gamma,\, \epsilon}\ \big\|\, f_\theta\big(x,\ \sqrt{\gamma}\, y_0 + \sqrt{1-\gamma}\,\epsilon,\ \gamma\big) - \epsilon \,\big\|_2^2. \tag{7}$$
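To make the formulation concrete, the following is a minimal PyTorch sketch of the forward process of Eq. (2) and the conditional training objective of Eq. (7). It assumes a denoising network f_theta(x, y_t, gamma) that predicts the added noise; the schedule endpoints, the batch handling and the use of an L2 norm are illustrative assumptions rather than the exact training configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def make_gamma_schedule(beta_start=1e-4, beta_end=0.02, T=1000):
    """Linear beta schedule; gamma_t = prod_{i<=t} alpha_i as in Eq. (2).
    The endpoints and T here are placeholders, not the paper's values."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return torch.cumprod(alphas, dim=0)                  # gamma_1, ..., gamma_T

def q_sample(y0, gamma, eps):
    """Forward diffusion of Eq. (2): y_t = sqrt(gamma)*y0 + sqrt(1-gamma)*eps."""
    g = gamma.view(-1, 1, 1, 1)
    return g.sqrt() * y0 + (1.0 - g).sqrt() * eps

def conditional_dpm_loss(f_theta, x_cond, y0, gammas):
    """Training objective of Eq. (7): predict the noise eps from (x, y_t, gamma).
    A continuous noise level is drawn per sample, following Chen et al. [26]."""
    gammas = gammas.to(y0.device)
    b = y0.shape[0]
    t = torch.randint(1, len(gammas), (b,), device=y0.device)
    g_hi, g_lo = gammas[t - 1], gammas[t]                # gamma decreases with t
    gamma = g_lo + torch.rand(b, device=y0.device) * (g_hi - g_lo)
    eps = torch.randn_like(y0)
    y_t = q_sample(y0, gamma, eps)
    eps_hat = f_theta(x_cond, y_t, gamma)                # conditional denoising network
    return F.mse_loss(eps_hat, eps)
```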
Method
The basic principles of the diffusion model have been previously introduced, and the traditional DPM diffuses degraded images directly, requiring a large number of prediction steps of complicated neural networks to estimate the distribution of clean images, which in turn results in a high computational cost. Chung et al. [27] demonstrated that accurate and fast image reconstructions can be produced by providing better initial estimates. To tackle the issue of higher computational costs associated with iterative denoising process, we adopt a simple yet effective approach by providing an initial prediction for the conditional diffusion model, which ensures that the initial state is structurally similar to the clean image. Our research has revealed that learning the residual distribution between the clean image and the initial prediction with the diffusion model is easier than directly learning the distribution of the clean image. This is because the residual distribution can better capture the lost detail information in the clean image, which can be effectively recovered by sampling from the residual distribution learned by the diffusion model. We have also found that the use of high-quality initial predictions as a condition for the diffusion model enhances the quality of the residual distribution learning. Consequently, the ability to recover fine details is also improved. Furthermore, our structure preserved network UmlpNet is highly efficient and cost-effective in generating initial predictions, which helps to further minimize the computational overhead of the overall model.
In summary, we propose PRTD consisting of a structure preserved network and residual diffusion, as shown in Fig. 1. We decompose the task of directly restoring noisy images into two simpler sub-problems. The primary objective of the first sub-problem is to ensure the consistency in the image structure, whereas the second sub-problem aims to generate more details and texture information to satisfy the realism of the data distribution. When both sub-problems are solved effectively, the final results exhibit improved consistency and reality.
Structure preserved network
Many MLP-based networks have been proposed for image processing tasks in recent years. For example, Burger et al. [28] have successfully applied MLP to image denoising. Tu et al. [29] presented a multi-axis MLP-based architecture (MAXIM) that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. Valanarasu et al. [30] proposed a convolutional multilayer perceptron-based network (UNeXt) for image segmentation, which reduces the number of parameters and computational complexity while producing a better result. Inspired by these works, we propose the UmlpNet, a lightweight MLP-based image denoising network that functions as a structure preserved network. It addresses the issue of high computational cost and complexity in many existing image denoising networks.
Figure 2 illustrates the architecture of the complete UmlpNet. To minimize the inter-block complexity, the architecture adopts a single-stage U-shaped network that gradually reduces the spatial size with each layer while exponentially increasing the number of channels. The input image is divided into non-overlapping blocks and fed into the gated MLP (gMLP) block [31]. This block is based on MLPs with gating and leverages a statically parameterized spatial projection, multiplied with the linear channel projection, to extract deeper feature information from the image. An important part of the gMLP block is the Spatial Gating Unit (SGU). To be more specific, let $X$ be the input feature and $Z = \mathrm{GELU}(XU)$ represent the projected features after the GELU activation. The SGU part can be expressed as:

$$\mathrm{SGU}(Z) = Z_1 \odot \big(W\,\mathrm{LN}(Z_2) + b\big), \tag{8}$$

where $Z_1, Z_2$ are two independent parts split from $Z$ along the channel dimension, $\odot$ denotes element-wise multiplication, $\mathrm{LN}(\cdot)$ is layer normalization, $W$ is the spatial projection matrix and $b$ refers to token-specific biases.

Fig. 2 [Images not available. See PDF.]
Architecture of UmlpNet. Serving as a structure preserved network, it is characterized by a single-stage U-shaped network design that integrates efficient Gated MLP (gMLP) blocks, with the Spatial Gating Unit (SGU) serving as the core component of these blocks. The jump connection module is represented by the sMLP block, which combines a channel shuffle operation and an MLP block
To prevent the degradation of network performance with the increasing depth, we integrate the channel shuffle operation [32] into the jump connection component. This operation enhances information flow between channels, ensuring that input and output channels are fully correlated. Specifically, the channels are first divided into subgroups based on the feature map generated from the previous layer, and then, the channel order of the original feature map is disrupted by transposing and flattening as input to the next layer.
We introduce the sMLP block, which combines channel shuffle operation and MLP block, as a jump connection module. This allows low-level features to be aggregated with high-level features before being fed to the decoder, and the channel shuffling operation incurs no additional computational cost. UmlpNet ensures that the denoised image maintains structural consistency with the ground truth, thereby providing a high-quality condition for the residual diffusion model.
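The following is a compact PyTorch sketch of the pieces just described: the Spatial Gating Unit of Eq. (8), the channel shuffle operation, and an sMLP-style skip module. Layer sizes, the number of shuffle groups and the MLP expansion factor are illustrative assumptions; only the overall structure follows the description above.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """SGU of Eq. (8): split Z into (Z1, Z2) along channels and gate Z1 with a
    spatial projection of LayerNorm(Z2)."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)   # W and b act over tokens
        nn.init.zeros_(self.spatial_proj.weight)          # near-identity gate at init
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z):                                  # z: (B, N, C)
        z1, z2 = z.chunk(2, dim=-1)
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)
        return z1 * z2                                     # element-wise gating

def channel_shuffle(x, groups):
    """Shuffle channels between groups so that skip features mix across channels."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class SMLPSkip(nn.Module):
    """Skip-connection module: channel shuffle followed by a small channel MLP."""
    def __init__(self, channels, groups=4, expansion=2):
        super().__init__()
        self.groups = groups
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1), nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1))

    def forward(self, x):
        return self.mlp(channel_shuffle(x, self.groups))
```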
Residual diffusion model
In the second stage, we use a conditional diffusion model of the form $p_\theta(r_{t-1} \mid r_t, \hat{y})$, as shown in the blue box of Fig. 1. Specifically, given a noisy input image $x$ whose underlying clean image is $y$, the structure preserved network generates an initial estimate $\hat{y}$. The forward diffusion process begins with the residual $r_0 = y - \hat{y}$ and gradually adds Gaussian noise to the residual through a fixed Markov chain; it can be expressed as:

$$q(r_t \mid r_{t-1}) = \mathcal{N}\big(r_t;\ \sqrt{\alpha_t}\, r_{t-1},\ (1-\alpha_t)\mathbf{I}\big), \tag{9}$$

where $r_t$ is the noisy residual image at time $t$. As emphasized in Sect. 3, we adopt a modified formulation based on the continuous noise level of Chen et al. [26]; the same technique is also used in other methods [22, 34].

Algorithm 1 Sampling process of residual diffusion model
Next, the inverse diffusion process performs a stepwise recovery of $r_0$. The residual diffusion model takes the noisy residual image $r_t$ and the initial prediction $\hat{y}$ as input, and iteratively learns the conditional transition distribution $p_\theta(r_{t-1} \mid r_t, \hat{y})$ through a reverse Markov chain conditioned on $\hat{y}$. The form of each iteration changes from Eq. (5) to:

$$r_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(r_t - \frac{1-\alpha_t}{\sqrt{1-\gamma_t}}\, f_\theta(\hat{y}, r_t, \gamma_t)\right) + \sqrt{1-\alpha_t}\,\epsilon_t, \tag{10}$$

where $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$ and $f_\theta$ is a neural denoising network with reduced complexity. After the iterations, we obtain the final residual estimate and combine it with the output of the pre-trained structure preserved network to achieve the final result. Based on the above theory, we change the final objective Eq. (7) to:
$$\min_\theta\ \mathbb{E}_{\hat{y},\, r_0,\, \gamma,\, \epsilon}\ \big\|\, f_\theta\big(\hat{y},\ \sqrt{\gamma}\, r_0 + \sqrt{1-\gamma}\,\epsilon,\ \gamma\big) - \epsilon \,\big\|_2^2. \tag{11}$$
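As an illustration of Eq. (11), the sketch below shows one training step of the residual diffusion model, assuming a frozen, pre-trained UmlpNet and a denoiser f_theta that takes the concatenation of the initial prediction and the noisy residual as input (as in Fig. 3). Variable names and the L2 norm are our own choices for the sketch.

```python
import torch
import torch.nn.functional as F

def residual_diffusion_loss(f_theta, umlpnet, x_noisy, y_clean, gammas):
    """One training step implied by Eq. (11): the diffusion target is the
    residual r0 = y - y_hat, conditioned on the initial prediction y_hat."""
    with torch.no_grad():
        y_hat = umlpnet(x_noisy)                      # stage-1 structure preserved output
    r0 = y_clean - y_hat                              # residual to be modeled

    gammas = gammas.to(r0.device)
    b = r0.shape[0]
    t = torch.randint(1, len(gammas), (b,), device=r0.device)
    g_hi, g_lo = gammas[t - 1], gammas[t]
    gamma = (g_lo + torch.rand(b, device=r0.device) * (g_hi - g_lo)).view(-1, 1, 1, 1)

    eps = torch.randn_like(r0)
    r_t = gamma.sqrt() * r0 + (1.0 - gamma).sqrt() * eps     # forward process, Eq. (9)
    eps_hat = f_theta(torch.cat([y_hat, r_t], dim=1), gamma.flatten())
    return F.mse_loss(eps_hat, eps)
```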
The lightweight structure preserved network generates initial predictions that guide the residual diffusion process, enabling the effective reconstruction of missing detailed textures and a reduction in the training cost. Algorithm 1 shows the sampling steps of our proposed residual diffusion model.
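A minimal sketch of this sampling procedure (Algorithm 1) is given below, assuming per-step alphas and their cumulative products gammas from the chosen inference schedule; it follows the update of Eq. (10) and adds the recovered residual back onto the UmlpNet prediction.

```python
import torch

@torch.no_grad()
def sample_prtd(f_theta, umlpnet, x_noisy, alphas, gammas):
    """Sketch of Algorithm 1: start from Gaussian noise, iteratively denoise the
    residual conditioned on the initial prediction, then add the prediction back."""
    alphas, gammas = alphas.to(x_noisy.device), gammas.to(x_noisy.device)
    y_hat = umlpnet(x_noisy)                     # stage-1 initial prediction
    r_t = torch.randn_like(y_hat)                # r_T ~ N(0, I)
    for t in reversed(range(len(alphas))):
        a_t, g_t = alphas[t], gammas[t]
        level = g_t * torch.ones(r_t.shape[0], device=r_t.device)
        eps_hat = f_theta(torch.cat([y_hat, r_t], dim=1), level)
        # mean of the reverse step, Eq. (10)
        r_mean = (r_t - (1.0 - a_t) / (1.0 - g_t).sqrt() * eps_hat) / a_t.sqrt()
        noise = torch.randn_like(r_t) if t > 0 else torch.zeros_like(r_t)
        r_t = r_mean + (1.0 - a_t).sqrt() * noise
    return y_hat + r_t                           # denoised image = prediction + residual
```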
The denoising network uses a U-Net similar to that in DPM [10], as shown in Fig. 3, but we replace its residual blocks with those of BigGAN [33]. Additionally, we rescale the skip (jump) connections and reduce the number of channels and the network depth. We also find that the group normalization and self-attention mechanisms used in [34] are not necessary for the denoising task. With these modifications combined, our denoising network is more computationally efficient than the traditional DPM. The efficiency of the denoising network is critical to the overall computational performance of the model, since it must be run many times during prediction.
Fig. 3 [Images not available. See PDF.]
U-Net architecture diagram used for denoising networks in residual diffusion model. The noisy residual image is concatenated with the initial prediction as input
Experiment
Training details
Initially, we pre-train the structure preserved network to maintain structural consistency in the processed noisy images. UmlpNet employs a 4-level encoder-decoder. From level-1 to level-4, the number of gMLP blocks are , and the initial number of channels is 32. The structure preserved network is trained on patches with a batch size of 16 for iterations. We use AdamW optimizer [35] (, , weight decay ) with the initial learning rate of , which is steadily decreased to using the cosine annealing strategy [36].
Subsequently, we train the residual diffusion model to fine-tune the initial predictions. Our denoising network employs a U-Net similar to SR3 [34] but removes all group normalization layers and self-attention layers, with five-layer channel multipliers of and a starting channel number of 32. The residual diffusion model is trained on patches with a batch size of 24 for 100 epochs. We use the AdamW optimizer [35] with a fixed learning rate of without weight decay and employed a linear noise schedule where the two endpoints were set as follows: and . We set in the training process.
During the inference phase, as described in Sect. 3, we exploit the continuous noise level sampling used throughout training. Consequently, during inference we can employ various noise schedules, potentially yielding samples with different distortion-perception trade-offs. To achieve this, we follow [22] and conduct a grid search over the number of inference steps and the noise variance. For the number of inference steps, we consider 10, 20, 30, 50, 100, 200, 300 and 500. For the noise schedule, we keep the initial forward process variance fixed, while the final variance is explored over the range {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}, with intermediate values linearly interpolated. To determine the optimal hyperparameter combinations, we conducted a grid search on the test results for face images with a noise level of 25. Figure 4 presents the findings of this analysis. Varying the sampling parameters yields distinct sample outcomes: employing more steps with minimal noise levels generally enhances perceptual quality, whereas fewer steps with higher noise levels tend to result in reduced distortion. Nevertheless, we are prepared to accept a minor reduction in the average distortion score to retain finer textures. Consequently, the maximum inference budget is set at 100 diffusion steps, which significantly reduces the number of inference steps and facilitates more efficient sampling.
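A small sketch of how such inference schedules can be generated for the grid search is shown below; the fixed initial variance is left as a placeholder parameter because its exact value is not reproduced here, and the helper names are ours.

```python
import itertools
import torch

def inference_schedule(num_steps, beta_T, beta_1):
    """Linear schedule between a fixed initial variance beta_1 and a searched
    final variance beta_T, returning per-step alphas and cumulative gammas."""
    betas = torch.linspace(beta_1, beta_T, num_steps)
    alphas = 1.0 - betas
    gammas = torch.cumprod(alphas, dim=0)
    return alphas, gammas

# grid of sampling hyperparameters explored in the text
steps = [10, 20, 30, 50, 100, 200, 300, 500]
final_vars = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
schedules = {(T, bT): inference_schedule(T, bT, beta_1=1e-6)   # beta_1: placeholder value
             for T, bT in itertools.product(steps, final_vars)}
```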
Fig. 4 [Images not available. See PDF.]
Sample collection. Samples are collected from diverse sampling parameter settings, including inference steps and noise variance, with corresponding PSNR values derived from a grid search conducted on face images afflicted by noise levels of 25
We use PyTorch to train and test the proposed method for image denoising. All experiments are conducted on a PC equipped with a 13th Gen Intel(R) Core(TM) i5-13490F CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU.
Evaluation metrics
We primarily employed LPIPS (Learned Perceptual Image Patch Similarity) [37] and FID (Fréchet Inception Distance) [38] metrics to evaluate the performance of our method. To provide a comprehensive evaluation of our method, we also employed both PSNR and SSIM [39] metrics, which are commonly used as objective criteria for assessing image quality in many low-level vision tasks. As noted by Menon et al. [40], the PSNR metric has certain limitations. The PSNR metric assesses the image quality based on the peak signal-to-noise ratio, but this value may not always align with the image quality perceived by the human eye. In certain cases, although the PSNR metric may yield a high score, the image may have lost detailed information or contain noticeable distortion that is detectable to the human eye. Furthermore, the sensitivity of the PSNR metric can vary across different images. In cases where images contain complex textures or details, the PSNR metric may assign a lower score even if the distortion is minimal. This can lead to situations where the PSNR metric fails to accurately reflect the overall image quality.
If we rely solely on distortion metrics, we may not be able to fully assess the perceptual quality of the image, including the level of present details. Therefore, it is important to incorporate metrics that are more closely aligned with human perception in order to obtain a more comprehensive evaluation of the image.
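The paper does not state which implementations of these metrics are used; one common way to compute LPIPS and FID in PyTorch is through the lpips and torchmetrics packages, roughly as sketched below (the tensor value ranges and the backbone choice are assumptions).

```python
import torch
import lpips                                              # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance

lpips_fn = lpips.LPIPS(net='alex')                        # expects inputs scaled to [-1, 1]
fid = FrechetInceptionDistance(feature=2048)              # expects uint8 images in [0, 255]

def update_metrics(denoised, reference):
    """denoised / reference: float tensors of shape (B, 3, H, W) with values in [0, 1]."""
    d = lpips_fn(denoised * 2 - 1, reference * 2 - 1).mean()
    fid.update((reference * 255).to(torch.uint8), real=True)
    fid.update((denoised * 255).to(torch.uint8), real=False)
    return d                                              # per-batch LPIPS

# after all test batches have been accumulated:
# fid_score = fid.compute()
```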
Perception-distortion trade-off
In Fig. 5, we present a perception-distortion plot on the McMaster dataset with a noise level of 25, with samples obtained from various sampling parameters, specifically the number of steps and the standard deviation of the noise. Previous studies [22, 34, 41] have observed a fundamental trade-off between perceptual quality and distortion measures. Distortion metrics such as PSNR and SSIM only reward synthesized high-frequency details that closely match the target image. However, because of the inherent stochastic nature of posterior sampling from a generative model, it is virtually unattainable to generate perfectly aligned high-frequency details. Consequently, PSNR and SSIM tend to favor MSE regression-based techniques, which take a conservative approach to synthesizing high-frequency details. In contrast, the majority of outputs generated by the generative model achieve higher scores in terms of sample quality, as evidenced by the FID and LPIPS metrics, but perform worse than MSE-based methods in terms of PSNR and SSIM.
Fig. 5 [Images not available. See PDF.]
Perception-distortion plot under varying sampling parameters on McMaster dataset with noise levels of 25. We include two extremes on the plot-one focuses on optimizing perceptual quality and the other on optimizing distortion using sample averages, representing the two endpoints of the P-D curve
In this study, we prioritize the assessment of the perceptual quality of the output; thus, a marginal reduction in the average distortion score is accepted to achieve a more favorable balance between distortion and perception [41]. The second stage of our method is a generative model from which we draw random posterior samples. It is essential to note that the reference image from the training dataset represents just one of the potential recovery outcomes among various possibilities, primarily because of the ill-posed nature of inverse problems. In a manner akin to previous work [4, 14], our results strike a balance between a certain level of pixel-averaged distortion and realism with respect to the target image. We report a model, labeled 'PRTD', with a perceptual quality at an upper-middle level. It is important to highlight that the FID results we report may not represent the optimal performance, and as such, we do not assert optimality. Furthermore, we also report competitive distortion metrics, labeled 'PRTD-SA', obtained by averaging multiple samples.
Quantitative results
Face image denoising
Based on the requirements of the FID metric, a large number of test images is needed to obtain reliable results. Therefore, to accurately evaluate the detail enhancement effect of PRTD, we use face images, for which a substantial amount of test data is available, for the Gaussian denoising experiments. As the structure preserved network aims to achieve consistency, it is important to train it on a diverse range of image data to ensure that the denoised images are structurally consistent with real images. We use 6576 images from DIV2K [42], BSD500 [43], and the Waterloo Exploration Database [44] to train the structure preserved network. The prediction network only needs to be trained once for each noise level; it is highly generalizable and does not require any additional training on the face dataset. The predictions obtained from the structure preserved network are used as inputs to train the residual diffusion model on Flickr-Faces-HQ (FFHQ) [45], and the evaluation is performed on CelebA-HQ [46].
We compare the proposed method with several state-of-the-art denoising methods, including DnCNN [12], IRCNN [47], FFDNet [48], ADNet [49], DRUNet [50], SwinIR [51], SCUNet [52] and Restormer [13]. The test results of different methods on CelebA-HQ [46] for noise levels 15, 25, 50 and 75 are shown in Table 1. It can be seen that our approach brings a substantial improvement in perceptual metrics (PRTD) while maintaining a highly competitive distortion score (PRTD-SA). Our primary objective is to enhance the visual perceptual quality of images by preserving structural consistency and recovering fine details, rather than merely minimizing the error distance. In particular, PRTD achieves a reduction of approximately 12 in FID compared to SwinIR at a noise level of 25. The denoising results of different methods at each noise level are presented in Figs. 6, 7 and 8. To facilitate the comparison with other methods, specific image blocks are selected and enlarged. It is evident from the results that our proposed method recovers detailed texture features in areas that the other methods smooth out.
Table 1. PSNR, SSIM, LPIPS and FID results of different methods on CelebA-HQ with noise levels of 15, 25, 50 and 75
Noise level: σ = 15

Method | DnCNN [12] | IRCNN [47] | FFDNet [48] | ADNet [49] | DRUNet [50] | SwinIR [51] | SCUNet [52] | Restormer [13] | UmlpNet (ours) | PRTD (ours) | PRTD-SA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
PSNR↑ | 36.33 | 36.39 | 36.48 | 36.58 | 36.89 | 36.99 | 37.27 | 37.19 | 36.77 | 35.86 | 36.92 |
SSIM↑ | 0.941 | 0.943 | 0.945 | 0.948 | 0.952 | 0.95 | 0.953 | 0.953 | 0.951 | 0.937 | 0.952 |
LPIPS↓ | 0.031 | 0.032 | 0.035 | 0.0318 | 0.029 | 0.027 | 0.027 | 0.027 | 0.029 | 0.026 | 0.028 |
FID↓ | 8.24 | 9.69 | 15.15 | 13.83 | 12.39 | 11.74 | 10.85 | 10.89 | 11.92 | 4.71 | 10.10 |

Noise level: σ = 25

Method | DnCNN [12] | IRCNN [47] | FFDNet [48] | ADNet [49] | DRUNet [50] | SwinIR [51] | SCUNet [52] | Restormer [13] | UmlpNet (ours) | PRTD (ours) | PRTD-SA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
PSNR↑ | 33.98 | 34.02 | 34.19 | 34.26 | 34.69 | 34.72 | 34.77 | 34.78 | 34.43 | 33.59 | 34.81 |
SSIM↑ | 0.912 | 0.914 | 0.917 | 0.922 | 0.926 | 0.926 | 0.927 | 0.927 | 0.925 | 0.907 | 0.927 |
LPIPS↓ | 0.053 | 0.053 | 0.062 | 0.055 | 0.051 | 0.049 | 0.050 | 0.050 | 0.051 | 0.048 | 0.053 |
FID↓ | 13.70 | 15.58 | 24.14 | 21.99 | 21.33 | 20.16 | 20.95 | 21.07 | 20.02 | 7.99 | 15.66 |

Noise level: σ = 50

Method | DnCNN [12] | IRCNN [47] | FFDNet [48] | ADNet [49] | DRUNet [50] | SwinIR [51] | SCUNet [52] | Restormer [13] | UmlpNet (ours) | PRTD (ours) | PRTD-SA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
PSNR↑ | 30.72 | 30.83 | 31.11 | 31.16 | 31.76 | 31.74 | 31.83 | 31.85 | 31.43 | 30.67 | 31.83 |
SSIM↑ | 0.852 | 0.857 | 0.865 | 0.868 | 0.882 | 0.882 | 0.883 | 0.886 | 0.878 | 0.854 | 0.883 |
LPIPS↓ | 0.099 | 0.096 | 0.118 | 0.103 | 0.094 | 0.093 | 0.093 | 0.093 | 0.098 | 0.091 | 0.104 |
FID↓ | 33.50 | 25.20 | 39.01 | 31.04 | 34.40 | 35.05 | 34.59 | 33.34 | 33.13 | 18.97 | 25.09 |

Noise level: σ = 75

Method | DnCNN [12] | IRCNN [47] | FFDNet [48] | ADNet [49] | DRUNet [50] | SwinIR [51] | SCUNet [52] | Restormer [13] | UmlpNet (ours) | PRTD (ours) | PRTD-SA (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
PSNR↑ | - | - | 29.27 | 29.34 | - | - | - | - | 29.68 | 28.32 | 29.98 |
SSIM↑ | - | - | 0.826 | 0.825 | - | - | - | - | 0.842 | 0.803 | 0.844 |
LPIPS↓ | - | - | 0.163 | 0.141 | - | - | - | - | 0.134 | 0.125 | 0.132 |
FID↓ | - | - | 48.66 | 35.92 | - | - | - | - | 39.34 | 28.71 | 34.79 |
The best results are marked in bold
Fig. 6 [Images not available. See PDF.]
Visual results of different denoising methods on CelebA-HQ with noise levels of 25. Our method produces images that effectively recover facial freckle details that tend to be smoothed out by other methods
Fig. 7 [Images not available. See PDF.]
Visual results of different denoising methods on CelebA-HQ with noise levels of 50. Despite the presence of significant noise interference, our method can successfully restore the subtle highlights at the center of eyes, a feat beyond other comparable methods
Fig. 8 [Images not available. See PDF.]
Visual results of different denoising methods on CelebA-HQ with noise levels of 75. Our method effectively alleviates artifacts and restores a well-defined eye contour
To showcase the effectiveness of our method in handling high levels of noise, we have also trained a model with a noise level of 100, as illustrated in Fig. 9. The results demonstrate that although the first stage smooths out a portion of the image, the second stage successfully recovers details such as hair strands and hats. Notably, as the noise level increases, our approach still preserves a considerable amount of image structure and detailed features, even in regions contaminated by noise.
Fig. 9 [Images not available. See PDF.]
Visual results of our method on CelebA-HQ with noise levels of 100. Even at elevated noise levels, our method is still superior in removing noise from the input image and preserving more detailed textures
Gaussian color image denoising
We conduct experiments on a benchmark dataset widely used for evaluating image denoising methods to further validate the effectiveness of our method. We train the residual diffusion model on DIV2K [42], BSD500 [43], Flickr2K [53], and Waterloo Exploration Database [44], and evaluate it on CBSD68 [54], Kodak24 [55], and McMaster [56].
The proposed method is compared with the current methods DnCNN [12], IRCNN [47], FFDNet [48], ADNet [49], IPT [57], SCUNet [52] and Restormer [13]. Since the test datasets do not contain enough samples to compute a reliable FID score, we extract non-overlapping patches from the test images for evaluation. The PSNR, SSIM, LPIPS and FID results of different methods on CBSD68, Kodak24, and McMaster for noise levels 15, 25 and 50 are shown in Table 2. In contrast to other methods, PRTD demonstrates superior performance on perceptual metrics, whereas PRTD-SA excels in terms of distortion metrics. In particular, our proposed method achieves notably better performance on the CBSD68 dataset, which is widely used for evaluating image denoising methods. The denoising outcomes of various methods at each noise level are showcased in Figs. 10 and 11. It is apparent that the results obtained from our proposed method recover more detailed texture features compared to other methods. The strong performance of our proposed approach in both distortion and perceptual metrics demonstrates its potential as a practical and efficient method for image denoising.
Table 2. PSNR, SSIM, LPIPS and FID results of different methods on CBSD68, Kodak24, McMaster with noise levels of 15, 25 and 50
Noise level: σ = 15

Dataset | CBSD68 | | | | Kodak24 | | | | McMaster | | | |
Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
DnCNN [12] | 33.82 | 0.929 | 0.059 | 33.50 | 34.91 | 0.921 | 0.081 | 53.10 | 33.52 | 0.900 | 0.065 | 74.01 | |
IRCNN [47] | 33.79 | 0.931 | 0.060 | 35.15 | 35.00 | 0.921 | 0.080 | 51.94 | 34.65 | 0.920 | 0.058 | 69.87 | |
FFDNet [48] | 33.80 | 0.928 | 0.063 | 36.25 | 33.07 | 0.923 | 0.085 | 54.61 | 34.73 | 0.922 | 0.062 | 74.31 | |
ADNet [49] | 33.86 | 0.932 | 0.059 | 33.43 | 35.15 | 0.925 | 0.080 | 50.70 | 34.96 | 0.928 | 0.056 | 66.58 | |
IPT [57] | - | - | - | - | - | - | - | - | - | - | - | - | |
SCUNet [52] | 34.33 | 0.935 | 0.052 | 29.61 | 33.81 | 0.931 | 0.070 | 41.49 | 35.67 | 0.935 | 0.049 | 57.23 | |
Restormer [13] | 34.36 | 0.938 | 0.051 | 29.16 | 35.84 | 0.933 | 0.070 | 41.10 | 35.67 | 0.935 | 0.049 | 58.77 | |
UmlpNet (ours) | 33.96 | 0.935 | 0.054 | 29.79 | 35.28 | 0.928 | 0.073 | 41.29 | 34.82 | 0.928 | 0.052 | 59.30 | |
PRTD (ours) | 32.21 | 0.905 | 0.045 | 23.79 | 33.60 | 0.901 | 0.059 | 35.65 | 33.37 | 0.902 | 0.042 | 48.45 | |
PRTD-SA (ours) | 34.19 | 0.935 | 0.053 | 29.97 | 35.47 | 0.929 | 0.070 | 42.13 | 35.07 | 0.930 | 0.049 | 57.45 | |
Noise level: σ = 25

Dataset | CBSD68 | | | | Kodak24 | | | | McMaster | | | |
Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
DnCNN [12] | 31.16 | 0.882 | 0.104 | 55.29 | 32.54 | 0.878 | 0.127 | 84.48 | 31.61 | 0.870 | 0.096 | 108.93 | ||
IRCNN [47] | 31.10 | 0.882 | 0.105 | 57.36 | 32.55 | 0.879 | 0.126 | 86.33 | 32.28 | 0.883 | 0.089 | 105.10 | ||
FFDNet [48] | 31.14 | 0.881 | 0.118 | 63.69 | 32.67 | 0.880 | 0.139 | 88.29 | 32.45 | 0.887 | 0.099 | 108.85 | ||
ADNet [49] | 31.19 | 0.887 | 0.105 | 56.06 | 32.74 | 0.883 | 0.125 | 79.76 | 32.62 | 0.893 | 0.088 | 100.86 | ||
IPT [57] | - | - | - | - | - | - | - | - | - | - | - | - | ||
SCUNet [52] | 31.72 | 0.893 | 0.093 | 47.20 | 33.50 | 0.894 | 0.112 | 62.16 | 33.44 | 0.906 | 0.077 | 82.24 | ||
Restormer [13] | 31.74 | 0.898 | 0.092 | 47.96 | 33.52 | 0.897 | 0.111 | 61.51 | 33.44 | 0.909 | 0.076 | 84.08 | ||
UmlpNet (Ours) | 31.30 | 0.891 | 0.097 | 48.13 | 32.91 | 0.889 | 0.114 | 62.99 | 32.54 | 0.896 | 0.083 | 83.05 | ||
PRTD (Ours) | 30.81 | 0.881 | 0.097 | 42.79 | 32.33 | 0.877 | 0.110 | 59.65 | 31.89 | 0.878 | 0.077 | 72.56 | ||
PRTD-SA (Ours) | 31.64 | 0.892 | 0.098 | 46.55 | 33.11 | 0.890 | 0.113 | 63.03 | 33.09 | 0.899 | 0.086 | 82.30 | ||
Noise level: σ = 50

Dataset | CBSD68 | | | | Kodak24 | | | | McMaster | | | |
Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
DnCNN [12] | 27.85 | 0.787 | 0.205 | 99.06 | 29.37 | 0.793 | 0.226 | 145.21 | 28.73 | 0.799 | 0.163 | 175.40 | ||
IRCNN [47] | 27.79 | 0.788 | 0.198 | 97.12 | 29.38 | 0.796 | 0.217 | 142.60 | 29.06 | 0.809 | 0.147 | 156.91 | ||
FFDNet [48] | 27.88 | 0.786 | 0.241 | 117.11 | 29.57 | 0.795 | 0.256 | 146.62 | 29.30 | 0.816 | 0.178 | 165.52 | ||
ADNet [49] | 27.93 | 0.800 | 0.206 | 97.92 | 29.66 | 0.799 | 0.221 | 133.17 | 29.46 | 0.823 | 0.155 | 155.11 | ||
IPT [57] | 28.36 | 0.807 | 0.185 | 83.45 | 29.64 | 0.799 | 0.222 | 135.42 | 29.88 | 0.839 | 0.148 | 126.86 | ||
SCUNet [52] | 28.52 | 0.810 | 0.182 | 82.45 | 30.56 | 0.825 | 0.198 | 97.55 | 30.41 | 0.851 | 0.134 | 122.07 | ||
Restormer [13] | 28.54 | 0.816 | 0.177 | 82.86 | 30.57 | 0.825 | 0.192 | 97.01 | 30.42 | 0.854 | 0.130 | 120.36 | ||
UmlpNet (ours) | 28.11 | 0.804 | 0.185 | 82.46 | 30.00 | 0.812 | 0.199 | 97.02 | 29.60 | 0.834 | 0.147 | 120.50 | ||
PRTD (ours) | 27.79 | 0.792 | 0.172 | 76.97 | 29.55 | 0.799 | 0.193 | 96.85 | 29.19 | 0.816 | 0.146 | 116.88 | ||
PRTD-SA (ours) | 28.48 | 0.808 | 0.193 | 83.53 | 30.24 | 0.815 | 0.201 | 100.02 | 29.86 | 0.839 | 0.150 | 119.81 | ||
The best results are marked in bold
Fig. 10 [Images not available. See PDF.]
Visual results of different denoising methods on CBSD68 with noise levels of 25. Our approach effectively reinstates the fine details of the chicken's feathers
Fig. 11 [Images not available. See PDF.]
Visual results of different denoising methods on McMaster with noise levels of 50. Our method produces perceptually sharper images, skillfully restoring wall details that tend to be smoothed out by alternative methods
Real image denoising
For denoising real-world noisy images, we train all of our networks using the SIDD [58] dataset. The test results are compared with those of the current methods including DnCNN [12], MLP [28], BM3D [1], WNNM [4], CBDNet [59], RIDNet [60], DANet+ [61], CycleISP [62], MPRNet [63], MIRNet [65] and Uformer [64]. Table 3 presents the average PSNR, SSIM, LPIPS, and FID results of different methods on SIDD dataset for comparison. It is evident that our proposed method surpasses all other methods in terms of perceptual metrics. In Fig. 12, we present a comparison of the denoising results with other methods, where our proposed method shows a better recovery of details at the edges and in dark areas. However, since the real noise dataset used for training is limited, the method may not be as visually effective as in Gaussian noise denoising. Nevertheless, the method still proves to be effective in terms of perceptual quality.
Table 3. PSNR, SSIM, LPIPS and FID results of different methods on SIDD dataset
Method | DnCNN [12] | MLP [28] | BM3D [1] | WNNM [4] | CBDNet [59] | RIDNet [60] | DANet + [61] |
|---|---|---|---|---|---|---|---|
PSNR↑ | 23.66 | 24.71 | 25.65 | 25.78 | 30.78 | 38.71 | 39.47 |
SSIM↑ | 0.583 | 0.641 | 0.685 | 0.809 | 0.801 | 0.951 | 0.957 |
LPIPS↓ | – | – | – | – | – | 0.221 | 0.210 |
FID↓ | – | – | – | – | – | 63.82 | 49.57 |
Method | CycleISP [62] | MPRNet [63] | MIRNet [65] | Uformer [64] | UmlpNet (ours) | PRTD (ours) | PRTD-SA (ours) |
|---|---|---|---|---|---|---|---|
PSNR↑ | 39.52 | 39.71 | 39.72 | 39.77 | 39.65 | 39.07 | 39.71 |
SSIM↑ | 0.957 | 0.958 | 0.959 | 0.959 | 0.958 | 0.915 | 0.958 |
LPIPS↓ | 0.210 | 0.203 | 0.202 | 0.202 | 0.202 | 0.157 | 0.203 |
FID↓ | 51.98 | 49.55 | 47.72 | 47.19 | 45.98 | 32.87 | 48.23 |
The best results are marked in bold
Fig. 12 [Images not available. See PDF.]
Visual results of different denoising methods on SIDD. Our method steers clear of an overly smooth output and instead excels in achieving a more precise reconstruction of detailed textures
Ablation studies
We conduct ablation experiments on face images with a noise level of 25 to validate the effectiveness of our proposed method. Next, we analyze three aspects of the proposed method: initial prediction consistency, residual diffusion and network complexity.
We use the noisy images, the non-local means (NLM) [66] denoised images, and the structure preserved network (UmlpNet) outputs, respectively, as conditions for the residual diffusion, with identical experimental settings. The corresponding results are presented in Table 4. Our findings show that the UmlpNet condition outperforms the other two in terms of both perceptual and distortion metrics, owing to the superior condition employed. The three conditions can be interpreted as a completely degraded image, an over-smoothed image with some remaining noise, and an over-smoothed image without any noise. As shown in Fig. 13, conditioning the residual diffusion process on noisy images or NLM-denoised images may lead to incomplete recovery of certain fine details and may require a longer training time. Therefore, the higher the consistency of the conditional images within the same training time, the better the structure and detailed texture of the images generated by residual diffusion. The performance of the entire method does not rely solely on residual diffusion to achieve optimal results. The output of the structure preserved network is also critical: while the initial predictions may lack details, they exhibit a reasonable degree of image consistency that minimizes the discrepancy with the actual image. The experimental results presented above demonstrate that our proposed UmlpNet performs exceptionally well, achieving high average PSNR and SSIM values in both Gaussian noise denoising and real noise denoising. As a result, PRTD guarantees structural consistency in the initial stage, significantly reducing the gap between the conditional image and the target image. This not only enhances the recovery of fine details but also reduces the number of required diffusion iterations.
Table 4. Results of residual diffusion model in different conditions on CelebA-HQ with noise levels of 25
Condition | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
Noisy image | 32.33 | 0.890 | 0.055 | 18.85 |
Denoised image by NLM | 31.65 | 0.875 | 0.063 | 21.95 |
Denoised image by UmlpNet | 33.59 | 0.907 | 0.049 | 7.99 |
The best results are marked in bold
Fig. 13 [Images not available. See PDF.]
Visual results of residual diffusion model in different conditions on CelebA-HQ with noise levels of 25. a, e Reference. b Noisy image. c NLM result. d UmlpNet result. f–h are the respective results of b–d as the residual diffusion conditions differ
We separately apply the second-stage conditional diffusion model to clean and residual images, and the results are presented in Table 5. The use of residual diffusion exhibits better consistency and realism, while also allowing for the extraction of more detailed features.
Table 5. Results of conditional diffusion model in different initial images on CelebA-HQ with noise levels of 25
Initial image | PSNR | SSIM | LPIPS | FID |
|---|---|---|---|---|
Clean image | 32.71 | 0.897 | 0.041 | 11.20 |
Residual image | 33.59 | 0.907 | 0.049 | 7.99 |
The best results are marked in bold
The network complexity is also a critical factor affecting the computational cost. Therefore, we have reduced the complexity of both the structure preserved network and the denoising network of the residual diffusion. In the first stage, the structure preserved network only needs to ensure structural consistency and does not require a time-consuming denoising network that achieves very high distortion scores. Therefore, instead of using an image denoising network with a high computational cost (e.g. SwinIR), we propose UmlpNet to obtain the structure-preserving initial prediction within a lower computational budget. Table 6 compares the computational cost of UmlpNet with that of other methods. It is apparent that our structure preserved network achieves lower computational costs. Furthermore, we also report the inference time of the second-stage residual diffusion model and of PRTD at T = 100.
Table 6. MACs, parameters and inference time of different methods
Method | DnCNN | ADNet | DRUNet | MPRNet | Uformer | SwinIR | UmlpNet | Residual diffusion model (T = 100) | PRTD (T = 100) |
|---|---|---|---|---|---|---|---|---|---|
MACs (G) | 37 | 35 | 80 | 588 | 89 | 759 | 18 | – | – |
Parameters (M) | 0.56 | 0.52 | 19.64 | 15.47 | 50.88 | 11.5 | 35.54 | – | – |
Inference time (s) | 0.049 | 0.045 | 0.075 | 0.081 | 0.050 | 0.559 | 0.043 | 0.893 | 0.936 |
In Table 7, we present the inference times for the second stage of the residual diffusion model, with sampling steps ranging from 100 to 500. Notably, the sampling time grows with the number of sampling steps. As highlighted in Sect. 5.1, our evaluation revealed minimal differences in the perceptual quality of the denoising outcomes between 100 and 150 steps. Consequently, to optimize inference time, we opted for T = 100.
Table 7. Inference time in second-stage residual diffusion models with different sampling steps
Step | T = 100 | T = 150 | T = 200 | T = 300 | T = 500 |
|---|---|---|---|---|---|
Inference time (s) | 0.893 | 1.523 | 1.794 | 2.504 | 3.642 |
In the second stage, we conducted experiments on the key modules of the denoising network to reduce its complexity and computational cost. The experimental results for each module are presented in Table 8. During the experiments, we found that when using GroupNorm (group normalization) [67], improper settings of the number of channels per group can lead to significant color bias. Moreover, using only group normalization or only the attention mechanism can significantly improve the perceptual metrics, such as LPIPS and FID, but may also require a longer training time. If both are used at the same time, color bias may occur, resulting in a lower PSNR. Therefore, it is important to strike a balance between distortion quality and perceptual quality to achieve optimal results. In the end, we remove both GroupNorm and the attention mechanism to reduce the network complexity. However, we found that using only one residual block leaves residual noise spots in the result, causing incomplete noise removal. Therefore, we use two residual blocks to ensure better denoising performance.
Table 8. Results for each module on CelebA-HQ with noise levels of 25
Module | Metrics | |||||
|---|---|---|---|---|---|---|
Resblock | Attention | GroupNorm | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
1 | No | No | 32.32 | 0.871 | 0.040 | 16.13 |
2 | No | No | 32.71 | 0.895 | 0.042 | 13.29 |
2 | Yes | No | 30.47 | 0.869 | 0.046 | 12.66 |
2 | No | Yes | 30.56 | 0.863 | 0.042 | 12.65 |
2 | Yes | Yes | 23.93 | 0.810 | 0.053 | 13.65 |
Conclusion
We propose a two-stage image denoising method that aims to effectively recover the high-frequency detail lost by pixel-wise denoising models while preserving the main structure of the image, thus enhancing visual perceptual quality. To achieve this goal, we make improvements on both the deep neural network and the conditional diffusion model. Specifically, we introduce an initial prediction into the residual diffusion model to estimate the distribution of clean images more accurately. This constraint reduces the discrepancy with clean images and effectively reduces the sampling cost. As a result, the denoised images not only exhibit structural consistency but also remain realistic in their details, even at high noise levels. Denoising experiments conducted on various datasets demonstrate that the proposed method yields images with clear structures and textures, resulting in a significant improvement in metrics closely aligned with human perception.
Funding
This study was funded by the National Natural Science Foundation of China (Grant No. 62061049, Grant No. 12263008), the Yunnan Provincial Department of Science and Technology–Yunnan University Joint Special Project for Double-Class Construction (Grant No. 202201BF070001-005) and the Practical Innovation Fund Project for Professional Degree Graduate Students of Yunnan University (Grant No. ZC-22221881).
Data availability
All data generated or analyzed during this study are included in this published article.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Dabov, K; Foi, A; Katkovnik, V; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process.; 2007; 16,
2. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Non-local sparse models for image restoration. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 2272–2279 (2009). https://doi.org/10.1109/ICCV.2009.5459452
3. Dong, W; Zhang, L; Shi, G; Li, X. Nonlocally centralized sparse representation for image restoration. IEEE Trans. Image Process.; 2012; 22,
4. Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2862–2869 (2014).
5. Schmidt, U., Roth, S.: Shrinkage fields for effective image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2774–2781 (2014).
6. Chen, Y., Yu, W., Pock, T.: On learning optimized reaction diffusion processes for effective image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5261–5269 (2015).
7. Goodfellow, I; Pouget-Abadie, J; Mirza, M; Xu, B; Warde-Farley, D; Ozair, S; Courville, A; Bengio, Y. Generative adversarial networks. Commun. ACM; 2020; 63,
8. Divakar, N., Venkatesh Babu, R.: Image denoising via CNNs: An adversarial approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 80–87 (2017).
9. Alsaiari, A., Rustagi, R., Thomas, M. M., Forbes, A. G.: Image denoising using a generative adversarial network. In: 2019 IEEE 2nd International Conference on Information and Computer Technologies (ICICT), pp. 126–132 (2019). https://doi.org/10.1109/INFOCT.2019.8710893
10. Ho, J; Jain, A; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst.; 2020; 33, pp. 6840-6851.
11. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265 (2015).
12. Zhang, K; Zuo, W; Chen, Y; Meng, D; Zhang, L. Beyond a gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process.; 2017; 26,
13. Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M. H.: Restormer: efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5728–5739 (2022).
14. Ohayon, G., Adrai, T., Vaksman, G., Elad, M., Milanfar, P.: High perceptual quality image denoising with a posterior sampling cgan. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1805–1813 (2021).
15. Chen, J., Chen, J., Chao, H., Yang, M.: Image blind denoising with generative adversarial network based noise modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3155–3164 (2018).
16. Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. arXiv preprint arXiv: https://arxiv.org/abs/1609.03499 (2016).
17. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A.: Conditional image generation with pixelcnn decoders. In: Advances in Neural Information Processing Systems, vol. 29 (2016).
18. Prakash, M., Krull, A., Jug, F.: Fully unsupervised diversity denoising with convolutional variational autoencoders. arXiv preprint arXiv: https://arxiv.org/abs/2006.06072 (2020).
19. Lugmayr, A., Danelljan, M., Timofte, R.: NTIRE 2021 learning the super-resolution space challenge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 596–612 (2021).
20. Lugmayr, A., Danelljan, M., Van Gool, L., Timofte, R.: Srflow: learning the super-resolution space with normalizing flow. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp. 715–732. Springer International Publishing (2020).
21. Özdenizci, O; Legenstein, R. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; [DOI: https://dx.doi.org/10.1109/TPAMI.2023.3238179]
22. Whang, J., Delbracio, M., Talebi, H., Saharia, C., Dimakis, A. G., Milanfar, P.: Deblurring via stochastic refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16293–16303 (2022).
23. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv: https://arxiv.org/abs/2010.02502 (2020).
24. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv: https://arxiv.org/abs/2206.00927 (2022).
25. Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., Schön, T. B.: Image Restoration with Mean-Reverting Stochastic Differential Equations. arXiv preprint arXiv: https://arxiv.org/abs/2301.11699 (2023).
26. Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., Chan, W.: Wavegrad: estimating gradients for waveform generation. arXiv preprint arXiv: https://arxiv.org/abs/2009.00713 (2020).
27. Chung, H., Sim, B., Ye, J. C.: Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12413–12422 (2022).
28. Burger, H. C., Schuler, C. J., Harmeling, S.: Image denoising: can plain neural networks compete with BM3D?. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399 (2012). https://doi.org/10.1109/CVPR.2012.6247952
29. Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5769–5780 (2022).
30. Valanarasu, J. M. J., Patel, V. M.: Unext: Mlp-based rapid medical image segmentation network. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, pp. 23–33. Springer, Cham (2022).
31. Liu, H; Dai, Z; So, D; Le, QV. Pay attention to mlps. Adv. Neural. Inf. Process. Syst.; 2021; 34, pp. 9204-9215.
32. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018).
33. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv: https://arxiv.org/abs/1809.11096 (2018).
34. Saharia, C; Ho, J; Chan, W; Salimans, T; Fleet, DJ; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell.; 2022; [DOI: https://dx.doi.org/10.1109/TPAMI.2022.3204461]
35. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv: https://arxiv.org/abs/1711.05101 (2017).
36. Loshchilov, I., Hutter, F.: Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv: https://arxiv.org/abs/1608.03983 (2016).
37. Zhang, R., Isola, P., Efros, A. A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018).
38. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems, 30 (2017).
39. Wang, Z; Bovik, AC; Sheikh, HR; Simoncelli, EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.; 2004; 13,
40. Menon, S., Damian, A., Hu, S., Ravi, N., Rudin, C.: Pulse: self-supervised photo upsampling via latent space exploration of generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2445 (2020).
41. Blau, Y., & Michaeli, T. The perception-distortion tradeoff. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6228–6237 (2018).
42. Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135 (2017).
43. Arbelaez, P; Maire, M; Fowlkes, C; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.; 2010; 33,
44. Ma, K; Duanmu, Z; Wu, Q; Wang, Z; Yong, H; Li, H; Zhang, L. Waterloo exploration database: new challenges for image quality assessment models. IEEE Trans. Image Process.; 2016; 26,
45. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019).
46. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv: https://arxiv.org/abs/1710.10196 (2017).
47. Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3929–3938 (2017).
48. Zhang, K; Zuo, W; Zhang, L. FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Trans. Image Process.; 2018; 27,
49. Tian, C; Xu, Y; Li, Z; Zuo, W; Fei, L; Liu, H. Attention-guided CNN for image denoising. Neural Netw.; 2020; 124, pp. 117-129. [DOI: https://dx.doi.org/10.1016/j.neunet.2019.12.024]
50. Zhang, K; Li, Y; Zuo, W; Zhang, L; Van Gool, L; Timofte, R. Plug-and-play image restoration with deep denoiser prior. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44,
51. Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: image restoration using swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844 (2021).
52. Zhang, K; Li, Y; Liang, J; Cao, J; Zhang, Y; Tang, H; Gool, LV. Practical blind image denoising via Swin-Conv-UNet and data synthesis. Mach. Intell. Res.; 2023; 20,
53. Timofte, R., Agustsson, E., Van Gool, L., Yang, M. H., Zhang, L.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114–125 (2017).
54. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings 8th IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, IEEE pp. 416–423 (2001).
55. Franzen, R.: Kodak lossless true color image suite. source: http://r0k.us/graphics/kodak, 4(2) (1999).
56. Zhang, L; Wu, X; Buades, A; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imaging; 2011; 20,
57. Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., et al.: Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310 (2021).
58. Abdelhamed, A., Lin, S., Brown, M. S.: A high-quality denoising dataset for smartphone cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1692–1700 (2018).
59. Guo, S., Yan, Z., Zhang, K., Zuo, W., Zhang, L.: Toward convolutional blind denoising of real photographs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1712–1722 (2019).
60. Anwar, S., Barnes, N.: Real image denoising with feature attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3155–3164 (2019).
61. Yue, Z., Zhao, Q., Zhang, L., Meng, D.: Dual adversarial network: toward real-world noise removal and noise generation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pp. 41–58. Springer (2020).
62. Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M. H., Shao, L.: Cycleisp: real image restoration via improved data synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2696–2705 (2020)
63. Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M. H., Shao, L.: Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14821–14831 (2021).
64. Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: a general u-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17683–17693 (2022).
65. Zamir, SW; Arora, A; Khan, S; Hayat, M; Khan, FS; Yang, MH; Shao, L. Learning enriched features for fast image restoration and enhancement. IEEE Trans. Pattern Anal. Mach. Intell.; 2022; 45,
66. Buades, A., Coll, B., & Morel, J. M. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 60–65 (2005).
67. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018).