1. Introduction
In recent years, single image super-resolution (SISR) has attracted considerable attention from researchers in the field of computer vision. SISR aims to reconstruct a high-resolution image $I_{HR}$ from a single low-resolution image $I_{LR}$ [1], and it has been widely used in many fields, such as remote sensing [2], medical imaging [3], and environmental monitoring [4,5,6,7]. To our knowledge, interpolation based on sampling theory was the earliest approach to the super-resolution problem; however, it has serious shortcomings in predicting fine details and realistic textures. To address this problem, techniques that learn the mapping between $I_{LR}$ and $I_{HR}$ have been proposed, such as neighbor embedding [8,9,10,11] and sparse coding [12,13,14,15,16]. In the last few years, deep learning-based approaches to super-resolution have emerged rapidly [16,17,18,19,20]. Dong et al. first applied convolutional neural networks (CNNs) to super-resolution [18], with satisfactory results in practical use. Later, SRResNet (residual network for super-resolution) [20] was designed based on the well-known residual network ResNet [19]. Benefiting from skip connections and a recursive structure, deeper networks become easier to realize and achieve better performance. To simplify SRResNet, the enhanced deep super-resolution network (EDSR) [1] was proposed by Lim et al., which optimizes the architecture of the residual blocks by removing unnecessary modules. Although these ResNet-based models can improve reconstruction quality through deeper layers, they all face the same problem: a sharp increase in the number of parameters. Especially in engineering practice, the cost of a large number of residual blocks and parameters has hampered the wider use of ResNet-based models. Therefore, how to reduce the number of model parameters without losing reconstruction quality has become one of the most active research issues.
Nowadays, various methods have been reported to reduce the number of parameters [21,22,23,24]. Network pruning, SVD (singular value decomposition), and the split-transform-merge strategy are three representative methods. In 1990, LeCun et al. first proposed the concept of network pruning, which decreases the model size by cutting off redundant parameters of the neural network [21]. This method requires a lot of iterative training to preserve network performance. In 2014, Denton et al. proposed the SVD method to reduce the number of weights [22]. In the SVD method, a complex weight matrix is approximated by the product of smaller, simpler submatrices, which can significantly reduce network parameters. However, as the matrix scale increases, computing the singular values becomes complicated and expensive. In recent years, the split-transform-merge strategy has attracted increasing attention from researchers. Based on this strategy, the Inception models were developed with less computational complexity and fewer parameters [23]. In the Inception models, the input is split into several low-dimensional embeddings (by 1×1 convolutions), transformed through a set of specialized filters (3×3, 5×5, etc.), and finally merged by concatenation [24]. However, because the hyperparameters of each branch need to be set individually, it is hard to find a simple design method for constructing an Inception network. In 2016, Xie et al. proposed the ResNeXt network [24] based on aggregated transformations, which can be regarded as an improvement of the split-transform-merge strategy. However, ResNeXt was originally designed for image classification; therefore, its structure must be adapted and optimized when applying it to super-resolution.
In this paper, an efficient multibranch residual network for the super-resolution task is proposed. The multibranch architecture is built on the basis of aggregated transformations. Meanwhile, we optimize the residual block with reference to EDSR. Based on the proposed network structure, two specific models are established as examples in this work. Experiments show that our models achieve good reconstruction quality with a significant reduction in network parameters.
2. Related Work
Inception: The Inception network is a typical multibranch architecture based on the split-transform-merge strategy. Each branch in the network is carefully designed to gain good performance in terms of speed and accuracy. However, the customized size and number of each filter in the branch make the Inception network hard to implement.
SRResNet: SRResNet is a super-resolution reconstruction network inspired by the residual network [20]. Based on the original residual structure, it removes the activation layer after the residual block and obtains good perceptual reconstruction results.
EDSR: EDSR is a state-of-the-art super-resolution network which further modifies the residual block structure of SRResNet [1]. Since BN (batch normalization) layers remove range flexibility from the network and consume a lot of memory, EDSR removes the two BN layers in each residual block. Benefiting from this structural modification, EDSR achieves great improvements in image reconstruction and a reduction in graphics processing unit (GPU) memory usage.
ResNeXt: Based on the residual block architecture, ResNeXt exploits the split-transform-merge strategy in an easy, extensible way, namely, aggregated residual transformations [24]. This method stacks a series of homogeneous, multibranch residual blocks with only a few hyperparameters to set [24]. Each branch of ResNeXt performs its own set of convolutions, and the branches merge at the end of the block. Compared with ResNet, ResNeXt shows better performance and lower computational complexity in the task of image classification.
Grouped convolution: Grouped convolution was first proposed in the AlexNet paper [25] in 2012. The stated motivation was to distribute the model across two GPUs to overcome the limited hardware resources of a single GPU. Grouped convolution divides the input feature maps into groups, convolves each group independently (originally on separate GPUs), and then concatenates the results.
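As a minimal illustration (a sketch in PyTorch, which is our assumed framework here), the `groups` argument of a standard convolution layer reproduces this behavior and cuts the weight count accordingly:

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution over 256 channels vs. the same convolution
# split into 32 groups. Weight counts (biases disabled):
#   standard: 3*3*256*256      = 589,824
#   grouped:  3*3*(256/32)*256 =  18,432  (each filter sees only 8 channels)
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32, bias=False)

print(sum(p.numel() for p in standard.parameters()))  # 589824
print(sum(p.numel() for p in grouped.parameters()))   # 18432

x = torch.randn(1, 256, 48, 48)
assert standard(x).shape == grouped(x).shape  # identical output shape
```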
3. Methods
EDSR has achieved good results in the super-resolution field, but it offers little improvement in parameter count compared with other algorithms. To reduce the number of parameters, the aggregated transformation method is applied to EDSR in this paper. This method, by which the multibranch architecture of a network can be built in an easy way, was originally presented in ResNeXt. It can reduce the parameter count and time complexity without significantly decreasing the accuracy of image classification.
A simple and obvious approach is to transform EDSR directly into a multibranch architecture by the aggregated transformation method. However, the original residual block of EDSR, with only two convolution layers, is inconsistent with the aggregated transformation method [24]. This direct transformation would result in a wide and dense model, which not only brings no benefit but adds more complexity. To solve this issue, we must redesign the model with a multibranch architecture, in which three or more convolution layers are required in the residual block. To simplify the structure of the residual block and enhance the feature extraction capability, we adopted three convolution layers in this work. Compared with the original residual block shown in Figure 1a, our rebuilt residual block removes the unnecessary rectified linear unit (ReLU) and BN layers with reference to the EDSR structure. This removal helps improve the performance of image reconstruction.
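The sketch below gives our reading of the rebuilt residual block in Figure 1c as PyTorch code; the exact placement of the ReLU layers between the three convolutions is inferred from the figure and the EDSR convention, so treat the class and its defaults (taken from the EDSRSP-1×1 configuration in Table 1) as an assumption rather than the definitive implementation:

```python
import torch.nn as nn

class EDSRSPBlock(nn.Module):
    """Sketch of the rebuilt residual block: three convolutions with ReLU
    only between them, no BN, and an identity skip connection. Defaults
    follow the EDSRSP-1x1 configuration of Table 1 (1x1 / grouped 3x3 /
    1x1); set k=3 and mid_ch=256 for the EDSRSP-3x3 variant."""

    def __init__(self, in_ch=256, mid_ch=512, groups=32, k=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=k, padding=k // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, kernel_size=k, padding=k // 2),
        )

    def forward(self, x):
        return x + self.body(x)  # no ReLU or BN after the addition
```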
As shown in Figure 1, the convolutional layer (Conv) performs feature extraction, and ReLU rectifies the layer outputs. The BN layer normalizes the features, and Addition denotes the element-wise addition that merges the skip connection with the output of the block.
It is also known from the experiments of Lim et al. [1] that increasing the number of feature maps above a certain level makes the training process numerically unstable. The typical solution is to place a constant scaling layer (also called a MulConstant layer) after the last convolutional layer of each residual block. Owing to the use of aggregated transformations, the number of feature maps per convolution layer can be significantly reduced in comparison with the original EDSR model; therefore, the model proposed in this paper does not require the constant scaling layer. The results in the following Experiment section show that adding a constant scaling layer actually worsens performance. After removing the constant scaling layer, the architecture of our multibranch network is modeled as shown in Figure 2. The detailed structure of ResBlock (residual block) is given in Figure 1c, and Upsample (upsampling structure) magnifies the image to the desired scale.
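For reference, the constant scaling layer under discussion is trivial to express; a minimal sketch, assuming the usual EDSR formulation in which it scales the residual branch just before the addition:

```python
import torch.nn as nn

class MulConstant(nn.Module):
    """Constant scaling layer placed after the last convolution of an
    EDSR residual block to stabilize training; our model omits it."""

    def __init__(self, factor=0.1):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor
```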
As shown in Figure 3, we designed two configurations for our multibranch architecture: EDSRSP-3×3 and EDSRSP-1×1, where the suffix denotes the kernel size of the first and third convolution layers. The configuration of the residual block in EDSRSP-3×3 is the same as that in EDSR, i.e., a 3×3 convolution kernel with 256-d input and 256-d output. As seen from Table 1, the number of parameters in EDSRSP-3×3 is reduced by about 1/3 compared with EDSR. To further decrease the parameters, the configuration of EDSRSP-1×1 is adjusted as shown in Figure 3b: the first and third layers use a 1×1 convolution kernel, and the second layer uses a 512-d input and output. EDSRSP-1×1 is similar to the bottleneck structure of ResNet, with only a small modification of the output dimension of the first layer. Due to the use of the 1×1 convolution kernel, the number of parameters in EDSRSP-1×1 is reduced to about one-fifth of that in EDSR.
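The per-block parameter counts behind Table 1 can be reproduced with a short script (bias terms omitted, matching the table; the helper `conv_weights` is ours):

```python
def conv_weights(c_in, c_out, k, groups=1):
    # weight count of a k x k convolution; biases omitted, as in Table 1
    return k * k * (c_in // groups) * c_out

edsr = 32 * 2 * conv_weights(256, 256, 3)                   # ~37,749 K
edsrsp_3x3 = 21 * (conv_weights(256, 256, 3)
                   + conv_weights(256, 256, 3, groups=32)
                   + conv_weights(256, 256, 3))             # ~25,160 K
edsrsp_1x1 = 21 * (conv_weights(256, 512, 1)
                   + conv_weights(512, 512, 3, groups=32)
                   + conv_weights(512, 256, 1))             # ~7,053 K
print(edsr, edsrsp_3x3, edsrsp_1x1)  # 37748736 25159680 7053312
```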
For the implementation of the aggregated transformation, our model has two equivalent structures, as shown in Figure 4. The two structures deliver the same reconstruction performance, but the structure based on grouped convolution (Figure 4b) has distinct advantages in time complexity and memory usage. Therefore, we use grouped convolution to realize the aggregated transformation.
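The equivalence of Figure 4a,b can be checked numerically: the sketch below packs the weights of 32 independent branch convolutions into a single grouped convolution (shapes follow the middle layer of EDSRSP-1×1) and verifies that the two forms produce identical outputs:

```python
import torch
import torch.nn as nn

groups, width = 32, 512
per_branch = width // groups  # 16 channels per branch

# (a) Multibranch: 32 independent 3x3 convolutions on channel slices.
branches = [nn.Conv2d(per_branch, per_branch, 3, padding=1, bias=False)
            for _ in range(groups)]

# (b) One grouped convolution with the branch weights stacked along dim 0.
grouped = nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False)
with torch.no_grad():
    grouped.weight.copy_(torch.cat([b.weight for b in branches], dim=0))

x = torch.randn(1, width, 24, 24)
out_a = torch.cat([b(s) for b, s in zip(branches, x.chunk(groups, dim=1))],
                  dim=1)
out_b = grouped(x)
print(torch.allclose(out_a, out_b, atol=1e-6))  # True
```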
4. Experiment
4.1. Datasets
For our experiment, the newly proposed Diverse 2K (DIV2K) dataset [26] is used due to its high-quality (2K) resolution for the image reconstruction tasks. The DIV2K dataset consists of 800 training images, 100 validation images, and 100 test images. Since the test dataset ground truth has not been published, the performance comparison was made on the validation dataset. We also compared the performance on three standard benchmark datasets: Set5 [9], Set14 [12], and B100 [27].
4.2. PSNR and SSIM Criteria
Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are the two most widely used indicators in the field of super-resolution reconstruction; both measure the similarity between the reconstructed image and the original high-resolution image [28,29]. The mathematical expression of PSNR is as follows:

$$\mathrm{PSNR} = 10\log_{10}\frac{(2^{n}-1)^{2}}{\mathrm{MSE}} \tag{1}$$

where $n$ is the number of bits per pixel, and the mean square error (MSE) is defined as shown below:

$$\mathrm{MSE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\big[X(i,j)-Y(i,j)\big]^{2} \tag{2}$$

where $X$ and $Y$ represent the original and reconstructed images, respectively, both of size $H \times W$, and $(i,j)$ stands for the pixel coordinate. The larger the value of PSNR, the better the image reconstruction.

SSIM is another popular criterion for comparing the reconstructed image $x$ with the original high-resolution image $y$. The formula of SSIM is as follows:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})} \tag{3}$$

where $\mu_{x}$ and $\mu_{y}$ are the mean values of $x$ and $y$, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are their variances, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $C_{1}=(k_{1}L)^{2}$ and $C_{2}=(k_{2}L)^{2}$ are constants that maintain formula validity by keeping the denominator away from zero, where $L$ represents the dynamic range of the pixel values, with $k_{1}=0.01$ and $k_{2}=0.03$ by default. The larger the value of SSIM, the better the similarity of the two images.
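A direct NumPy implementation of Equations (1)-(3) is sketched below; note that `ssim_global` evaluates Equation (3) once over the whole image, whereas standard SSIM implementations average the statistic over local sliding windows:

```python
import numpy as np

def psnr(x, y, n_bits=8):
    """Equations (1) and (2): peak signal-to-noise ratio in dB between
    the original image x and the reconstructed image y."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mse = np.mean((x - y) ** 2)
    peak = 2 ** n_bits - 1
    return 10 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    """Equation (3), evaluated once over the whole image."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```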
4.3. Training Details

For training, we use and adjust the training parameters given by Lim et al. [1]. Neither the pre-training model nor the geometric self-ensemble strategy is used in this training. The chop size is set to 4.0 × 10⁴, and the patch size for the ×3 and ×4 models is set to 96. We also built on the code published with the EDSR paper and trained the models using NVIDIA Titan Xp GPUs. Following the official baseline model, the EDSR model used for comparison is retrained with no modifications other than those mentioned above. It takes seven days to train EDSR, compared with three days for our models.
4.4. Comparison between the Cases with and without MulConstant Layer
To analyze the effect of the MulConstant layer in our designed residual block, we performed experiments on the EDSRSP-1×1 ×4 model and the EDSRSP-3×3 ×2 model. For each model, three cases were compared: (1) without the MulConstant layer; (2) with the MulConstant factor set to 0.1; (3) with the MulConstant factor set to 0.01. The experimental results in Figure 5 show that removing the MulConstant layer from our model results in better performance.
4.5. Evaluation on DIV2K Dataset
For the performance evaluation, a comparison between the retrained EDSR model and our models is made and shown in Figure 6. The detailed evaluation method is described in Lim et al. [1]. Using the PSNR and SSIM criteria, the evaluation is conducted on 10 images of the DIV2K validation set. Concretely, we use the full RGB channels and ignore (6 + scale) pixels from the border. The small difference between EDSR and our models verifies the performance of the proposed method.
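Concretely, the border-cropping step can be written as follows (a sketch reusing the `psnr` function defined above; `evaluate_div2k` is a hypothetical helper, with `sr` and `hr` as H×W×3 arrays):

```python
def evaluate_div2k(sr, hr, scale):
    """PSNR over full RGB channels with a (6 + scale)-pixel border
    ignored, following the evaluation protocol of Lim et al. [1]."""
    b = 6 + scale
    return psnr(sr[b:-b, b:-b, :], hr[b:-b, b:-b, :])
```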
Table 2 gives the PSNR and SSIM scores of EDSR and our models on the DIV2K validation set, where the results are consistent with those in Figure 6. In addition, visual comparisons of the super-resolution images are shown in Figure 7. It can be seen, intuitively, that our models produce high quality in both details and textures.

We also measured the running time on the pictures in Figure 7. The experimental results are shown in Table 3. As can be seen from the data in the table, the proposed models run faster than EDSR.
4.6. Evaluation on Other Datasets
More experiments were conducted on the standard datasets B100, Set5, and Set14. For comparison, we measured PSNR and SSIM on the y-channel, ignoring the same number of border pixels as the scale factor. The MATLAB evaluation code provided with the EDSR paper was used. As can be seen from Table 4, our models achieve the same level of performance as EDSR with substantially fewer parameters.
It can be seen from the experimental results that, while ensuring reconstruction quality, the proposed models have obvious advantages in time complexity and space complexity. This also means reduced demand for hardware resources in practical applications, which makes our models easier to deploy in real conditions.
5. Conclusions
In this paper, we propose an efficient super-resolution network based on aggregated residual transformations. Based on the proposed network, two specific models were designed and built in this work. Each of the two models has its own advantages regarding reconstruction performance and the number of parameters. Experiments on both the DIV2K and other standard datasets were conducted to evaluate the performance of our network. The experimental results show that our method is effective and easy to implement. Compared with EDSR, the number of parameters is significantly reduced while maintaining the same level of performance.
Author Contributions
Conceptualization, G.Z.; Methodology, G.Z.; Investigation, G.Z. and H.W.; Writing—Original Draft, G.Z. and W.Z.; Writing—Review & Editing, G.Z. and Y.L.; Project Administration, G.Z. and H.W.; Supervision, H.W. and Y.L.; Software, M.Z. and H.Q.
Funding
This research is supported by the China Postdoctoral Science Foundation No. 2018M633471 and the AeroSpace T.T. and C. Innovation Program.
Conflicts of Interest
The authors declare no conflict of interest.
Figures and Tables
Figure 1. Comparison of residual blocks in the original ResNet, enhanced deep super-resolution network (EDSR), and our model. (a) Original ResNet residual block; (b) EDSR residual block; (c) Our proposed residual block.
Figure 4. Equivalent building blocks of EDSRSP-1×1. (a) Aggregated residual transformations. (b) A block equivalent to (a), implemented as grouped convolutions.
Figure 5. (a) Peak signal-to-noise ratio (PSNR) validation of EDSRSP-1×1 ×4 models. (b) PSNR validation of EDSRSP-3×3 ×2 models.
Figure 6. (a) Validation PSNR of EDSR ×2 model and proposed ×2 models. (b) Validation PSNR of EDSR ×3 model and proposed ×3 models. (c) Validation PSNR of EDSR ×4 model and proposed ×4 models.
Table 1. Parameters of EDSR and our models.
| Model | Number of Residual Blocks | Residual Block Configuration (input, kernel, output) | Total Parameters of Residual Blocks |
|---|---|---|---|
| EDSR | 32 | 256, 3×3, 256; 256, 3×3, 256 | ~37,749 K |
| EDSRSP-3×3 | 21 | 256, 3×3, 256; 256, 3×3, 256 (groups = 32); 256, 3×3, 256 | ~25,160 K |
| EDSRSP-1×1 | 21 | 256, 1×1, 512; 512, 3×3, 512 (groups = 32); 512, 1×1, 256 | ~7,053 K |
Table 2. Performance comparison between architectures on the DIV2K validation set (PSNR (dB)/SSIM).
| Dataset | Scale | EDSR | EDSRSP-3×3 | EDSRSP-1×1 |
|---|---|---|---|---|
| DIV2K | ×2 | 35.80/0.9676 | 35.71/0.9673 | 35.60/0.9670 |
| | ×3 | 32.17/0.9345 | 32.06/0.9337 | 31.99/0.9331 |
| | ×4 | 30.07/0.9057 | 29.97/0.9050 | 29.88/0.9045 |
Table 3. Running time (s) comparison between EDSR and proposed models.
| Scale | EDSR | EDSRSP-3×3 | EDSRSP-1×1 |
|---|---|---|---|
| ×2 | 12.562 | 9.966 | 6.472 |
| ×3 | 7.700 | 6.348 | 4.665 |
| ×4 | 4.426 | 3.363 | 2.442 |
Table 4. Public benchmark test results (PSNR (dB)/SSIM).
| Dataset | Scale | EDSR | EDSRSP-3×3 | EDSRSP-1×1 |
|---|---|---|---|---|
| Set5 | ×2 | 38.08/0.960 | 38.04/0.9599 | 37.99/0.9598 |
| | ×3 | 34.59/0.9275 | 34.48/0.9267 | 34.40/0.9261 |
| | ×4 | 32.36/0.8950 | 32.21/0.8937 | 32.15/0.8926 |
| Set14 | ×2 | 33.71/0.9185 | 33.65/0.9180 | 33.58/0.9169 |
| | ×3 | 30.35/0.8435 | 30.32/0.8428 | 30.24/0.8412 |
| | ×4 | 28.60/0.7831 | 28.57/0.7821 | 28.51/0.7809 |
| B100 | ×2 | 32.30/0.9009 | 32.24/0.9004 | 32.20/0.8995 |
| | ×3 | 29.20/0.8080 | 29.16/0.8067 | 29.12/0.8055 |
| | ×4 | 27.64/0.7390 | 27.60/0.7378 | 27.57/0.7366 |
© 2019 by the authors.
Abstract
In this paper, we propose an efficient multibranch residual network for single image super-resolution. Based on the idea of aggregated transformations, the split-transform-merge strategy is exploited to implement the multibranch architecture in an easy, extensible way. By this means, both the number of parameters and the time complexity are significantly reduced. In addition, to ensure high-performance super-resolution reconstruction, the residual block is modified and simplified with reference to the enhanced deep super-resolution network (EDSR) model. Moreover, our method possesses the advantages of flexibility and extensibility, which help to establish a specific network according to practical demands. Experimental results on both the Diverse 2K (DIV2K) and other standard datasets show that the proposed method achieves good performance in comparison with EDSR under the same number of convolution layers.





