1. Introduction
Remote sensing satellites are important tools for monitoring processes such as vegetation and land cover changes on the Earth's surface [1,2,3]. Because of technological limitations in sensor design [4], compromises have to be made between spatial and temporal resolution. For example, the Moderate Resolution Imaging Spectroradiometer (MODIS) revisits the Earth once a day at a 500-m spatial resolution, whereas the Landsat Enhanced Thematic Mapper Plus (ETM+) has a spatial resolution of 30 m but a revisit period of 16 days. This trade-off restricts the application of remote sensing to problems that require images high in both spatial and temporal resolution. Spatiotemporal reflectance fusion models [5] have therefore been developed to fuse image data from different sensors and obtain images of high spatiotemporal resolution.
The Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) [6] is a pioneering fusion model based on a weighting method. This model computes the center pixel at a point in time from its neighboring pixels with a weighting function, where the weights are determined by spectral difference, temporal difference and location distance. Furthermore, Zhu et al. [7] proposed an Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model (ESTARFM) based on the STARFM algorithm to predict the surface reflectance of heterogeneous landscapes. Another improved STARFM method is the Spatial Temporal Adaptive Algorithm for mapping Reflectance Change (STAARCH) [8], which is designed to detect disturbance and changes in reflectance by using Tasseled Cap transformations. However, the performance of the weighting methods is constrained because linear combination smooths out the changing terrestrial contents.
Another type of reflectance fusion method, known as dictionary learning methods, has been proposed to overcome this shortcoming of the weighting methods. Dictionary-based methods that use known dictionaries, such as wavelets and shearlets, have proved efficient in multisensor and multiresolution image fusion [9,10,11]. In remote sensing data analysis, Moigne et al. [12] and Czaja et al. [13] proposed remote sensing image fusion methods based on wavelets and wavelet packets, respectively. The shearlet transform is also used in a fusion algorithm in [14] because shearlets share the same optimality properties and enjoy similar geometrical properties. Using the capability of dictionary learning and sparsity-based methods in super resolution analysis, Huang et al. [15] proposed a Sparse-representation-based Spatiotemporal Reflectance Fusion Model (SPSTFM) that integrates sparse representation and reflectance fusion by establishing correspondences between structures in high resolution images and their corresponding low resolution images through a dictionary pair and sparse coding. SPSTFM assumes that high and low resolution images of the same area have the same sparse coefficients. Such an assumption is, however, too restrictive [16]. Based on this idea, Wu et al. [17] proposed the Error-Bound-regularized Semi-Coupled Dictionary Learning (EBSCDL) model, which assumes that the representation coefficients of the image pair have a stable mapping and that the coefficients of the dictionary pair have perturbations in the reconstruction step. Attempts have been made to improve the performance of SCDL-based models. For example, Block Sparse Bayesian Learning for Semi-Coupled Dictionary Learning (BSBL-SCDL) [18] employs the structural sparsity of the sparse coefficients as a priori knowledge, and Compressed Sensing for Spatiotemporal Fusion (CSSF) [19] explicitly considers the down-sampling process within the framework of compressed sensing for reconstruction. In comparison with the weighting methods, the advantage of the dictionary-learning-based methods is that they retrieve the hidden relationship between image pairs in the sparse coding space to better capture structure changes.
Besides the aforementioned methods, some researchers employed other approaches to fuse multi-source data. Unmixing techniques have been suggested for spatiotemporal fusion because of their ability to reconstruct images with high spectral fidelity [20,21,22,23,24]. Considering the mixed-class spectra within a coarse pixel, Xu et al. [25] proposed the Class Regularized Spatial Unmixing (CRSU) model. This method is based on the conventional spatial unmixing technique but is modified to include prior class spectra estimated by the known image pairs. To provide a formal statistical framework for fusion, Xue et al. [26] proposed Spatiotemporal Bayesian Data Fusion (STBDF) that makes use of the joint distribution to capture implicitly the temporal changes of images for the estimation of the high resolution image at a target point in time.
Considering structure similarity in spectral bands, structure information has been employed in pan-sharpening and image fusion. Shi et al. [27], for example, proposed a learning interpolation method for pan-sharpening by expanding sketch information of the high-resolution panchromatic (PAN) image which contains the structure features of the PAN image. Glasner et al. [28] verified that many structures in a natural image are similar at the same and different scales. Inspired by this, a self-learning approach was proposed by Khateri et al. [29] which uses similar structures at different levels to pan-sharpen the low resolution multi-spectral images. In multi-modality image fusion, Zhu et al. [30] proposed a method which decomposes images into cartoon and texture components, and preserves the structure information of two components based on spatial-based method and sparse representation, respectively.
However, none of these spatiotemporal fusion methods considers the structure similarity between spectral bands in the fusion procedure. Although different bands have different reflectance ranges, their edge information is still similar [31]. A reconstruction model can obviously perform better if such information is effectively used to predict the unknown high resolution image. Otherwise, the dictionary pair obtained from the training image pair is inefficient for predicting the unknown images because of the lack of information at the target time. This can be explained from the experience in machine learning that the ℓ1 norm is too restrictive in encoding unknown data in the prediction process because it only uses the sparsity structure of the dictionary [32,33]. Therefore, the reconstruction model needs a replacement for the ℓ1 norm to reduce the impact of insufficient information and to improve the representation ability of the dictionary pair.
We propose a new model in this paper to enhance spatiotemporal fusion performance. Our model uses the edge information in different bands via adaptive multi-band constraints to improve the reconstruction performance. To overcome the disadvantage of the ℓ1 norm, the nuclear norm is adopted as the regularization term to increase the efficiency of the learnt dictionary pair. The nuclear norm considers not only sparsity but also correlation, producing a suitable coefficient that harmonizes the sparse and collaborative representations adaptively [32,33].
Overall, the main contributions of this work can be summarized as follows.
1. The multi-band constraints are employed to reinforce the structure similarity of different bands in spatiotemporal fusion.
2. Considering that the degree of structure similarity differs between pairs of bands, adaptive regularization parameters are proposed to weight each multi-band constraint accordingly.
3. The nuclear norm is employed to replace the ℓ1 norm in the reconstruction model because the nuclear norm considers both the sparsity and the correlation of the dictionaries and can overcome the disadvantage of the ℓ1 norm.
The remainder of this paper is organized as follows. Our method for spatiotemporal fusion, called adaptive multi-band constraints fusion model (AMCFM), is proposed in Section 2. Section 3 discusses the experiments carried out to assess the effectiveness of the AMCFM and four state-of-the-art methods in terms of statistics and visual effects. We then conclude the paper with a summary and direction for future research in Section 4.
2. Methodology
2.1. Problem Definition
In the following, MODIS images are selected as low resolution images and Landsat ETM+ images are selected as high resolution images. As shown in Figure 1, our spatiotemporal fusion model requires three low resolution images M1, M2 and M3, and two high resolution images L1 and L3. The high resolution image L2 is the target image that we want to predict. Let Lij (Lij = Li − Lj) and Mij (Mij = Mi − Mj) be the high and low resolution difference images between ti and tj (i, j ∈ {1, 2, 3}), respectively. We assume that changes of remote sensing images between two points in time are linear. For effectiveness, the dictionary pair Dl and Dm is trained by the difference image pair L31 and M31 [15]. Then, the high resolution difference image L21 can be produced by using the dictionary pair to encode the corresponding low resolution difference image M21. L32 can be obtained in the same way. Finally, the high resolution image at time t2 can be predicted as follows:
$$L_2 = W_{21} \ast (L_1 + L_{21}) + W_{32} \ast (L_3 - L_{32}). \tag{1}$$
Figure 1. Input images and the target image for spatiotemporal fusion (t1 < t2 < t3). Three low resolution images M1, M2 and M3, and two high resolution images L1 and L3 are known. The high resolution image L2 is the target image to be predicted.
The weights W21 and W32 we use are the same as those in [19], which take the average of the two predicted difference images.
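For concreteness, the prediction in Equation (1) is a simple weighted combination of the two one-sided estimates. The following Python sketch illustrates the operation on NumPy arrays; the function name is our own and the equal weights correspond to the averaging used in [19]:

```python
import numpy as np

def predict_target(L1, L3, L21, L32, w21=0.5, w32=0.5):
    """Equation (1): fuse the forward prediction (L1 + L21) and the
    backward prediction (L3 - L32) of the target high resolution image.
    Equal weights correspond to the simple averaging taken from [19]."""
    return w21 * (L1 + L21) + w32 * (L3 - L32)

# Example with random stand-in images (300 x 300, one band)
L1, L3 = np.random.rand(300, 300), np.random.rand(300, 300)
L21, L32 = np.random.rand(300, 300), np.random.rand(300, 300)
L2_hat = predict_target(L1, L3, L21, L32)
```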
2.2. Dictionary Learning Fusion Model
As mentioned above, the conventional dictionary learning fusion models are usually comprised of two steps: the dictionary pair training step and the reconstruction step. The whole process is performed on each band separately. Here, we show the mathematical formulation of these two steps in SPSTFM [15], which is the initial dictionary learning model.
1. Dictionary pair training step:
In the training step, the difference image pair, L31 and M31, is used to train the high resolution dictionary Dl and the corresponding low resolution dictionary Dm as follows:
$$[D_l, D_m] = \arg\min_{D_l, D_m}\; \|Y - D_l A\|_2^2 + \|X - D_m A\|_2^2 + \lambda \|A\|_1, \tag{2}$$
where Y and X are the column combinations of the lexicographically stacked image patches, sampled randomly from L31 and M31, respectively; A is the column combination of the representation coefficients corresponding to every column in Y and X; and λ is the regularization parameter. We adopt the K-SVD (K is the abbreviation for K-means and SVD for Singular Value Decomposition) algorithm [34] to solve for Dl and Dm in Equation (2). (A code sketch of both the training and the reconstruction steps is given after this list.)
2. Reconstruction step:
Then, Dm is used to encode each patch of M21 and the sparse coding coefficient α is obtained by solving the optimization problem:
$$\alpha^* = \arg\min_{\alpha}\; \frac{1}{2}\|m_{21} - D_m \alpha\|_2^2 + \lambda \|\alpha\|_1, \tag{3}$$
where m21 is a patch of M21. The corresponding patch of the high resolution image can be produced by
$$l_{21} = D_l\, \alpha^*. \tag{4}$$
Then, all patches are merged to obtain the high resolution difference image L21. L32 can be obtained in the same way, and the target image L2 can be predicted through Equation (1).
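As referenced above, the following Python sketch illustrates both steps on vectorized difference patches. The paper trains the pair with K-SVD [34] and does not prescribe a particular ℓ1 solver; scikit-learn's DictionaryLearning and Lasso are used here only as readily available stand-ins, and all function and variable names are our own:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Lasso

def train_dictionary_pair(Y, X, n_atoms=256, lam=0.15):
    """Joint training of Equation (2): stack the high (Y) and low (X) resolution
    difference patches so that they share one coefficient matrix A, learn a single
    dictionary, then split it into Dl and Dm. Stand-in for K-SVD [34]."""
    stacked = np.vstack([Y, X])                      # (2 * patch_dim, n_patches)
    model = DictionaryLearning(n_components=n_atoms, alpha=lam)
    model.fit(stacked.T)                             # sklearn expects samples in rows
    D = model.components_.T                          # (2 * patch_dim, n_atoms)
    return D[:Y.shape[0], :], D[Y.shape[0]:, :]      # Dl, Dm

def encode_and_reconstruct(m21_patch, Dm, Dl, lam=0.15):
    """Equations (3)-(4): sparse-code a low resolution difference patch on Dm,
    then synthesize the high resolution patch with Dl. Note that sklearn's Lasso
    scales the data-fit term by 1/(2 * n_samples), so lam is not exactly the
    lambda of Equation (3)."""
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    coder.fit(Dm, m21_patch)
    return Dl @ coder.coef_
```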
2.3. Adaptive Multi-Band Constraints Fusion Model
Our model uses the same strategy for dictionary pair training and focuses on the improvement of the reconstruction step. We propose the following model for spatiotemporal fusion by replacing the ℓ1 norm with the nuclear norm ‖·‖∗ and incorporating the multi-band constraints. The reconstruction formulation is given as follows:
$$\begin{aligned}
[\alpha_N^*, \alpha_R^*, \alpha_G^*] = \arg\min\; & \sum_{c \in \{N,R,G\}} \Big( \frac{1}{2}\|m_c - D_{m_c}\alpha_c\|_2^2 + \lambda \|D_{m_c}\,\mathrm{Diag}(\alpha_c)\|_* \Big) \\
& + \tau_{NR}\|S D_{l_N}\alpha_N - S D_{l_R}\alpha_R\|_2^2 + \tau_{RG}\|S D_{l_R}\alpha_R - S D_{l_G}\alpha_G\|_2^2 \\
& + \tau_{GN}\|S D_{l_G}\alpha_G - S D_{l_N}\alpha_N\|_2^2,
\end{aligned} \tag{5}$$
where λ, τNR, τRG and τGN are the regularization parameters; ‖M‖∗ is the nuclear norm of M, i.e., the sum of all the singular values of the matrix M. For a vector v, Diag(v) represents a diagonal matrix whose diagonal elements are the corresponding elements of v. S is a high-pass detector filter; here, we choose the two-dimensional Laplacian operator. The subscripts N, R and G denote the near-infrared (NIR), red and green bands, respectively.
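The high-pass operator S can be written as an explicit matrix acting on lexicographically stacked patches, so that S Dl α is a plain matrix–vector product. The sketch below builds such a matrix for the 2-D Laplacian on a 7 × 7 patch; the zero-padding at the patch border is our assumption, as the boundary treatment is not specified here:

```python
import numpy as np

def laplacian_operator(patch_size=7):
    """Matrix form of the 2-D Laplacian high-pass filter S acting on
    lexicographically stacked (vectorized) patches, with zero padding."""
    n = patch_size
    S = np.zeros((n * n, n * n))
    for r in range(n):
        for c in range(n):
            i = r * n + c
            S[i, i] = -4.0                      # center of the Laplacian stencil
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:  # neighbors inside the patch
                    S[i, rr * n + cc] = 1.0
    return S
```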
The dictionary pair Dl and Dm is trained by the difference images L31 and M31, which do not contain sufficient information about the images at time t2. When reconstructing L21 or L32, if the model only uses the ℓ1 norm regularization, the performance is unsatisfactory. It is more reasonable to integrate the sparsity and correlation of the dictionaries. The nuclear norm term is exactly the kind of regularization that can adaptively balance sparsity and correlation via a suitable coefficient. As shown in [33,35],
$$\|\alpha_c\|_2 \le \|D_{m_c}\,\mathrm{Diag}(\alpha_c)\|_* \le \|\alpha_c\|_1, \tag{6}$$
where all columns of Dmc have unit norm. When the column vectors of Dmc are orthogonal, ‖Dmc Diag(αc)‖∗ is equal to ‖αc‖1. When the column vectors of Dmc are highly correlated, ‖Dmc Diag(αc)‖∗ will be close to ‖αc‖2 [35]. Generally, remote sensing images in the dictionary Dmc are neither too independent nor too correlated because the test images and training images can contain highly correlated information (e.g., stable land cover) as well as independent information (e.g., land-cover change). Therefore, as shown in Equation (6), our model can benefit from both the ℓ1 norm and the ℓ2 norm. The advantage of the nuclear norm is that it can capture the correlation structure of the training images, which the ℓ1 norm cannot.
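The interpolation behaviour in Equation (6) is easy to check numerically. The snippet below, given only as an illustration, evaluates ‖Dmc Diag(αc)‖∗ for a dictionary with orthonormal columns, where it equals the ℓ1 norm, and for a dictionary with highly correlated unit-norm columns, where it approaches the ℓ2 norm:

```python
import numpy as np

def trace_lasso_norm(D, alpha):
    # ||D Diag(alpha)||_*: sum of the singular values of D @ diag(alpha)
    return np.linalg.svd(D @ np.diag(alpha), compute_uv=False).sum()

rng = np.random.default_rng(0)
alpha = rng.standard_normal(5)

# Orthonormal columns: the trace-lasso norm equals the l1 norm of alpha
D_orth, _ = np.linalg.qr(rng.standard_normal((20, 5)))
print(trace_lasso_norm(D_orth, alpha), np.abs(alpha).sum())

# Highly correlated unit-norm columns: the norm approaches the l2 norm
base = rng.standard_normal((20, 1))
D_corr = base + 1e-3 * rng.standard_normal((20, 5))
D_corr /= np.linalg.norm(D_corr, axis=0)
print(trace_lasso_norm(D_corr, alpha), np.linalg.norm(alpha))
```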
The last three terms in the model are the multi-band regularization terms. Taking the NIR band as an example, DlN αN denotes a high resolution patch of the NIR band and S DlN αN stands for the edges in the patch. These terms make the sparse codes (the codes may not be sparse under the nuclear norm regularization, but, for convenience and without confusion, we still call them sparse codes) in different bands no longer independent and reinforce the structure similarity of different bands.
Nevertheless, the nuclear norm regularization and multi-band regularization make the model more complicated to solve. In Section 2.5, we propose a method to solve it efficiently.
2.4. Adaptive Parameters between Bands
The ranges of reflectance in different bands of remote sensing images are different. In natural images, the range of the three channels is [0, 255]. Table 1 implicitly shows the range differences of the three bands in terms of mean and standard deviation. Obviously, the structures are more similar when the means and standard deviations are closer. Based on this rationale, we propose an adaptive regularization parameter as follows:
$$\tau_{NR} = \gamma \cdot 10^{-\frac{\min(\bar{m}_N + \sigma_N,\; \bar{m}_R + \sigma_R)}{\max(\bar{m}_N + \sigma_N,\; \bar{m}_R + \sigma_R)}}, \tag{7}$$
where m̄c is the mean value of band c and σc is the standard deviation of band c; γ is a parameter that controls the magnitude; τRG and τGN can be obtained from the same definition.
[ Table 1 omitted. See PDF. ]
This parameter estimates the distribution of the reflectance values of the two bands and produces a suitable weight adaptively. When two bands have similar reflectance values, the parameter is close to γ. When the difference between two bands increases, the parameter decreases exponentially. The more similar two bands are, the more important a role the corresponding term plays in the model. This property fits intuition.
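A small sketch of the adaptive parameter computation is given below. It follows the literal form of Equation (7) as reconstructed above (the exact exponent in the published formula may differ), with γ = 10 as chosen in Section 4.3; the function name and inputs are illustrative:

```python
import numpy as np

def adaptive_tau(band_a, band_b, gamma=10.0):
    """Adaptive regularization parameter between two bands, computed from
    their reflectance means and standard deviations as in the reconstructed
    Equation (7)."""
    sa = band_a.mean() + band_a.std()
    sb = band_b.mean() + band_b.std()
    return gamma * 10.0 ** (-(min(sa, sb) / max(sa, sb)))
```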
2.5. Optimization of the AMCFM
In this section, we solve the reconstruction model. For the optimization, a simplification is made first. We introduce the following vectors and matrices:
$$\alpha = \begin{bmatrix}\alpha_N \\ \alpha_R \\ \alpha_G\end{bmatrix}, \quad
m = \begin{bmatrix}m_N \\ m_R \\ m_G\end{bmatrix}, \quad
D_m = \begin{bmatrix}D_{m_N} & 0 & 0 \\ 0 & D_{m_R} & 0 \\ 0 & 0 & D_{m_G}\end{bmatrix},$$
$$A = \begin{bmatrix}
(\tau_{GN}+\tau_{NR})\,D_{l_N}^T S^T S D_{l_N} & -\tau_{NR}\,D_{l_N}^T S^T S D_{l_R} & 0 \\
0 & (\tau_{NR}+\tau_{RG})\,D_{l_R}^T S^T S D_{l_R} & -\tau_{RG}\,D_{l_R}^T S^T S D_{l_G} \\
-\tau_{GN}\,D_{l_G}^T S^T S D_{l_N} & 0 & (\tau_{RG}+\tau_{GN})\,D_{l_G}^T S^T S D_{l_G}
\end{bmatrix}, \tag{8}$$
where α and m are concatenations of the sparse coefficients and the low resolution image patches, respectively, and Dm is a dictionary that contains the low resolution dictionaries of the three bands on its diagonal. We also define B as:
$$B = \frac{1}{2} D_m^T D_m + A.$$
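To make the block structure concrete, the following sketch assembles Dm, A and B with NumPy for the three bands. The function and variable names are our own; the per-band dictionaries and the filter matrix S (e.g., the Laplacian operator sketched in Section 2.3) are assumed to be given:

```python
import numpy as np

def build_system(Dm_bands, Dl_bands, S, taus):
    """Assemble the concatenated dictionary D_m, the coupling matrix A and
    B = 1/2 D_m^T D_m + A of Section 2.5. Dm_bands/Dl_bands are dicts keyed
    by 'N', 'R', 'G'; taus holds tau_NR, tau_RG and tau_GN."""
    def G(Dx, Dy):                       # shorthand for Dx^T S^T S Dy
        return Dx.T @ S.T @ S @ Dy

    DN, DR, DG = Dl_bands['N'], Dl_bands['R'], Dl_bands['G']
    k = DN.shape[1]                      # atoms per band (assumed equal across bands)
    Zk = np.zeros((k, k))
    A = np.block([
        [(taus['GN'] + taus['NR']) * G(DN, DN), -taus['NR'] * G(DN, DR), Zk],
        [Zk, (taus['NR'] + taus['RG']) * G(DR, DR), -taus['RG'] * G(DR, DG)],
        [-taus['GN'] * G(DG, DN), Zk, (taus['RG'] + taus['GN']) * G(DG, DG)],
    ])
    p = Dm_bands['N'].shape[0]           # low resolution patch dimension
    Zp = np.zeros((p, k))
    Dm = np.block([
        [Dm_bands['N'], Zp, Zp],
        [Zp, Dm_bands['R'], Zp],
        [Zp, Zp, Dm_bands['G']],
    ])
    B = 0.5 * Dm.T @ Dm + A
    return Dm, A, B
```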
Then, Equation (5) can be simplified as follows:
$$\alpha^* = \arg\min_{\alpha}\; \alpha^T B \alpha - m^T D_m \alpha + \lambda \|D_m\,\mathrm{Diag}(\alpha)\|_*. \tag{9}$$
Here, we use the alternating direction method of multipliers (ADMM) [36,37,38] algorithm to approximate the optimal solution of Equation (9). The optimization problem can be written as follows:
$$\min_{\alpha}\; \alpha^T B \alpha - m^T D_m \alpha + \lambda \|Z\|_* \quad \text{s.t.} \quad D_m\,\mathrm{Diag}(\alpha) - Z = 0, \tag{10}$$
where Z is an auxiliary (splitting) variable introduced in the ADMM algorithm. The augmented Lagrangian function L of the optimization problem is given as
$$\mathcal{L} = \alpha^T B \alpha - m^T D_m \alpha + \lambda \|Z\|_* + \frac{\rho}{2}\|D_m\,\mathrm{Diag}(\alpha) - Z + \mu\|_F^2, \tag{11}$$
where ρ is a positive scalar and μ is the scaled dual variable. The ADMM consists of the following iterations:
$$\begin{aligned}
\alpha^{k+1} &= \arg\min_{\alpha}\; \alpha^T B \alpha - m^T D_m \alpha + \frac{\rho}{2}\|D_m\,\mathrm{Diag}(\alpha) - Z^k + \mu^k\|_F^2, \\
Z^{k+1} &= \arg\min_{Z}\; \lambda \|Z\|_* + \frac{\rho}{2}\|D_m\,\mathrm{Diag}(\alpha^{k+1}) - Z + \mu^k\|_F^2, \\
\mu^{k+1} &= \mu^k + D_m\,\mathrm{Diag}(\alpha^{k+1}) - Z^{k+1}.
\end{aligned} \tag{12}$$
To minimize the augmented Lagrangian function, we solve each subproblem in Equation (12) by fixing the other two variables alternately. For the step of updating α^{k+1}, it can be deduced as follows:
$$\begin{aligned}
\alpha^{k+1} &= \arg\min_{\alpha}\; \frac{\rho}{2}\Big(\alpha^T \mathrm{Diag}\big(\mathrm{diag}(D_m^T D_m)\big)\alpha - 2\big(\mathrm{diag}(Z^T D_m + \mu^T D_m)\big)^T \alpha\Big) + \alpha^T B \alpha - m^T D_m \alpha \\
&= \arg\min_{\alpha}\; \alpha^T \Big(\frac{\rho}{2}\mathrm{Diag}\big(\mathrm{diag}(D_m^T D_m)\big) + B\Big)\alpha - \Big(\rho\big(\mathrm{diag}(Z^T D_m + \mu^T D_m)\big)^T + m^T D_m\Big)\alpha \\
&= (C + C^T)^{-1}\Big(\rho\big(\mathrm{diag}(Z^T D_m + \mu^T D_m)\big)^T + m^T D_m\Big),
\end{aligned} \tag{13}$$
where C = (ρ/2) Diag(diag(D_m^T D_m)) + B. For a matrix M, diag(M) represents a vector whose ith element is the ith diagonal element of M.
For the step of updating Z^{k+1}, it can be calculated by the singular value thresholding operator [39] as follows:
$$Z^{k+1} = \arg\min_{Z}\; \frac{1}{2}\Big\|Z - \big(D_m\,\mathrm{Diag}(\alpha^{k+1}) + \mu^k\big)\Big\|_F^2 + \frac{\lambda}{\rho}\|Z\|_* = \mathcal{D}_{\frac{\lambda}{\rho}}\big(D_m\,\mathrm{Diag}(\alpha^{k+1}) + \mu^k\big), \tag{14}$$
where D_{λ/ρ} is the singular value shrinkage operator, which is defined as follows:
$$\mathcal{D}_{\frac{\lambda}{\rho}}(X) := U \mathcal{D}_{\frac{\lambda}{\rho}}(\Sigma) V^*, \qquad \mathcal{D}_{\frac{\lambda}{\rho}}(\Sigma) = \mathrm{diag}\Big(\Big\{\max\Big(\sigma_i - \frac{\lambda}{\rho},\, 0\Big)\Big\}\Big), \tag{15}$$
where λ/ρ is a positive scalar, UΣV^∗ is the singular value decomposition of a matrix X, σi is the ith positive singular value of X, and max(σi − λ/ρ, 0) equals σi − λ/ρ if σi − λ/ρ ≥ 0 and 0 otherwise. Now, we use UΣV^∗ to denote the singular value decomposition of (D_m Diag(α^{k+1}) + μ^k) and σi to denote its ith positive singular value. Then,
$$Z^{k+1} = U \mathcal{D}_{\frac{\lambda}{\rho}}(\Sigma) V^* = U\,\mathrm{diag}\Big(\Big\{\max\Big(\sigma_i - \frac{\lambda}{\rho},\, 0\Big)\Big\}\Big) V^*. \tag{16}$$
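The singular value shrinkage step of Equations (14)–(16) takes only a few lines of NumPy; this illustrative helper is reused in the full loop given after Algorithm 1:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding D_tau(X): subtract tau from the singular
    values of X and clip at zero (Equation (15))."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```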
The implementation details of the whole reconstruction procedure based on the ADMM algorithm can be summarized in Algorithm 1.
Algorithm 1 Reconstruction Procedure of the Proposed Method
1: Input: The regularization parameter λ, the learnt dictionary pair Dl and Dm, the low resolution difference image, and the initial parameters α^0, Z^0, μ^0, ρ.
2: Preprocessing: Normalize the low resolution difference image, then segment it into patches M = {mi}, i = 1, …, N, with a 7 × 7 patch size and a four-pixel overlap in each direction.
3: Calculate: The structure similarity parameters τNR, τRG and τGN.
4: Repeat:
(1) Update the sparse coefficient α^{k+1} as:
α^{k+1} = (C + C^T)^{−1}(ρ(diag(Z^T D_m + μ^T D_m))^T + m^T D_m).
(2) Update Z^{k+1} as:
Z^{k+1} = U diag({max(σi − λ/ρ, 0)}) V^∗.
(3) Update μ^{k+1} as:
μ^{k+1} = μ^k + D_m Diag(α^{k+1}) − Z^{k+1}.
Repeat the above procedure until the convergence criterion ‖D_m Diag(α) − Z‖_F ≤ ε is met or the pre-specified number of iterations is reached, and obtain the desired sparse coefficient α^∗.
5: Output: The corresponding patch of the high resolution image can be reconstructed as l = D_l α^∗, and the predicted image L can be obtained by merging all patches.
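Putting the pieces together, a minimal NumPy sketch of Algorithm 1 for one concatenated patch is given below. It assumes the block matrices of this section have already been built (e.g., with the build_system sketch above), and it uses the standard scaled-ADMM form (Z − μ) in the α-update rather than the sign convention printed in Equation (13); all names and default values are illustrative:

```python
import numpy as np

def reconstruct_patch(m, Dm, B, Dl, lam=0.15, rho=0.1, n_iter=200, eps=1e-6):
    """ADMM sketch of Algorithm 1 for one concatenated low resolution patch m.
    Dm, B: block matrices of Section 2.5; Dl: block-diagonal high resolution
    dictionary. Returns the reconstructed high resolution patch Dl @ alpha."""
    n = Dm.shape[1]
    alpha = np.zeros(n)
    Z = np.zeros_like(Dm)
    mu = np.zeros_like(Dm)
    C = 0.5 * rho * np.diag(np.diag(Dm.T @ Dm)) + B
    for _ in range(n_iter):
        # alpha-update: closed-form solution of the quadratic subproblem
        rhs = rho * np.diag((Z - mu).T @ Dm) + Dm.T @ m
        alpha = np.linalg.solve(C + C.T, rhs)
        # Z-update: singular value thresholding (Equations (14)-(16))
        U, s, Vt = np.linalg.svd(Dm @ np.diag(alpha) + mu, full_matrices=False)
        Z = U @ np.diag(np.maximum(s - lam / rho, 0.0)) @ Vt
        # scaled dual update
        R = Dm @ np.diag(alpha) - Z
        mu = mu + R
        if np.linalg.norm(R, 'fro') <= eps:   # convergence criterion of Algorithm 1
            break
    return Dl @ alpha
```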
2.6. Strategy for More Bands
The proposed method considers the structure similarity of different bands and uses pairwise comparisons of the NIR, red and green bands. It should be noted that the relationship between n and m is quadratic (m = (n² − n)/2), where n and m represent the number of bands and the number of multi-band regularization terms, respectively. When n increases, the model becomes much more complicated and difficult to solve.
Table 2 shows that adjacent bands have consistent bandwidths. This property indicates that the structures of adjacent bands are more similar than those of the other pairs. It is thus reasonable to use adjacent-band constraints instead of pairwise comparisons of all bands. In this case, the number of band combinations is smaller and m becomes linear in n (m = n − 1). Therefore, to extend the model efficiently to more bands, we adopt the strategy that only considers the structure similarity of adjacent bands. This smaller model (AMCFM-s) can be reformulated as follows:
$$\begin{aligned}
[\alpha_N, \alpha_R, \alpha_G] = \arg\min\; & \sum_{c \in \{N,R,G\}} \Big( \frac{1}{2}\|m_c - D_{m_c}\alpha_c\|_2^2 + \lambda \|D_{m_c}\,\mathrm{Diag}(\alpha_c)\|_* \Big) \\
& + \tau_{NR}\|S D_{l_N}\alpha_N - S D_{l_R}\alpha_R\|_2^2 + \tau_{RG}\|S D_{l_R}\alpha_R - S D_{l_G}\alpha_G\|_2^2.
\end{aligned} \tag{17}$$
[ Table 2 omitted. See PDF. ]
The procedure of solving AMCFM-s is the same as that of AMCFM. Details can be found in Section 2.5.
3. Experiments
The performance of our proposed method is compared to those of the four state-of-the-art methods for evaluation. ESTARFM [7] is a weighting method and CRSU [25] is an unmixing-based method. The other two are dictionary learning methods, named SPSTFM [15] and EBSCDL [17].
All programs are run on a Windows 10 system (Microsoft, Redmond, WA, USA) with an Intel Core i7-6700 3.40 GHz processor (Intel, Santa Clara, CA, USA). All of these fusion algorithms are coded in Matlab 2015a (MathWorks, Natick, MA, USA) except ESTARFM, which is implemented in IDL 8.5 (Harris Geospatial Solutions, Broomfield, CO, USA).
3.1. Experimental Scheme
In this experiment, we use data acquired over the Boreal Ecosystem-Atmosphere Study (BOREAS) southern study area on 24 May, 11 July and 12 August 2001. The products from Landsat ETM+ and MODIS (MOD09GHK) are selected as the source data for fusion. The Landsat image on 11 July 2001 is set as the target image for prediction. All the data are registered for fine geographic calibration.
In the fusion process, we focus on three bands: NIR, red and green. The size of the test images is 300 × 300. Before the test, we up-sample the MODIS images to the same resolution as the Landsat images via bi-linear interpolation because the spatial resolutions of these two source images are different.
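For reproducibility, the bilinear up-sampling of the MODIS bands can be done with any standard resampling routine; scipy's zoom is shown here as one possible choice (the actual preprocessing tool used for the experiments is not specified in the text):

```python
from scipy.ndimage import zoom

def upsample_to_landsat(modis_band, factor):
    """Bi-linear up-sampling of a MODIS band onto the Landsat grid.
    `factor` is the resolution ratio between the two sensors; here the
    co-registered test images are brought to a common 300 x 300 grid."""
    return zoom(modis_band, factor, order=1)   # order=1 -> bi-linear interpolation
```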
3.2. Parameter Settings and Normalization
The parameters of AMCFM are set as follows: the dictionary size is 256, the patch size is 7 × 7, the overlap of patches is 4, the number of training patches is 2000, λ is 0.15, α0 is 0, Z0 and μ0 are both 0, and ρ is 0.1. All the comparative methods keep their original parameter settings.
Normalization can reduce the computation time and affects the fusion results. As a preprocessing step, the high and low resolution images are normalized as follows:
$$L = \frac{L - \bar{L}}{\sigma_L}, \qquad M = \frac{M - \bar{M}}{\sigma_M}, \tag{18}$$
where L̄ and σL are the mean value and standard deviation of image L, and M̄ and σM are those of image M.
3.3. Quality Measurement of the Fusion Results
Several metrics have been used to evaluate the fusion results by different methods. These metrics can be classified into two types, namely the band quality metrics and the global quality metrics.
We employ three assessment metrics, namely the root mean square error (RMSE), average absolute difference (AAD) and correlation coefficient (CC) to assess the performance of the algorithms in each band. The ideal result is 0 for RMSE and AAD, while it is 1 for CC.
Three other metrics are adopted to evaluate the global performance, including relative average spectral error (RASE) [40], Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [41] and Q4 [42]. The mean RMSE (mRMSE) of three bands is also used as a global index. The ideal result is 0 for mRMSE, RASE and ERGAS, while it is 1 for Q4. It should be noted that Q4 is defined for four spectral bands. For our comparisons, the real part of a quaternion is set to 0.
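The band-level metrics are straightforward to compute; the following illustrative helpers define RMSE, AAD and CC as used for the band-wise assessment (the global indices RASE, ERGAS and Q4 follow their cited definitions [40,41,42] and are omitted here):

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error between predicted and reference reflectance."""
    return np.sqrt(np.mean((pred - ref) ** 2))

def aad(pred, ref):
    """Average absolute difference."""
    return np.mean(np.abs(pred - ref))

def cc(pred, ref):
    """Correlation coefficient between the flattened images."""
    return np.corrcoef(pred.ravel(), ref.ravel())[0, 1]
```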
3.4. Results
Table 3, Table 4 and Table 5 show the band-level metric values of these methods. All of these methods can reconstruct the target high resolution image. ESTARFM performs well in the red band. CRSU performs well in the red and green bands of image 2 but, in most cases, produces undesirable results. SPSTFM and EBSCDL have similar results, with EBSCDL producing slightly higher quality in these three images. AMCFM and AMCFM-s produce the best results for the NIR band. Moreover, AMCFM has the best or the second best results in almost all metrics, showing the stability and efficiency of its performance.
[ Tables 3–5 omitted. See PDF. ]
The global metrics of different methods are shown in Table 6, Table 7 and Table 8. AMCFM has the best global performance in all three images, except for Q4 in image 2 and ERGAS in image 3. Image 1 is best captured by our proposed model with a noticeable performance in all four metrics. The outstanding performance of AMCFM is attributed to its improved performance in the NIR band.
[ Tables 6–8 omitted. See PDF. ]
Figure 2 and Figure 3 compare the target (true) Landsat images with the images predicted by ESTARFM, CRSU, SPSTFM, EBSCDL, AMCFM and AMCFM-s. We use the NIR, red and green bands as the red–green–blue composite to display the images. These images are displayed with an ENVI 5.3 (Harris Geospatial Solutions, Broomfield, CO, USA) 2% linear enhancement.
Figure 2. Comparisons between the true image 1 and images reconstructed by different fusion methods. (a) MODIS (Moderate Resolution Imaging Spectroradiometer); (b) true Landsat image; (c) ESTARFM (Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model); (d) CRSU (Class Regularized Spatial Unmixing); (e) SPSTFM (Sparse-representation-based Spatiotemporal Reflectance Fusion Model); (f) EBSCDL (Error-Bound-regularized Semi-Coupled Dictionary Learning); (g) AMCFM (adaptive multi-band constraints fusion model); (h) AMCFM-s.
Figure 3. Comparisons between the true image 2 and images reconstructed by different fusion methods. (a) MODIS; (b) true Landsat image; (c) ESTARFM; (d) CRSU; (e) SPSTFM; (f) EBSCDL; (g) AMCFM; (h) AMCFM-s.
All these fusion algorithms have the capability to reconstruct the main structure and details of the target image. It appears that the colors of the dictionary learning methods are visually more similar to the true Landsat image than the weighting method and unmixing-based method. The details captured by AMCFM are more prominent than those captured by SPSTFM and EBSCDL, which can be observed in the two-times enlarged red box in the images. Overall, our proposed method has the best performance in visualization.
Figure 4, Figure 5 and Figure 6 display the 2D scatter plots of the NIR, red and green bands of image 1. ESTARFM performs slightly better than the other methods in the red band. This result is consistent with the statistics in Table 3. However, in the NIR and green bands, the dictionary learning methods clearly outperform the weighting method and the unmixing-based method because the scatter plots of ESTARFM and CRSU are more dispersed. The scatter plots of our proposed methods, AMCFM and AMCFM-s, are closer to the 1:1 line than those of the other methods, indicating that using the edge information can indeed improve fusion performance, especially in the NIR band. In general, Figure 4, Figure 5 and Figure 6 show that our proposed methods reconstruct images closest to the true Landsat image.
Figure 4. Scatter plots of NIR band of image 1. Abscissa is the true reflectance and ordinate is the predicted reflectance.
Figure 5. Scatter plots of red band of image 1. Abscissa is the true reflectance and ordinate is the predicted reflectance.
Figure 6. Scatter plots of green band of image 1. Abscissa is the true reflectance and ordinate is the predicted reflectance.
4. Discussion
Although the model performs well in the experiments, some questions remain to be discussed. Therefore, more experiments are performed to answer these questions.
4.1. Which Condition Is Better for AMCFM
Table 6, Table 7 and Table 8 show that image 1 best fits our model. This can be explained by the level of details in Table 9. We employ the ratio "Standard Deviation / Mean" to represent the level of details of a target image, as in [19]. It is clear that image 1 has the highest level of details in the NIR band and the most similar levels of details across the three bands. Therefore, more structure similarity information can be captured to improve the fusion results. When there is a large divergence in a certain band, such as the red band in image 2, the results of the dictionary learning methods in this band are unsatisfactory. In this situation, ESTARFM performs better in the red band.
[ Table 9 omitted. See PDF. ]
4.2. Computational Cost
Computational cost is an important factor in practical applications. Table 10 records the running times of all algorithms on image 1. It shows that SPSTFM has the fastest running speed. EBSCDL is time-consuming because the algorithm models the relationship between high and low resolution patches by a mapping function. AMCFM is a little slower than EBSCDL because of the complexity of the ADMM algorithm. However, given the improvement in results, the slightly longer running time is acceptable. To accelerate the computation, an alternative approach can be designed to solve the reconstruction model more efficiently, or the program can be coded with Graphics Processing Unit (GPU) support for parallel execution in future work.
[ Table 10 omitted. See PDF. ]
4.3. Parameters
The parameters of the multi-band constraints determine the importance of the corresponding terms in the fusion model, and the scalar γ directly affects the value of the parameter τ. In order to find a suitable γ, Figure 7 depicts how Q4 changes with respect to γ. Q4 is an index that encapsulates both the spectral and radiometric quality of the fusion result, so we choose it to reflect the fusion performance; a larger value of Q4 means a better fusion result. When γ is smaller than 10, the performance of AMCFM evidently improves as γ increases. However, Q4 hardly increases when γ is larger than 10. Therefore, we set γ to 10.
5. Conclusions and Future Work
In this paper, we have proposed a novel dictionary learning fusion model, called AMCFM. This model considers the structure similarity between bands via adaptive multi-band constraints. These constraints essentially enforce the similarity of the edge information across bands in high resolution patches to improve the fusion performance. Moreover, different from existing dictionary learning models which only emphasize sparsity, we use the nuclear norm as the regularization term to represent both sparsity and correlation. Therefore, our model can reduce the impact of an inefficient dictionary pair and improve the representation ability of the dictionary pair. Compared with four state-of-the-art fusion methods in terms of metrics and visual effects, the experimental results show that our proposed model improves image fusion. Although our model is slower than the other two dictionary learning methods in this empirical analysis because of the complexity of the optimization algorithm, the fusion results obtained from our model are indeed improved. One may wonder whether it is justifiable to achieve a slight improvement at the expense of an increase in computational time. Our argument is that, on a theoretical basis, our model is more reasonable and appealing than SPSTFM and EBSCDL because it capitalizes on the structure information and the correlation of dictionaries for image fusion. Such advantages will become more evident as structure similarity increases.
However, there remains some room for improvement. Firstly, the ℓ2 norm loss term assumes i.i.d. Gaussian noise. We can consider other noise hypotheses, such as an i.i.d. Gaussian mixture or a non-i.i.d. noise structure, to improve the fusion results. Secondly, the computational cost of the proposed method is high because of the complexity of the ADMM algorithm. To reduce the computation time, an alternative approach can be designed to solve the reconstruction model efficiently for practical applications. Finally, to analyze hyperspectral data efficiently, dimension reduction methods might need to be incorporated into the fusion process.
Author Contributions
H.Y., Y.L., F.C., T.F. and J.X. conceived the research and designed the experiments. H.Y., Y.L. and F.C. formulated the model and wrote the paper. H.Y. performed the experiments and interpreted the results with Y.L. and T.F. J.X. collected and processed the experimental data. All authors reviewed and approved the manuscript.
Funding
This research was funded by the earmarked grant CUHK 14653316 of the Hong Kong Research Grant Council.
Acknowledgments
The authors would like to thank the editor and reviewers for the valuable comments and suggestions on improving the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
1. Townshend, J.; Justice, C.; Li, W.; Gurney, C.; McManus, J. Global land cover classification by remote sensing: Present capabilities and future possibilities. Remote Sens. Environ. 1991, 35, 243–255.
2. Loveland, T.R.; Shaw, D.M. Multiresolution land characterization: Building collaborative partnerships. In Gap Analysis: A Landscape Approach to Biodiversity Planning; Scott, J.M., Tear, T.H., Davis, F.W., Eds.; American Society for Photogrammetry and Remote Sensing: Bethesda, MD, USA, 1996; pp. 83–89. ISBN 978-5708303608.
3. Vogelmann, J.E.; Howard, S.M.; Yang, L.; Larson, C.R.; Wylie, B.K.; Van Driel, N. Completion of the 1990s National Land Cover Data Set for the conterminous United States from Landsat Thematic Mapper data and ancillary data sources. Photogramm. Eng. Remote Sens. 2001, 67, 650–662.
4. Pohl, C.; Van Genderen, J.L. Review article multisensor image fusion in remote sensing: Concepts, methods and applications. Int. J. Remote Sens. 1998, 19, 823–854.
5. Chen, B.; Huang, B.; Xu, B. Comparison of spatiotemporal fusion models: A review. Remote Sens. 2015, 7, 1798–1835.
6. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218.
7. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sens. Environ. 2010, 114, 2610–2623.
8. Hilker, T.; Wulder, M.A.; Coops, N.C.; Linke, J.; McDermid, G.; Masek, J.G.; Gao, F.; White, J.C. A new data fusion model for high spatial-and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sens. Environ. 2009, 113, 1613–1627.
9. Li, H.; Manjunath, B.S.; Mitra, S.K. Multisensor image fusion using the wavelet transform. Graph. Models Image Process. 1995, 57, 235–245.
10. Nunez, J.; Otazu, X.; Fors, O.; Prades, A.; Pala, V.; Arbiol, R. Multiresolution-based image fusion with additive wavelet decomposition. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1204–1211.
11. Miao, Q.G.; Shi, C.; Xu, P.F.; Yang, M.; Shi, Y.B. A novel algorithm of image fusion using shearlets. Opt. Commun. 2011, 284, 1540–1547.
12. Moigne, J.L.; Cromp, R.F. Wavelets for remote sensing image registration and fusion. In Wavelet Applications III; Szu, H.H., Ed.; SPIE-The International Society for Optical Engineering: Orlando, FL, USA, 1996; pp. 535–544. ISBN 978-0819421432.
13. Czaja, W.; Doster, T.; Murphy, J.M. Wavelet packet mixing for image fusion and pan-sharpening. In Proceedings of the Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XX, Baltimore, MD, USA, 5–9 May 2014; p. 908803.
14. Deng, C.; Wang, S.; Chen, X. Remote sensing images fusion algorithm based on shearlet transform. In Proceedings of the 2009 International Conference on Environmental Science and Information Application Technology, Wuhan, China, 4–5 July 2009; pp. 451–454.
15. Huang, B.; Song, H. Spatiotemporal reflectance fusion via sparse representation. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3707–3716.
16. Wang, S.; Zhang, L.; Liang, Y.; Pan, Q. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2216–2223.
17. Wu, B.; Huang, B.; Zhang, L. An error-bound-regularized sparse coding for spatiotemporal reflectance fusion. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6791–6803.
18. Wei, J.; Wang, L.; Liu, P.; Song, W. Spatiotemporal fusion of remote sensing images with structural sparsity and semi-coupled dictionary learning. Remote Sens. 2016, 9, 21.
19. Wei, J.; Wang, L.; Liu, P.; Chen, X.; Li, W.; Zomaya, A.Y. Spatiotemporal fusion of modis and landsat-7 reflectance images via compressed sensing. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7126–7139.
20. Zurita-Milla, R.; Kaiser, G.; Clevers, J.G.P.W.; Schneider, W.; Schaepman, M.E. Downscaling time series of MERIS full resolution data to monitor vegetation seasonal dynamics. Remote Sens. Environ. 2009, 113, 1874–1885.
21. Wu, M.; Wu, C.; Huang, W.; Niu, Z.; Wang, C.; Li, W.; Hao, P. An improved high spatial and temporal data fusion approach for combining Landsat and MODIS data to generate daily synthetic Landsat imagery. Inf. Fusion 2016, 31, 14–25.
22. Amorós-López, J.; Gómez-Chova, L.; Alonso, L.; Guanter, L.; Zurita-Milla, R.; Moreno, J.; Camps-Valls, G. Multitemporal fusion of Landsat/TM and ENVISAT/MERIS for crop monitoring. Int. J. Appl. Earth Obs. Geoinf. 2013, 23, 132–141.
23. Doxani, G.; Mitraka, Z.; Gascon, F.; Goryl, P.; Bojkov, B.R. A spectral unmixing model for the integration of multi-sensor imagery: A tool to generate consistent time series data. Remote Sens. 2015, 7, 14000–14018.
24. Zhang, W.; Li, A.; Jin, H.; Bian, J.; Zhang, Z.; Lei, G.; Qin, Z.; Huang, C. An enhanced spatial and temporal data fusion model for fusing Landsat and MODIS surface reflectance to generate high temporal Landsat-like data. Remote Sens. 2013, 5, 5346–5368.
25. Xu, Y.; Huang, B.; Xu, Y.; Cao, K.; Guo, C.; Meng, D. Spatial and Temporal Image Fusion via Regularized Spatial Unmixing. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1362–1366.
26. Xue, J.; Leung, Y.; Fung, T. A Bayesian data fusion approach to spatio-temporal fusion of remotely sensed images. Remote Sens. 2017, 9, 1310.
27. Shi, C.; Liu, F.; Li, L.; Jiao, L.; Duan, Y.; Wang, S. Learning interpolation via regional map for pan-sharpening. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3417–3431.
28. Glasner, D.; Bagon, S.; Irani, M. Super-resolution from a single image. In Proceedings of the 12th IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, 29 September–2 October 2009; pp. 349–356.
29. Khateri, M.; Ghassemian, H. A Self-Learning Approach for Pan-sharpening of Multispectral Images. In Proceedings of the IEEE International Conference on Signal and Image Processing Applications, Kuching, Malaysia, 12–14 September 2017; pp. 199–204.
30. Zhu, Z.Q.; Yin, H.; Chai, Y.; Li, Y.; Qi, G.; Zhu, Z.Q. A novel multi-modality image fusion method based on image decomposition and sparse representation. Inf. Sci. 2017, 432, 516–529.
31. Mousavi, H.S.; Monga, V. Sparsity-based color image super resolution via exploiting cross channel constraints. IEEE Trans. Image Process. 2017, 26, 5094–5106.
32. Zhao, J.; Hu, H.; Cao, F. Image super-resolution via adaptive sparse representation. Knowl.-Based Syst. 2017, 124, 23–33.
33. Wang, J.; Lu, C.; Wang, M.; Li, P.; Yan, S.; Hu, X. Robust face recognition via adaptive sparse representation. IEEE Trans. Cybern. 2014, 44, 2368–2378.
34. Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006, 54, 4311.
35. Obozinski, G.; Bach, F. Trace Lasso: A trace norm regularization for correlated designs. In Proceedings of the Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada, Spain, 12–15 December 2011; pp. 2187–2195.
36. Lin, Z.; Chen, M.; Ma, Y. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv, 2010; arXiv:1009.5055.
37. Lin, Z.; Liu, R.; Su, Z. Linearized alternating direction method with adaptive penalty for low-rank representation. In Proceedings of the Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada, Spain, 12–15 December 2011; pp. 612–620.
38. Yang, J.; Yuan, X. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comput. 2013, 82, 301–329.
39. Cai, J.F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
40. Ranchin, T.; Wald, L. Fusion of high spatial and spectral resolution images: The ARSIS concept and its implementation. Photogramm. Eng. Remote Sens. 2000, 66, 49–61.
41. Gevaert, C.M.; García-Haro, F.J. A comparison of STARFM and an unmixing-based algorithm for Landsat and MODIS data fusion. Remote Sens. Environ. 2015, 156, 34–44.
42. Alparone, L.; Baronti, S.; Garzelli, A.; Nencini, F. A global quality measurement of pan-sharpened multispectral imagery. IEEE Geosci. Remote Sens. Lett. 2004, 1, 313–317.
1Department of Applied Mathematics, College of Sciences, China Jiliang University, Hangzhou 310018, China
2Department of Geography and Resource Management, The Chinese University of Hong Kong, Hong Kong, China
3Institute of Future Cities, The Chinese University of Hong Kong, Hong Kong, China
*Author to whom correspondence should be addressed.
© 2018. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Remote sensing is an important means to monitor the dynamics of the earth surface. It is still challenging for single-sensor systems to provide spatially high resolution images with high revisit frequency because of technological limitations. Spatiotemporal fusion is an effective approach to obtain remote sensing images high in both spatial and temporal resolutions. Though dictionary learning fusion methods appear to be promising for spatiotemporal fusion, they do not consider the structure similarity between spectral bands in the fusion task. To capitalize on the significance of this feature, a novel fusion model, named the adaptive multi-band constraints fusion model (AMCFM), is formulated to produce better fusion images in this paper. This model considers structure similarity between spectral bands and uses the edge information to improve the fusion results by adopting adaptive multi-band constraints. Moreover, to address the shortcomings of the ℓ1 norm, which only considers the sparsity structure of dictionaries, our model uses the nuclear norm, which balances sparsity and correlation by producing an appropriate coefficient in the reconstruction step. We perform experiments on real-life images to substantiate our conceptual arguments. In the empirical study, the near-infrared (NIR), red and green bands of Landsat Enhanced Thematic Mapper Plus (ETM+) and Moderate Resolution Imaging Spectroradiometer (MODIS) are fused and the prediction accuracy is assessed by both metrics and visual effects. The experiments show that our proposed method performs better than state-of-the-art methods. It also sheds light on future research.