1. Introduction
Humans can see in a wide range of lighting conditions because the human eye constantly adapts to the broad range of natural luminance values in the environment. However, standard digital cameras typically fail to capture images with sufficient dynamic range because of the limited dynamic range of their sensors. To alleviate this issue, high-dynamic-range (HDR) imaging has been developed to improve the range of color and contrast in captured images [1]. Given a series of low-dynamic-range (LDR) images captured at different exposures, an HDR image is produced by merging these LDR images.
Traditional methods for producing HDR images [2,3] are based on the assumption that the images are globally registered, i.e., there is no camera or object motion between images with different exposure values. However, misalignments are inevitable in the presence of foreground motion and small camera motions, so these methods usually suffer from ghosting artifacts. Many solutions [4,5,6,7,8,9,10,11,12,13,14,15] have been developed to overcome this limitation. HDR reconstruction methods relying on pixel rejection [4,5,6,7,8] simply discard pixels in misaligned regions as outliers. Other methods rely on registration [9,10,11,12,13,14,15,16] to reconstruct HDR images by searching for the best matching regions in LDR images.
Based on the recent development of convolutional neural networks (CNNs), the performance of HDR imaging using CNNs [17,18,19,20,21,22] has improved significantly. Eilertsen et al. [22] proposed an autoencoder network to produce HDR images from only a single image. Endo et al. [17] proposed to synthesize LDR images captured with different exposures (i.e., bracketed images) and then reconstruct an HDR image by merging the synthesized images. However, methods relying on a single input LDR image cannot handle highly contrastive scenes, since the reconstruction is an ill-posed problem. Kalantari et al. [19] attempted to handle the misalignment problem of dynamic scenes by using a classical optical flow algorithm [23] for alignment. However, the classical optical flow algorithm produces large alignment errors, which cause artifacts in misaligned regions. In addition, the classical optical flow algorithm requires significant computational time. Although Wu et al. [20] formulated HDR imaging as an image translation problem without alignment, they failed to reconstruct the details of an HDR image in occluded regions. Yan et al. [21] proposed an attention-guided deep network that suppresses misaligned features during the merging process to avoid ghosting artifacts. However, their method [21] still suffers from ghosting artifacts in the presence of camera motion or foreground motion, because it omits alignment between LDR images.
In this paper, we propose a novel end-to-end flow-based HDR method, which includes a pyramid inter-attention module (PIAM) and a dual excitation block (DEB) for the alignment and merging processes, respectively. Our method is the first to jointly estimate the correspondence between LDR images and reconstruct HDR images. Specifically, during the alignment process, we align the non-reference features to the reference feature by leveraging the PIAM, as shown in Figure 1. Furthermore, we use the DEB to recalibrate the LDR features spatially and channel-wise, boosting the representation of features for generating ghost-free HDR images in the merging process. The main contributions of this paper can be summarized as follows:
- We propose a novel CNN-based framework for ghost-free HDR imaging by leveraging pyramid inter-attention module (PIAM) which effectively aligns LDR images.
- We propose a dual excitation block (DEB), which recalibrates features both spatially and channel-wise by highlighting the informative features and excluding harmful components.
- Extensive experiments on HDR datasets [11,19,24] demonstrate that the synergy between the two aforementioned modules enables our framework to achieve state-of-the-art performance.
2. Related Work 2.1. HDR Imaging without Alignment
We first review HDR imaging algorithms that assume the input LDR images are globally registered. Early work presented by Mann and Picard [2] attempted to combine differently exposed images to obtain a single HDR image. Debevec and Malik [3] recovered the camera response function using differently exposed photographs taken with a static camera. Unger et al. [25] designed an HDR imaging system using a highly programmable camera unit and multi-exposure images. Khan et al. [26] computed the probability that each pixel belongs to the static background by iteratively weighting the contribution of each pixel. Jacobs et al. [5] removed ghosting artifacts by addressing brightness changes. Pece and Kautz [7] proposed a motion map computed from median threshold bitmaps for each image. Heo et al. [8] assigned weights to emphasize well-exposed pixels using a Gaussian-weighted distance. Zhang and Cham [4] detected movement using quality measures based on image gradients to generate a weighting map. Lee et al. [27] and Oh et al. [28] explored rank minimization in HDR deghosting to detect motion and reconstruct HDR images. However, these solutions are impractical because they are not able to handle moving objects or camera motion.
2.2. HDR Imaging with Alignment
To solve the misalignment of dynamic scenes for HDR imaging, some approaches align LDR images prior to reconstructing an HDR image by applying dense correspondence algorithms (i.e., optical flow). Bogoni [10] aligned LDR images via warping using local motion vectors, which are estimated based on an optical flow algorithm. Kang et al. [9] exploited the optical flow algorithm after performing exposure correction between LDR images. Jinno and Okuda [29] estimated dense correspondences based on a Markov random field model. Gallo et al. [14] proposed a fast non-rigid registration method for input images with small motion between them. However, these approaches cannot handle ghosting artifacts in the presence of large foreground motion, because they use a simple merging process for combining aligned LDR images.
There have been many attempts to integrate alignment and HDR reconstruction into a joint optimization process. Sen et al. [11] proposed a patch-based energy-minimization method that integrates alignment and reconstruction into a joint optimization process. Hu et al. [15] decomposed the optimization problem by using image alignment based on brightness and gradient consistency. Hafner et al. [12] proposed an energy-minimization approach that simultaneously calculates HDR irradiance and displacement fields. Despite these improvements in HDR imaging, such methods still have limitations when large motions and saturation exist in LDR images.
2.3. Deep-Learning-Based Methods
Recently, several deep CNN-based methods for HDR imaging [17,19,20,21,22] have been proposed. First, Eilertsen et al. [22] proposed a method for reconstructing HDR images from single LDR images using an autoencoder network. The method proposed by Endo et al. [17] predicts multiple LDR images with different exposures from a single LDR image, then reconstructs a final HDR image by merging the predicted images using a deep learning network. These methods have a limitation in that they use only a single LDR image, which makes it difficult to synthesize the details of an HDR image.
Kalantari et al. [19] attempted to solve the misalignment of LDR images by using an off-the-shelf optical flow algorithm [23]. They then merged the aligned LDR images to obtain an HDR image using CNNs. However, the optical flow algorithm [23] requires a large amount of computational time. Wu et al. [20] proposed a non-flow-based translation network that can generate plausible details from LDR inputs and produce ghost-free HDR images. Yan et al. [21] proposed an attention network to suppress undesirable features caused by misalignment or saturation, avoiding ghosting artifacts. Although the methods discussed above represent remarkable advances in HDR imaging, they [20,21] cannot fully exploit the information from all LDR images. In contrast to these recent works [19,20,21], we incorporate a simple yet effective alignment network into the HDR imaging network to reconstruct details of HDR images by aligning LDR features.
2.4. Optical Flow
Alignment between LDR images is a key factor for generating ghost-free HDR images. Optical flow algorithms can be used to perform alignment by finding correspondences between the images. As a classical optical flow algorithm, SIFT-flow [23] is an optimization-based algorithm for finding the optical flow between images. However, optimization-based methods require large computational times. Inspired by the success of CNNs, FlowNet [30] was the first end-to-end learning approach for optical flow. This method estimates the dense optical flow between two images based on a U-Net autoencoder architecture [31]. FlowNet 2.0 [32] stacks several basic FlowNet models for iterative refinement and significantly improves accuracy. Recently, PWC-Net [33] was proposed to warp features in each feature pyramid in a coarse-to-fine approach and achieve state-of-the-art performance with a lightweight framework. However, these deep-learning-based flow estimation methods cannot handle large object motions.
2.5. Attention Mechanisms
Attention mechanisms have provided significant performance improvements for many computer vision tasks, such as image classification [34], semantic segmentation [35], and image generation [36,37]. In the works by Zhang et al. [36] and Wang et al. [34], self-attention mechanisms were proposed for modeling long-range dependencies to solve the problem of limited local receptive fields that many deep generative models have. For stereoscopic super-resolution tasks, Wang et al. [38] proposed a parallax-attention module for finding stereo correspondence. By leveraging a parallax-attention mechanism, they found reliable correspondences at smaller computational cost than other stereo matching networks [39,40,41]. Inspired by attention mechanisms, we effectively find correspondences between LDR images captured in dynamic scenarios for reconstructing HDR images. We then align the LDR features using these correspondences to fully exploit them. Although both our method and that of Yan et al. [21] use the term "attention", there is a significant difference between the two. The attention network proposed by Yan et al. [21] focuses on highlighting meaningful features for HDR imaging. In contrast, our method uses inter-attention maps to align LDR images so that they can be fully exploited for HDR imaging.
3. Proposed Method 3.1. Overview
An overview of the proposed method is presented in Figure 2. Given a set of LDR images $I_1, I_2, \ldots, I_k$ of a dynamic scene sorted by their exposure values, the proposed method aims to reconstruct a ghost-free HDR image $H_r$ that is aligned to the reference LDR image $I_r$. First, as a preprocessing step, we apply gamma correction [19,20,21] to map each LDR image $I_i$ into the HDR domain according to its exposure time $t_i$ (i.e., $J_i = I_i^{\gamma} / t_i$, where we set $\gamma$ to 2.2 in this work). Similar to previous approaches [19,20,21], the input to the proposed method is a concatenation of $I_i$ and $J_i$, where $i = 1, 2, 3$. After preprocessing, we feed each input into the feature extraction network, which is composed of several combinations of convolution and rectified linear unit (ReLU) layers, resulting in $E_i$.
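As a concrete illustration, the gamma-correction preprocessing can be sketched in a few lines of NumPy. The function name and channel-last array layout are illustrative assumptions, not taken from the authors' code:

```python
import numpy as np

GAMMA = 2.2  # gamma value stated in the paper

def preprocess(ldr, exposure_time, gamma=GAMMA):
    """Map an LDR image I (values in [0, 1]) into the HDR domain via
    J = I^gamma / t, then concatenate I and J along the channel axis
    to form the 6-channel network input."""
    hdr_domain = np.power(ldr, gamma) / exposure_time
    return np.concatenate([ldr, hdr_domain], axis=-1)

# Example: a 4x4 RGB LDR image with exposure time t = 0.5
ldr = np.random.rand(4, 4, 3)
net_input = preprocess(ldr, 0.5)  # shape (4, 4, 6)
```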
To exploit the features $E_o$, $o \in \{1, 3\}$, from the other LDR images (i.e., non-reference images), the alignment network warps the features $E_1, E_3$ by leveraging the proposed pyramid inter-attention module (PIAM). The reference-aligned features and the reference feature are then merged to synthesize the details of the target HDR image. Although the alignment network aligns these features, alignment errors still exist in the case of homogeneous regions or repetitive patterns. To handle this problem, we propose a dual excitation block (DEB) that recalibrates features, highlighting informative features and excluding harmful ones. Finally, dilated residual dense blocks (DRDBs) are used to learn hierarchical features for effective HDR imaging.
3.2. Alignment Network
Since the features from the LDR images are not aligned, we align them before merging so that they can be fully exploited. When camera motion or a moving object exists in a scene, the alignment process is a key factor in reconstructing an HDR image. Unlike the method of Kalantari et al. [19], which uses a classical optical flow algorithm [23], we propose a novel alignment network, called the PIAM. Before describing the details of the PIAM, we first illustrate the inter-attention module (IAM).
3.2.1. Inter Attention Module
The IAM is inspired by self-attention mechanisms [34,36], which estimate feature similarities for all pixels in a single image. While the self-attention mechanism finds self-similarity within a single image, the proposed IAM calculates the inter-similarity between LDR images for every pixel, which is used to align non-reference features toward the reference feature. In this section, we discuss the mechanism of the proposed IAM for the training and testing phases. Given a feature pair $F_r, F_o \in \mathbb{R}^{C \times H \times W}$, both features are reshaped to $\mathbb{R}^{C \times HW}$. As shown in Figure 3, both pass through $1 \times 1$ convolutions for the source ($\theta_s$) and the target ($\theta_t$). By multiplying these two feature maps, a correlation map $C_{o \to r} \in \mathbb{R}^{HW \times HW}$ is generated such that $C_{o \to r} = (\theta_t F_r)^{T} (\theta_s F_o)$. This correlation map is softmax-normalized to generate a soft inter-attention map $A_{o \to r} \in \mathbb{R}^{HW \times HW}$.
As the soft inter-attention map $A_{o \to r}$ is softmax-normalized, it represents the matching probability for all spatial positions. However, in optical flow, there is only one matching point for each pixel. To ensure that the inter-attention map represents only one matching point, a hard inter-attention map $B_{o \to r} \in \mathbb{R}^{HW \times HW}$ is generated as follows:
$$B_{o \to r}(i,j) = \begin{cases} 1, & \text{if } j = \arg\max_{j'} A_{o \to r}(i,j'), \\ 0, & \text{otherwise.} \end{cases}$$
With the hard inter-attention map $B_{o \to r}$, we can warp the other feature $F_o$ toward the reference one $F_r$ using matrix multiplication, resulting in $F'_o = B_{o \to r} F_o$. Finally, it is reshaped such that $F'_o \in \mathbb{R}^{C \times H \times W}$.
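The soft/hard inter-attention computation and the warp can be sketched in NumPy as follows. To keep the sketch minimal, the $1 \times 1$ convolutions $\theta_s, \theta_t$ are replaced by the identity, so this illustrates only the correlation/argmax/warp arithmetic, not the authors' full implementation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def iam_warp(F_r, F_o):
    """Warp a non-reference feature F_o toward the reference F_r.
    F_r, F_o: (C, H, W). The 1x1 convolutions theta_t, theta_s of the
    paper are omitted (identity) to keep the sketch minimal."""
    C, H, W = F_r.shape
    fr = F_r.reshape(C, H * W)
    fo = F_o.reshape(C, H * W)
    corr = fr.T @ fo                      # correlation map C_{o->r}: (HW, HW)
    A = softmax(corr, axis=1)             # soft inter-attention map A_{o->r}
    B = np.zeros_like(A)                  # hard map: one match per reference pixel
    B[np.arange(H * W), A.argmax(axis=1)] = 1.0
    warped = (B @ fo.T).T                 # F'_o = B_{o->r} F_o
    return warped.reshape(C, H, W), A

F_r = np.random.rand(2, 3, 3)
F_o = np.random.rand(2, 3, 3)
warped, A = iam_warp(F_r, F_o)
```

Because each row of the hard map selects exactly one source position, every pixel of the warped feature is an exact copy of some pixel of `F_o`.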
For training the IAM, we take the following additional steps. First, we generate an additional soft inter-attention map $A_{r \to o}$. We can train the IAM using a photometric loss in an unsupervised manner, as described in Section 3.4. The photometric loss requires forward-warping results obtained using the soft inter-attention map. However, the occlusion problem, which originates from forward warping using an inter-attention map, is inevitable. An occluded region causes the network to estimate unreliable correspondences when using a photometric loss for unsupervised flow estimation [42].
To ensure that the alignment network estimates reliable correspondences, we generate a validation mask for training the network. As suggested in [38], pixels in occluded regions typically have small weights in the inter-attention map $A_{r \to o}$. We design the validation mask $V_{r \to o} \in \mathbb{R}^{HW}$ for the reference image, which can be obtained as follows:
$$V_{r \to o}(j) = \begin{cases} 1, & \text{if } \sum_{i=1}^{HW} A_{r \to o}(i,j) > \tau, \\ 0, & \text{otherwise,} \end{cases}$$
where $HW$ is the product of the height and width of the feature $F_r$, and $\tau$ is a threshold, which we set to 0.1 empirically. In the same manner, the validation mask $V_{o \to r}$ can be generated. The validation masks $V_{r \to o}$ and $V_{o \to r}$ are used in the photometric loss for training the IAM in an unsupervised manner, as described in Section 3.4.
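A minimal sketch of the validation mask, assuming the soft inter-attention map is given as an $HW \times HW$ array with rows indexed by $i$:

```python
import numpy as np

def validation_mask(A, tau=0.1):
    """Validation mask from a soft inter-attention map A (HW x HW).
    A position j is valid when the total attention mass it receives
    over all positions i exceeds tau (occluded pixels receive little)."""
    return (A.sum(axis=0) > tau).astype(np.float32)  # shape (HW,)

A = np.zeros((4, 4))
A[:, 0] = 0.25              # every row puts all its mass on position 0
mask = validation_mask(A)   # -> [1., 0., 0., 0.]
```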
3.2.2. Pyramid Inter-Attention Module
Finding global correspondences using the IAM for a large image requires a large amount of memory, as described in Table 1. To alleviate this issue, we propose the PIAM, which consists of a global IAM and a local IAM, based on coarse-to-fine approaches for estimating correspondences [23,33]. As illustrated in Figure 4, the feature pair $E_r, E_o \in \mathbb{R}^{C \times H \times W}$ passes through two stages of feature extraction networks. The first feature extraction network outputs the feature pair $F^l_r, F^l_o \in \mathbb{R}^{C \times H \times W}$, whose size is the same as the resolution of $E_r, E_o$. The second network, which consists of $n$ stride-2 convolutions, outputs the feature pair $F^g_r, F^g_o \in \mathbb{R}^{C \times (H/2^n) \times (W/2^n)}$.
The global IAM first estimates $B^g_{o \to r} \in \mathbb{R}^{(HW/2^{2n}) \times (HW/2^{2n})}$, which represents the global correspondences, using the down-sampled features $F^g_r, F^g_o$. While other deep-learning methods using coarse-to-fine approaches warp the features $F^l_r, F^l_o$ using up-sampled correspondences, we directly use the global correspondences $B^g_{o \to r}$. To match the size, we generate $f^l_o \in \mathbb{R}^{C \cdot 2^{2n} \times (H/2^n) \times (W/2^n)}$ by performing feature-grouping on the feature $F^l_o \in \mathbb{R}^{C \times H \times W}$, as shown in Figure 4. The feature-grouping operation first divides the feature $F^l_o \in \mathbb{R}^{C \times H \times W}$ into a grid of patches of shape $\mathbb{R}^{C \times 2^n \times 2^n}$, reshapes each patch to the size $\mathbb{R}^{C \cdot 2^{2n} \times 1 \times 1}$, and then combines these patches to form $f^l_o \in \mathbb{R}^{C \cdot 2^{2n} \times (H/2^n) \times (W/2^n)}$. The coarse, globally aligned feature $F'^l_o$ is generated by performing feature-regrouping, which is the inverse operation of feature-grouping, on the warped first-level feature $B^g_{o \to r} f^l_o$.
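Feature-grouping is essentially a space-to-depth rearrangement. Below is a NumPy sketch under that interpretation; the exact channel ordering is our assumption (a PyTorch version could use `F.pixel_unshuffle`, whose ordering may differ from the paper's):

```python
import numpy as np

def feature_grouping(F, n):
    """Rearrange (C, H, W) into (C * 4^n, H/2^n, W/2^n) by folding each
    2^n x 2^n spatial patch into the channel dimension. Feature-regrouping
    (the inverse) reverses these reshapes and transposes."""
    C, H, W = F.shape
    s = 2 ** n
    f = F.reshape(C, H // s, s, W // s, s)
    f = f.transpose(0, 2, 4, 1, 3)               # C, s, s, H/s, W/s
    return f.reshape(C * s * s, H // s, W // s)

F = np.arange(16).reshape(1, 4, 4)
g = feature_grouping(F, 1)   # shape (4, 2, 2); g[:, 0, 0] is the top-left 2x2 patch
```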
Finally, we can find the local correspondences between the feature pair $F^l_r, F'^l_o$. To reduce the computational memory, the local IAM divides both features $F^l_r, F'^l_o$ into grids of $k \times k$ patches and then performs alignment with local patches to find local correspondences. We divide a feature into a grid such that $F^l_r = \{F^{l,1}_r, \ldots, F^{l,N}_r\}$, where $N = (H/k) \cdot (W/k)$ is the number of patches. It should be noted that $F^{l,n}$ denotes the $n$-th patch of $F^l$. The local IAM takes each input pair $F^{l,n}_r, F'^{l,n}_o$ and outputs the local correspondence $B^{l,n}_{o \to r}$. With these local correspondences, we finally generate the warped feature $E'_o$.
3.3. Merging Network
After aligning the other features $E_1, E_3$ to the reference feature $E_2$ using the alignment network, we obtain the warped features $E'_1, E'_3$. Despite the alignment process based on the PIAM, alignment errors that the PIAM cannot handle may still exist. To eliminate the harmful effect of features in regions of misalignment or saturation, we designed a novel network that incorporates the dual excitation block (DEB) (Figure 5) and the dilated residual dense block (DRDB) [21] in the merging process. Ghost-free HDR images are then generated by reducing artifacts caused by misalignment while preserving details during merging.
3.3.1. Dual Excitation Block (DEB)
In contrast to other non-flow-based deep HDR methods [20,21], which only fuse the misaligned features $E_1, E_2, E_3$, we fuse features warped using the PIAM. As shown in Figure 5, the input of the DEB is a fusion of the warped features and the reference feature. Feature fusion is defined as follows:
$$G_{fuse} = \mathrm{Concat}(E'_1, E_2, E'_3),$$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.
The DEB recalibrates the fused feature $G_{fuse} \in \mathbb{R}^{C \times H \times W}$ both spatially and channel-wise by multiplying it by its excitations. Excitation allocates spatial and channel-wise weights to the fused feature to suppress harmful features and encourage informative ones for generating ghost-free HDR images. The configuration of the DEB is illustrated in Figure 5. After $G_{fuse}$ passes through several convolutions followed by ReLU functions and a sigmoid function, the DEB generates the dual excitations, and the fused feature is recalibrated by multiplying it by these excitations. Unlike the attention of Yan et al. [21], which represents only spatial weighting, the DEB outputs both spatial and channel-wise excitations to refine the fused features.
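The fusion and recalibration arithmetic can be sketched as follows. The conv+ReLU+sigmoid stacks that produce the excitations in the paper are replaced here by simple weight projections (`w_sp`, `w_ch` are hypothetical stand-ins), so this only illustrates how the two excitations rescale the fused feature:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_excitation(G_fuse, w_sp, w_ch):
    """Recalibrate a fused feature G_fuse (C, H, W) both spatially and
    channel-wise. w_sp (C,) and w_ch (C, C) stand in for the learned
    conv layers that produce the excitations in the paper."""
    # spatial excitation: one sigmoid weight per spatial position (H, W)
    e_sp = sigmoid(np.tensordot(w_sp, G_fuse, axes=([0], [0])))
    # channel excitation: one sigmoid weight per channel (C,)
    e_ch = sigmoid(w_ch @ G_fuse.mean(axis=(1, 2)))
    return G_fuse * e_sp[None, :, :] * e_ch[:, None, None]

rng = np.random.default_rng(0)
E1w, E2, E3w = (rng.standard_normal((2, 4, 4)) for _ in range(3))
G_fuse = np.concatenate([E1w, E2, E3w], axis=0)   # Concat(E'_1, E_2, E'_3)
out = dual_excitation(G_fuse, rng.standard_normal(6), rng.standard_normal((6, 6)))
```

Since both excitations lie in (0, 1), the recalibration can only attenuate feature magnitudes, which is how harmful components are suppressed.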
3.3.2. Dilated Residual Dense Block (DRDB)
The DRDB consists of dilated convolutions that enlarge the receptive field to acquire additional contextual information. The residual and dense connections in the DRDB enable us to use all of the hierarchical features contained in the fused features. The details of the DRDB are described in [21].
3.4. Training Losses
The proposed method consists of two tasks: alignment and HDR generation. We designed a loss function for training the alignment task that finds the correspondences between LDR images. Based on the procedure described in [19,20,21], we also use the HDR reconstruction loss. The overall loss function is defined as follows:
$$L = \lambda L_{align} + L_{HDR},$$
where $\lambda$ controls the weight of the alignment loss within the overall loss; it was empirically set to 0.5.
3.4.1. Alignment Loss
Since there are no labeled dense correspondences between the LDR images in an HDR dataset, we train the PIAM in an unsupervised manner. We introduce a photometric loss for training the alignment network, following [38,43]. A photometric loss works for images with the same exposure value; however, in our case, the LDR images have different exposures. Therefore, we equalize the brightness values, as suggested in [19]. Brightness constancy is maintained by raising the exposure of darker images to that of brighter images. For example, if $I_1$ is darker than $I_2$, their exposures are matched such that $M_1 = \mathrm{clip}(I_1 (t_2/t_1)^{1/\gamma})$ and $M_2 = I_2$, where $\mathrm{clip}$ ensures the output range is $[0, 1]$, and $t_1$ and $t_2$ are the exposure times of $I_1$ and $I_2$, respectively.
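This exposure matching follows directly from the formula; a NumPy sketch with an illustrative function name:

```python
import numpy as np

GAMMA = 2.2

def match_exposure(I_dark, t_dark, t_bright, gamma=GAMMA):
    """Raise the exposure of the darker image to that of the brighter one:
    M = clip(I_dark * (t_bright / t_dark)^(1/gamma)), values kept in [0, 1]."""
    return np.clip(I_dark * (t_bright / t_dark) ** (1.0 / gamma), 0.0, 1.0)

I1 = np.array([0.1, 0.5, 0.9])
M1 = match_exposure(I1, t_dark=1.0, t_bright=8.0)  # bright pixels saturate at 1.0
```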
With the exposure-matched pairs $M_s, M_t$, the PIAM can be trained using the soft inter-attention maps $A_{s \to t}$ in an unsupervised manner by minimizing the photometric error in the valid region $V_{s \to t}$. To train the global IAM using $M_s, M_t$, we define the global alignment loss such that:
$$L^{global}_{s \to t} = \frac{\sum_p \left\| \left( A^{g}_{s \to t}\, m_s(p) - m_t(p) \right) \odot V^{g}_{s \to t}(p) \right\|_1}{\sum_p \left\| V^{g}_{s \to t}(p) \right\|_1},$$
where $s$ denotes a source, $t$ denotes a target, $\odot$ denotes element-wise multiplication, and $m$ is generated by feature-grouping on $M$. The global IAM first warps $M_s$ toward $M_t$ globally, generating $M'_s$. We can then train the local IAM using the local alignment loss as follows:
$$L^{local}_{s \to t} = \frac{\sum_n \sum_p \left\| \left( A^{l,n}_{s \to t}\, M'^{\,n}_s(p) - M^{n}_t(p) \right) \odot V^{l,n}_{s \to t}(p) \right\|_1}{\sum_n \sum_p \left\| V^{l,n}_{s \to t}(p) \right\|_1},$$
where $M^{n}$ denotes the $n$-th patch of $M$. In this work, we set the reference $r$ to 2 and the others $o$ to 1 or 3. Therefore, the overall alignment loss for training the PIAM is defined as follows:
$$L_{align} = L^{global}_{1 \to 2} + L^{global}_{2 \to 1} + L^{global}_{3 \to 2} + L^{global}_{2 \to 3} + L^{local}_{1 \to 2} + L^{local}_{2 \to 1} + L^{local}_{3 \to 2} + L^{local}_{2 \to 3}.$$
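Each term above reduces to a validity-weighted L1 error between a warped source and its target. A simplified NumPy sketch of that shared core, assuming the attention warp has already been applied:

```python
import numpy as np

def photometric_loss(warped, target, valid):
    """Validity-masked L1 photometric error: mean absolute difference
    between the attention-warped source and the target, restricted to
    the region where the validation mask equals 1."""
    err = np.abs(warped - target) * valid
    return err.sum() / max(valid.sum(), 1e-8)

target = np.arange(8.0)
valid = np.zeros(8)
valid[:4] = 1.0                 # only the first half is a valid region
warped = target.copy()
warped[:4] += 2.0               # constant error of 2 inside the valid region
loss = photometric_loss(warped, target, valid)
```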
3.4.2. HDR Reconstruction Loss
Since HDR images are usually displayed after tonemapping, the proposed HDR imaging network estimates an HDR image $H$ that is evaluated after tonemapping with the $\mu$-law described in [19]:
$$T(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)},$$
where $\mu$ is a parameter that controls the amount of compression; in this work, we set $\mu$ to 5000. This tonemapping function is differentiable, which facilitates training our model in an end-to-end manner. The loss function between the estimated HDR image $H$ and the ground truth $H_{gt}$ is defined as follows:
$$L_{HDR} = \left\| T(H) - T(H_{gt}) \right\|_1.$$
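Both the μ-law tonemapping and the reconstruction loss are one-liners; a NumPy sketch:

```python
import numpy as np

MU = 5000.0  # compression parameter used in the paper

def mu_law(H, mu=MU):
    """Differentiable mu-law tonemapping: T(H) = log(1 + mu*H) / log(1 + mu)."""
    return np.log1p(mu * H) / np.log1p(mu)

def hdr_loss(H, H_gt, mu=MU):
    """L1 loss between the tonemapped estimate and ground truth."""
    return np.abs(mu_law(H, mu) - mu_law(H_gt, mu)).mean()
```

Note that `mu_law` maps 0 to 0 and 1 to 1, strongly compressing the bright end of the range in between.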
4. Experiments 4.1. Implementation Details
All convolutional filters in the feature extraction network are $3 \times 3$ filters followed by ReLU functions. In the PIAM, the second-level feature extraction network consists of three convolutions for $8\times$ down-sampling. For the local IAM, we set the size of the local patch to $32 \times 32$ for both training and testing. The growth rate was set to 32 in the DRDB. Our network was implemented using PyTorch on a PC with an Nvidia RTX 2080 GPU. The network was trained using the Adam optimizer [44] with $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The HDR imaging network was trained with a batch size of one and a learning rate of $1 \times 10^{-5}$. Data augmentation was performed by flipping the images and swapping color channels. During training, the input images were randomly cropped to a size of $256 \times 256$ pixels. Training was completed after 200,000 iterations, when additional iterations could not provide any further improvements in alignment or HDR imaging. All methods, including ours, were implemented to produce $640 \times 960$ HDR images in the experiments.
4.2. Experimental Settings
4.2.1. Datasets
The proposed HDR imaging network was trained on Kalantari's HDR dataset [19] according to the process presented in previous works [19,20,21]. Kalantari's HDR dataset provides ground truth HDR images, which facilitate training an HDR imaging network in a supervised manner. It consists of 74 sets for training and 15 sets for testing. Each set consists of three LDR images captured with different exposure values ($-2, 0, +2$ or $-3, 0, +3$), and the ground truth HDR image is aligned to the reference image (middle exposure). The details of constructing the ground truth HDR image are discussed in [19]. After training our network on Kalantari's HDR dataset [19], we compared the performance of our HDR imaging method with that of other state-of-the-art methods by testing on this dataset both qualitatively and quantitatively. We also used Sen's dataset [11] and Tursun's dataset [24] for visual comparisons only, since they do not contain ground truth HDR images.
4.2.2. Evaluation Metrics
We compared our method with various state-of-the-art methods quantitatively on Kalantari's dataset [19], because ground truth HDR images are available for this dataset. The evaluation metrics selected for measuring the quality of HDR images were PSNR-μ, PSNR-M, PSNR-L, and HDR-VDP-2. We computed the PSNR-μ values between the generated HDR images and the ground truth HDR images after tonemapping using the μ-law. Additionally, evaluation metrics based on Matlab's tonemap function (PSNR-M) and the linear domain (PSNR-L) were adopted. To focus on the visual quality of HDR images, we also measured HDR-VDP-2 values [45].
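For reference, PSNR-μ amounts to tonemapping both images with the μ-law before computing a standard PSNR; a NumPy sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Standard peak signal-to-noise ratio in dB."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def psnr_mu(H, H_gt, mu=5000.0):
    """PSNR-mu: PSNR computed after mu-law tonemapping both images."""
    tonemap = lambda x: np.log1p(mu * x) / np.log1p(mu)
    return psnr(tonemap(H), tonemap(H_gt))
```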
4.3. Comparison with the State-of-the-Art Methods
We compare our method with recent state-of-the-art methods, including hand-crafted [11,15,28] and CNN-based methods [17,19,20,21,22], on Kalantari et al.'s dataset [19] in Section 4.4 and on datasets without ground truth images [11,24] in Section 4.5. For a fair comparison, we used the same environment (i.e., training dataset and implementation details) for the CNN-based methods [17,19,20,21,22]. All results were obtained using the code provided by the original authors.
4.4. Experiments on Kalantari et al.’s Dataset
4.4.1. Qualitative Comparison
Figure 6 presents visual comparisons of HDR images for the proposed method and the state-of-the-art methods on the testing set of the Kalantari HDR dataset [19]. The method proposed by Oh et al. [28] cannot detect object motion, resulting in large ghosting artifacts due to the misalignment. In particular, the results of Oh et al. [28] are strongly influenced by LDR images with low exposure values. HDR imaging methods using single images, such as TMO [17] and HDRCNN [22], cannot recover the details of the ground truth HDR images, since they only use a single reference image. Among the CNN-based methods for fusing LDR images, Wu et al. [20] and Yan et al. [21] do not conduct alignment prior to merging; therefore, they suffer from ghosting artifacts caused by misalignment. The method proposed by Yan et al. [21] generates more plausible results than that of Wu et al. [20] because it uses attention maps, a mechanism similar to our spatial excitations. Although the method proposed by Kalantari et al. [19] conducts alignment prior to merging, it produces saturated results because it cannot suppress harmful features during the merging process. In contrast, our method avoids such artifacts and yields more plausible results than any other method, since we conduct alignment and recalibrate features by leveraging the PIAM and DEB.
4.4.2. Quantitative Comparison
We measured the performance of recent state-of-the-art methods and our method for a quantitative evaluation on the Kalantari HDR dataset [19]. We tested 15 images from the testing dataset, measured all of the evaluation metrics described above, and calculated average values. The results are presented in Table 2. In terms of all evaluation metrics, our method yields the best HDR imaging results. This is mainly because our method can fully exploit all LDR features through alignment and recalibrate the features, highlighting informative features and excluding harmful components.
4.5. Experiments on Datasets without Ground Truth
Qualitative Comparison
Figure 7 presents visual comparisons of HDR images for the proposed method and the state-of-the-art methods on datasets without ground truth [11,24]. Oh et al.'s [28] method cannot detect large object motion, resulting in large ghosting artifacts in Figure 7a. The methods relying on single images [17,22] and Kalantari et al.'s method [19] exhibit similar color distortions on both datasets. Wu et al.'s method [20] yields color distortions and ghosting artifacts. The method proposed by Yan et al. [21] fails to preserve color consistency and generates ghosting artifacts due to misalignment. In contrast, our method generates visually plausible results, preserving details and color consistency without ghosting artifacts.
4.6. Analysis
4.6.1. Ablation Studies
To verify the effectiveness of our network architecture, we conducted ablation studies to quantify the effects of the proposed pyramid inter-attention module (PIAM) and dual excitation block (DEB). Table 3 compares the performance of HDR imaging networks with different components in terms of the target evaluation metrics. It can be observed that all of the evaluation metrics decrease when the PIAM or DEB is not applied in our network (i.e., the baseline network). As shown in Figure 8, the PIAM finds reliable correspondences between LDR features. Conducting alignment using the PIAM increases performance because it enables the network to exploit well-aligned LDR features, providing more precise information to the merging network. Furthermore, the DEB also increases the performance of HDR imaging because it recalibrates features both spatially and channel-wise, refining the fused features to be more informative and boosting their representation power for reconstructing an HDR image. With both the PIAM and DEB added to the baseline network, our method achieves the best performance.
4.6.2. Matching Accuracy Comparison
To demonstrate the superiority of our alignment process using the PIAM for HDR imaging, we compared our method with the conventional optical flow algorithm [23] and a deep-learning-based flow estimation method [33] by measuring the accuracy of these correspondence methods. To measure matching accuracy, we compared the structural difference between warped images and reference LDR images on the testing set of Kalantari et al.'s dataset. Since the intensity of a reference-warped LDR image differs from that of the reference LDR image, we compared SSIM values. Figure 8 presents a qualitative comparison of the alignment results for our method, SIFT-flow [23], and PWC-Net [33]. As shown in Figure 8, PWC-Net fails to find large correspondences between LDR images because it is designed to cover small displacements. Although SIFT-flow finds large correspondences, it cannot preserve the details around the boundary of the moving object in the warped image. In contrast to these methods, our method yields more reliable correspondences. As shown in Table 4, the proposed PIAM yields more accurate alignment than the conventional optical flow algorithm [23] used in Kalantari et al.'s method [19], resulting in enhanced performance for HDR imaging.
4.6.3. Run Time Comparison
Table 5 presents run time comparisons between the various methods. All algorithms were executed on a PC with an i7-4790K (4.0 GHz) CPU, 28 GB of RAM, and an Nvidia RTX 2080 GPU. It should be noted that the optimization-based HDR method [28] and the HDR method [19] using the classical optical flow algorithm [23] were executed on the CPU. Our method is slower than the other deep-learning-based methods, except for Kalantari et al.'s method, which uses the conventional optical flow algorithm. Although the PIAM in our method increases the run time, it is still approximately 60 times faster than Kalantari et al.'s method. It should be noted that the other methods that are faster than ours do not contain alignment processes, resulting in ghosting artifacts. Even though we conduct an alignment process similar to Kalantari et al.'s, our method finds correspondences between LDR images more efficiently and effectively.
4.6.4. Cellphone Example
We also tested our method on cellphone images of both static and dynamic scenes to verify its practicality. For dynamic scenes, we tested different types of motion, such as camera motion and object motion. The HDR results are presented in Figure 9. The LDR images were captured using a Samsung Galaxy S20 with exposure values of −4, −2, and 0, which differ from the training settings of the proposed method. Even with these different settings, our network produces plausible results in various scenarios, demonstrating its robustness.
5. Conclusions
We developed a novel end-to-end approach for reconstructing ghost-free HDR images of dynamic scenes. The proposed PIAM effectively aligns LDR features so that all of them can be exploited for HDR reconstruction, even when large motion exists. Additionally, the DEB recalibrates the aligned features by multiplying spatial and channel-wise excitations to boost their representational power. Ablation studies clearly demonstrated the effectiveness of the PIAM and DEB in our model. Finally, we demonstrated that the proposed method is robust to dynamic scenes with large foreground motion and outperforms state-of-the-art methods on standard benchmarks by a significant margin.
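The dual-excitation idea summarized above can be illustrated with a minimal NumPy sketch. The gating choices here (global average pooling with a sigmoid for the channel excitation, and a 1×1 channel projection for the spatial excitation) are our illustrative assumptions, not the paper's exact trained architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_excitation(feat, w_ch, w_sp):
    """Recalibrate a feature map of shape (C, H, W) both channel-wise and
    spatially, then multiply the two excitations into the features.

    w_ch: (C, C) weights for the channel excitation (hypothetical stand-in).
    w_sp: (C,) weights for the 1x1 spatial excitation (hypothetical stand-in).
    In the actual DEB these would be learned during training.
    """
    # Channel excitation: squeeze spatial dims, gate each channel in (0, 1).
    ch_desc = feat.mean(axis=(1, 2))                              # (C,)
    ch_gate = sigmoid(w_ch @ ch_desc)                             # (C,)
    # Spatial excitation: 1x1 projection across channels, gate each pixel.
    sp_gate = sigmoid(np.tensordot(w_sp, feat, axes=([0], [0])))  # (H, W)
    # Multiply both excitations into the aligned features.
    return feat * ch_gate[:, None, None] * sp_gate[None, :, :]
```

Because both gates lie in (0, 1), the recalibration can only attenuate features, letting the network suppress unreliable aligned responses while preserving shape.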
| Module | Size of Tensors | H = 256, W = 256, n = 3, k = 32 | H = 640, W = 960, n = 3, k = 32 |
|---|---|---|---|
| IAM | (HW) × (HW) | 4,294,967,296 | 377,487,360,000 |
| PIAM | (HW/2^(2n)) × (HW/2^(2n)) + (H/k) × (W/k) × (k) × (k) | 16 + 65,536 | 9,600 + 614,400 |
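Two of the terms in the table above can be checked numerically: the dense (HW) × (HW) inter-attention map of the IAM, and the per-patch (H/k) × (W/k) × k × k term of the PIAM. A quick sanity check:

```python
# Verify the attention-map sizes reported in the tensor-size table.
def full_attention_size(H, W):
    """Entries in a dense (HW) x (HW) inter-attention map."""
    return (H * W) ** 2

def patch_term_size(H, W, k):
    """(H/k) x (W/k) patches, each with a k x k map."""
    return (H // k) * (W // k) * k * k

print(full_attention_size(256, 256))   # 4,294,967,296
print(full_attention_size(640, 960))   # 377,487,360,000
print(patch_term_size(256, 256, 32))   # 65,536
print(patch_term_size(640, 960, 32))   # 614,400
```

The several-orders-of-magnitude gap between the dense map and the patch-wise terms is what makes the PIAM tractable at full resolution.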
| Method | PSNR-μ | PSNR-M | PSNR-L | HDR-VDP-2 |
|---|---|---|---|---|
| Sen [11] | 40.924 | 30.572 | 37.934 | 55.145 |
| Hu [15] | 32.021 | 24.982 | 30.610 | 55.104 |
| Oh [28] | 26.151 | 21.051 | 25.131 | 45.526 |
| TMO [17] | 8.612 | 24.384 | 7.904 | 43.394 |
| HDRCNN [22] | 14.387 | 24.503 | 13.704 | 46.790 |
| Kalantari [19] | 41.170 | 30.705 | 40.226 | 59.909 |
| Wu [20] | 39.345 | 31.159 | 38.782 | 59.296 |
| Yan [21] | 42.017 | 31.798 | 40.978 | 61.104 |
| Ours | 43.212 | 32.415 | 41.697 | 62.481 |
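For reference, PSNR-μ in the tables above is conventionally computed in the μ-law tonemapped domain, whereas PSNR-L is computed on the linear HDR values. A minimal sketch, assuming the commonly used μ = 5000 (the exact tonemapping constant is not stated in this excerpt):

```python
import numpy as np

MU = 5000.0  # mu-law compression parameter (assumed; a common choice)

def mu_law(x, mu=MU):
    """Mu-law tonemapping of a linear HDR image with values in [0, 1]."""
    return np.log(1.0 + mu * x) / np.log(1.0 + mu)

def psnr(pred, target, peak=1.0):
    """PSNR in dB; applied to linear values this corresponds to PSNR-L."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def psnr_mu(pred_hdr, gt_hdr):
    """PSNR computed in the tonemapped domain (PSNR-mu)."""
    return psnr(mu_law(pred_hdr), mu_law(gt_hdr))
```

Because the μ-law curve compresses highlights, PSNR-μ weights errors in dark regions more heavily than PSNR-L, which is why the two columns can rank methods differently.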
| Configuration | PSNR-μ | PSNR-M | PSNR-L | HDR-VDP-2 |
|---|---|---|---|---|
| Baseline network | 38.514 | 31.475 | 38.021 | 58.457 |
| +PIAM | 41.824 | 31.595 | 39.945 | 60.184 |
| +DEB | 41.524 | 31.518 | 40.211 | 60.858 |
| +PIAM +DEB | 43.212 | 32.415 | 41.697 | 62.481 |
| Metric | W/o Alignment | SIFT-Flow [23] | PWC-Net [33] | PIAM |
|---|---|---|---|---|
| SSIM | 0.4326 | 0.6342 | 0.6042 | 0.6614 |
| Method | Oh * [28] | HDRCNN [22] | Kalantari * [19] | Wu [20] | Yan [21] | Ours |
|---|---|---|---|---|---|---|
| Time (s) | 65.153 | 0.245 | 40.518 | 0.215 | 0.301 | 0.594 |

* Executed on the CPU.
Author Contributions
Conceptualization, J.C. (Jaehoon Cho); Data curation, J.Y.; Formal analysis, S.C.; Investigation, J.C. (Jaehoon Cho); Methodology, S.C.; Project administration, J.C. (Jihwan Choe); Software, J.C. (Jihwan Choe) and J.Y.; Supervision, K.S.; Validation, W.S.; Visualization, W.S.; Writing-original draft, S.C.; Writing-review & editing, K.S. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by an Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00056, To create AI systems that act appropriately and effectively in novel situations that occur in open worlds).
Conflicts of Interest
The authors declare no conflict of interest.
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
This paper proposes a novel approach to high-dynamic-range (HDR) imaging of dynamic scenes that eliminates ghosting artifacts in HDR images in the presence of severe misalignment (large object or camera motion) in the input low-dynamic-range (LDR) images. Recent non-flow-based methods suffer from ghosting artifacts in the presence of large object motion. Flow-based methods face the same issue because their optical flow algorithms yield large alignment errors. To eliminate ghosting artifacts, we propose a simple yet effective alignment network for solving the misalignment. The proposed pyramid inter-attention module (PIAM) aligns LDR features by leveraging inter-attention maps. Additionally, to boost the representation of aligned features in the merging process, we propose a dual excitation block (DEB) that recalibrates each feature both spatially and channel-wise. Exhaustive experimental results demonstrate the effectiveness of the proposed PIAM and DEB, achieving state-of-the-art performance in terms of producing ghost-free HDR images.