1. Introduction
Accurate and useful visual perception is conventionally achieved using RGB and depth sensors. Depth sensors, due to their small form factor, low cost, and low power consumption, are very popular in many fields of research such as robotics [1,2,3], medical imaging [4,5], augmented reality, and consumer electronics. However, they typically have lower spatial resolution than conventional imaging modalities such as RGB, leading to information loss that can be overcome with accurate super-resolution techniques. To recover this high resolution, existing techniques leverage correlations between the sharp high-frequency texture edges of RGB images and the low-resolution edge discontinuities of depth images. Generic super-resolution solutions often prove inadequate for depth data because of their limited ability to incorporate the unique characteristics and complexities inherent in depth maps [6]. To this end, the task of depth super-resolution (DSR) is to recover a high-resolution depth map from its lower-resolution counterpart.
The depth super-resolution literature can be broadly categorized into three approaches: filtering, optimization, and learning-driven strategies. Filtering-driven DSR [7,8,9,10,11,12] relies on constructing filters from neighboring pixels in the LR depth map and the HR guidance image; a downside of this approach is the creation of artifacts and errors when the scenes to be super-resolved are complex. Optimization-driven DSR [13,14,15,16] converts super-resolution into an optimization problem in which a cost function between the LR depth map and the HR depth map is minimized; this approach depends on selecting an appropriate cost function and is highly sensitive to that choice. Finally, learning-driven approaches [17,18,19,20] have in recent years made use of deep learning techniques, which have quickly become the de facto solution of choice in the field of DSR.
In GDSR, most approaches rely on fusion after the feature extraction stage: the unique per-modality features are fused together to create an HR depth map. Both strong feature extractors and strong fusion modules are important for this task, as inadequate feature extractors fail to provide the distinct information on crucial features needed for sharp reconstructions. The RGB image is responsible for providing structure so that the result does not suffer from depth bleeding. This is not trivial, as over-transfer may occur and structures from the RGB image that are irrelevant to depth can be transferred to the depth map; for example, given a picture printed on the cover of a book, the network should identify that its textures are irrelevant to depth reconstruction. Equally important to the feature extractors are the fusion modules that combine the two branches and refine the available information to estimate clear depths.
Existing works face two major limitations: (1) weak feature extractors that fail to capture the distinct and complementary characteristics of the RGB and depth modalities, and (2) naive fusion strategies, such as simple concatenation or addition, which result in modality-specific artifacts, including over-transfer of irrelevant RGB features and insufficient depth-specific enhancement. These issues often lead to blurring, depth bleeding, and misaligned structures in the final depth maps. Our approach, IGAF (see Figure 1 for an overview), systematically addresses these challenges (the code is available online). Our main contributions are as follows:
We propose the incremental guided attention fusion (IGAF) model, which surpasses existing works on the DSR task across all tested benchmark datasets and resolutions.
We propose the IGAF module, a flexible and adaptive attention fusion strategy that effectively fuses multi-modal features by creating weights from both modalities and then applying a two-step cross fusion.
We propose the filtered wide-focus block (FWB), a strong feature extractor composed of two modules: the feature extractor (FE) and wide-focus (WF). The FE, with the help of channel attention, highlights relevant feature channels in the feature volumes, while the WF, using varying dilation rates in the convolution layers of each branch, creates multi-receptive-field spatial information that integrates global-resolution features with dynamic receptive fields to better highlight textures and edges. The combination of the two forms a general-purpose feature extractor specifically tailored towards DSR.
2. Literature Review
Depth Super-Resolution Architectures. DSR techniques are broadly categorized into those that use RGB or grayscale images as guidance and those that do not. Non-guided DSR techniques [17,21,22] try to solve the task using only an LR depth map. This simplifies data acquisition (syncing different modalities is not required, leading to smaller datasets) and the model itself, since no sensor fusion is needed for an additional stream. This simplicity, however, comes at the cost of over-smoothed edges, especially on the contours of objects, as well as blurring and distortion effects in the super-resolved depth maps.
GDSR techniques address over-smoothed edges by using structural and textural information from RGB or grayscale images. Additional techniques are needed to prevent the over-transfer of information from the guidance stream and to retain only the features that are relevant. Ref. [23] propose a fast model utilizing the high-frequency information of the guidance RGB stream using octave convolutions, but fuse the information from the two branches by a simple concatenation. Ref. [24], on the other hand, propose to fuse information between the two modalities through a symmetric uncertainty incorporated into their system. Ref. [25] use a joint implicit function representation to learn the interpolation weights and values for the HR depth simultaneously. Ref. [26] employ knowledge distillation such that the guidance stream is only needed during training, simplifying the model at test time. Ref. [27] utilize bridges to fuse information during multi-task learning; the two tasks in their system are depth super-resolution and monocular depth estimation. Additional novel techniques include [26,28,29].
Attention Feature Fusion in Depth Super-Resolution. Feature fusion techniques are crucial for multi-modal data processing. They range from a simple addition or concatenation of multiple features to complex feature processing modules. Ref. [30] place a hierarchical attention fusion module in the generator of a generative adversarial network for this task. Ref. [31] employ an attention fusion strategy to adaptively utilize information from both modalities by first enhancing the features and then using an attention mechanism to fuse the two branches. Ref. [32] also propose a two-step approach, in which a weighted attention fusion followed by high-frequency reconstruction generates the resulting high-resolution depth image. Ref. [20] use channel attention combined with reconstruction in their proposed module, whilst [33] build a fusion module consisting of a feature enhancement and a feature re-calibration step.
Existing works fail to effectively leverage both modalities to create fusion weights that accurately propagate relevant features. This limitation often results in the over-transfer of RGB features or insufficient depth-specific enhancement, leading to artifacts such as depth bleeding and texture misalignment. Our approach (see Figure 2) overcomes these drawbacks through a flexible and more powerful attention-based mechanism. By creating weights from one modality to iteratively guide the fusion with the other, we ensure that only the most relevant features are propagated. Unlike previous methods that rely on simple concatenation or addition for fusion, we introduce an incremental guided attention fusion (IGAF) module that performs cross-modal attention in iterative steps. This process eliminates the over-transfer of RGB features while emphasizing critical depth-specific information. Specifically, we first create a naive fusion of the RGB and depth modalities (an element-wise multiplication), then create structural guidance for the depth modality by learning a set of attention weights from the naive fusion for the RGB image (the first spatial attention fusion (SAF) block; see Figure 3). We then use this intermediate fusion as structural guidance for the depth image to create a better fusion output than existing methods (the second SAF block).
3. Methodology
3.1. Problem Statement
Consider a dataset $\mathcal{D} = \{ (I_{RGB}^{i}, D_{LR}^{i}, D_{HR}^{i}) \}_{i=1}^{M}$, where $I_{RGB}^{i}$ represents the RGB images, $D_{LR}^{i}$ the corresponding LR depth maps, and $D_{HR}^{i}$ the corresponding HR depth maps, for each RGB image $I_{RGB}^{i}$. Each LR depth map is $D_{LR} \in \mathbb{R}^{1 \times \frac{H}{s} \times \frac{W}{s}}$ and each HR depth map is $D_{HR} \in \mathbb{R}^{1 \times H \times W}$, where $H$ and $W$ are the spatial resolutions of the images. The 1 in $D_{LR}$ (and $D_{HR}$) and the 3 in $I_{RGB} \in \mathbb{R}^{3 \times H \times W}$ refer to the number of input channels, while $s$ is the scale factor between the HR and LR depth maps. The model estimates an HR depth map $\hat{D}_{HR} \in \mathbb{R}^{1 \times H \times W}$ by first upsampling $D_{LR}$ to $D_{up} \in \mathbb{R}^{1 \times H \times W}$ using bicubic interpolation such that the dimensions of $D_{up}$ and $D_{HR}$ match. A formal representation is:

$$\hat{D}_{HR} = f_{\theta}(I_{RGB}, D_{up}) + D_{up}, \tag{1}$$

where $f_{\theta}$ is the learned function that maps $I_{RGB}$ and $D_{up}$ to the predicted HR depth map $\hat{D}_{HR}$, and $\theta$ represents the learned parameters. The addition operation in Equation (1) represents the global residual connection shown in Figure 2.

3.2. Model Architecture
We follow the conventional architecture of a dual-stream model, as depicted in Figure 2. Our model takes two inputs, the RGB guidance image and the upsampled LR depth map. First, each modality is processed by a convolutional layer followed by a LeakyReLU activation. This is followed by three IGAF modules, which extract and fuse the multi-modal features from the two input modalities. After the fusion modules, the depth is refined through our refinement block, and a global skip connection adds the upsampled LR depth map to the final feature representation to produce the final prediction. The predicted depth map is calculated as
$$\hat{D}_{HR} = \mathrm{Refine}\Big( \mathrm{IGAF}_{3}\big( \mathrm{IGAF}_{2}\big( \mathrm{IGAF}_{1}\big( \phi(\mathrm{Conv}(I_{RGB})),\ \phi(\mathrm{Conv}(D_{up})) \big) \big) \big) \Big) + D_{up}, \tag{2}$$

where $\phi$ is the LeakyReLU activation and $\mathrm{Refine}(\cdot)$ is the depth refinement block. The depth refinement consists of 3 feature extractor modules (see Section 3.2.1) and a convolution-LeakyReLU-convolution stack of layers.

3.2.1. The IGAF Module
Each IGAF module processes two inputs and provides two outputs (Figure 3). For the last IGAF module, we only propagate the depth stream forward into the depth refinement block and ignore the second output.
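To make the overall data flow concrete, the following is a minimal PyTorch sketch of the dual-stream skeleton of Section 3.2 together with the two-input/two-output module interface described here. It is an illustration under our own assumptions (the channel width, kernel sizes, simplified refinement head, and the trivial `IGAFModule` stand-in are ours), not the released implementation; the real module is detailed in the rest of this section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IGAFModule(nn.Module):
    """Trivial stand-in for the real IGAF module: takes the RGB and depth
    feature streams and returns two updated streams (two inputs, two outputs)."""

    def __init__(self, c):
        super().__init__()
        self.mix = nn.Conv2d(2 * c, 2 * c, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        f_rgb_out, f_dep_out = self.mix(torch.cat([f_rgb, f_dep], dim=1)).chunk(2, dim=1)
        return f_rgb_out, f_dep_out


class IGAFNet(nn.Module):
    """Dual-stream skeleton: per-modality conv + LeakyReLU stem, three fusion
    modules, a depth refinement head, and a global residual from the upsampled
    LR depth map (Equation (1))."""

    def __init__(self, c=64, num_modules=3):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, c, 3, padding=1)
        self.depth_stem = nn.Conv2d(1, c, 3, padding=1)
        self.fusion = nn.ModuleList(IGAFModule(c) for _ in range(num_modules))
        # Simplified refinement head (the paper uses FE modules plus a
        # conv-LeakyReLU-conv stack).
        self.refine = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c, 1, 3, padding=1))

    def forward(self, rgb, depth_lr):
        # Bicubic upsampling so the LR depth matches the RGB resolution.
        depth_up = F.interpolate(depth_lr, size=rgb.shape[-2:],
                                 mode="bicubic", align_corners=False)
        f_rgb = F.leaky_relu(self.rgb_stem(rgb), 0.2)
        f_dep = F.leaky_relu(self.depth_stem(depth_up), 0.2)
        for block in self.fusion:
            f_rgb, f_dep = block(f_rgb, f_dep)
        # Only the depth stream of the last module is used; its RGB output is
        # ignored. The global skip connection adds the upsampled LR depth back.
        return self.refine(f_dep) + depth_up


# Usage: x16 super-resolution of a 16x16 depth map guided by a 256x256 RGB image.
net = IGAFNet()
pred = net(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 16, 16))  # (1, 1, 256, 256)
```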
At first, each stream passes through a two-piece feature extraction block, the FWB, consisting of a general feature extractor (FE) and a wide-focus (WF) block. The FE processes the input using a convolution–LeakyReLU–convolution stack of layers. Next, a channel attention module focuses only on the relevant channels while reducing the influence of the less important or noisy ones. Finally, we employ an element-wise addition between the input of the module and the output of the channel attention module, followed by another convolutional layer and a skip connection that is global within the module, which propagates the global structure of the depth forward through the model. For simplicity in the explanations and equations, we treat N in the figures as 1, although during training N = 10 was used. The design of the FE module was chosen empirically after multiple training runs, alternating between channel attention, spatial attention, and a combination of both. For the FE module, we have:
$$F_{\mathrm{FE}} = x + \mathrm{Conv}\big( x + \mathrm{CA}_{\sigma}(\mathrm{Conv}(\phi(\mathrm{Conv}(x)))) \big), \tag{3}$$
where $F_{\mathrm{FE}}$ represents the feature maps output by the FE module, $x$ is the feature map input, $\mathrm{CA}_{\sigma}$ denotes the channel attention operation, and $\sigma$ is the sigmoid activation used for its gating. The WF module is represented as:

$$F_{\mathrm{WF}} = \mathrm{Drop}\Big( \phi\Big( \mathrm{Conv}\Big( \textstyle\sum_{d} \mathrm{Drop}\big( \phi(\mathrm{Conv}_{d}(x)) \big) \Big) \Big) \Big), \tag{4}$$
where $F_{\mathrm{WF}}$ is the feature maps output by the WF module, $x$ is the feature map input, $\mathrm{Conv}_{d}$ is a convolution with dilation rate $d$, $\mathrm{Drop}$ is dropout, and $\phi$ is the LeakyReLU activation. See Figure 4. Wide-focus is an efficient feature extractor first introduced by [34] for medical image segmentation and has shown great promise in extracting multi-scale features from feature representations. It contains three branches, each with a different dilation rate for the convolution kernels, followed by an activation layer and a dropout layer to prevent overfitting. After the element-wise addition of the branches, another convolutional layer extracts features from the gradually increased receptive fields of the dilated convolution layers to aggregate the extracted multi-resolution features. This is again followed by an activation layer and a dropout layer to avoid overfitting. A WF block is used after every FE module to aggregate the multi-resolution hierarchical features extracted in each layer. After the FE module, the RGB stream uses a skip connection to propagate the extracted features to the next IGAF module, further propagating the global scene structure forward within the model. We observed from our experiments that not placing a skip connection after the WF module or at later stages is an effective strategy for learning better scene structure, as forwarding shallower features helps propagate high-frequency structure better through the model, which can be verified through our ablations in Section 6.
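As an illustration of the two sub-blocks, below is a hedged PyTorch sketch of our reading of the FE and WF descriptions above; it is not the authors' released code, and the kernel sizes, the channel-attention reduction ratio, and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureExtractor(nn.Module):
    """FE sketch: conv-LeakyReLU-conv, sigmoid-gated channel attention,
    addition with the block input, a final conv, and a block-level skip."""

    def __init__(self, c, reduction=4):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)
        # Squeeze-and-excitation style channel attention (assumed form).
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())

    def forward(self, x):
        z = self.conv2(F.leaky_relu(self.conv1(x), 0.2))
        z = self.ca(z) * z              # highlight relevant channels
        return x + self.conv3(x + z)    # addition with input + block-level skip


class WideFocus(nn.Module):
    """Wide-Focus sketch: three parallel convs with increasing dilation rates,
    summed element-wise and aggregated by a final conv; activation and dropout
    follow each stage."""

    def __init__(self, c, p_drop=0.1):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (1, 2, 3))
        self.agg = nn.Conv2d(c, c, 3, padding=1)
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        y = sum(self.drop(F.leaky_relu(b(x), 0.2)) for b in self.branches)
        return self.drop(F.leaky_relu(self.agg(y), 0.2))
```

Stacking `FeatureExtractor` followed by `WideFocus` then corresponds to one FWB, with the RGB-stream skip connection taken from the `FeatureExtractor` output, as described above.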
The first fusion in the IGAF module (see Figure 3) is an element-wise multiplication of both modalities. Its result (a) creates intermediate feature weights and (b) is used in an element-wise weighted addition between the two intermediate features of the SAF block. Similarly, the extracted features from the RGB stream (a) create intermediate feature weights and (b) form the second component of the weighted addition. The addition can be seen as adding features from a joint representation of the RGB and LR depth images weighted by their common features. This helps the model focus both on high-level semantic structures in the image through the depth features and on high-frequency features from the RGB images, while the weighting itself is learned via backpropagation. For each component, weights are extracted and applied in a crosswise fashion, i.e., weights from one component are applied to the other component, resulting in a spatial attention fusion block. This allows the model to learn, across both modalities, which features should have limited influence on the output, resulting in a smoother output depth map. The weights are learnable and created via two-layer MLPs. A formalized expression of the SAF block is:
$$\mathbf{w}_{1} = \sigma\big(W_{12}(W_{11} x_{1} + b_{11}) + b_{12}\big), \tag{5}$$

$$\mathbf{w}_{2} = \sigma\big(W_{22}(W_{21} x_{2} + b_{21}) + b_{22}\big), \tag{6}$$

$$F_{\mathrm{SAF}} = \mathbf{w}_{1} \odot x_{2} + \mathbf{w}_{2} \odot x_{1}, \tag{7}$$

where $x_{1}$ and $x_{2}$ are the two inputs of the SAF block, the $W$ terms represent the MLP weights, the $b$ terms the biases, $\odot$ denotes element-wise multiplication, and $\sigma$ is the sigmoid activation. Equations (5) and (6) show the two MLP layers used to create the weights, which are applied cross-wise to the inputs, as seen in Equation (7). The output of the first SAF block passes through a convolutional layer for joint feature processing and is then used as input for the second SAF block. This convolutional layer extracts shared features from the two fused modalities. The second SAF block works in a similar manner to the first, but now fuses the joint features from the two modalities with the depth features. In summary, the first SAF block fuses together the extracted features from the RGB stream and the naive feature fusion obtained by the element-wise multiplication, while the second SAF block fuses together the result of the first SAF block, after the convolutional layer, and the output of the FWB of the depth stream. This fusion is incremental in nature, as we iteratively combine RGB and depth features in multiple steps to create a cross-modal fusion of attributes, leading to simultaneously processing both structure and depth.
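To make the cross-wise weighting and the incremental fusion concrete, the following is a hedged PyTorch sketch of one possible SAF block and of the fusion path inside an IGAF module. It reflects our own assumptions rather than the released implementation: the two-layer MLPs are realized as 1x1 convolutions with a LeakyReLU between them, and the channel width is arbitrary.

```python
import torch
import torch.nn as nn


class SAF(nn.Module):
    """Spatial attention fusion sketch: a two-layer MLP (1x1 convs) per input
    produces sigmoid weights that are applied cross-wise (Equations (5)-(7))."""

    def __init__(self, c):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Conv2d(c, c, 1), nn.LeakyReLU(0.2),
                                 nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.w_a, self.w_b = mlp(), mlp()

    def forward(self, a, b):
        # Weights derived from one input modulate the other input.
        return self.w_a(a) * b + self.w_b(b) * a


class IGAFFusion(nn.Module):
    """Incremental fusion path: naive element-wise product of the two streams,
    SAF with the RGB features, a joint conv, then SAF with the depth features."""

    def __init__(self, c):
        super().__init__()
        self.saf1, self.saf2 = SAF(c), SAF(c)
        self.joint = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        naive = f_rgb * f_dep                         # initial naive fusion
        guided = self.saf1(naive, f_rgb)              # structural guidance from RGB
        return self.saf2(self.joint(guided), f_dep)   # fuse with the depth stream


# Usage on dummy 64-channel feature maps.
fuse = IGAFFusion(64)
out = fuse(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```

Combining this fusion path with the FWB feature extraction of both streams, and returning both updated streams, would yield one full IGAF module as sketched earlier.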
4. Experiments
We test our model on four benchmark datasets commonly used to compare proposed models for the DSR task: the NYU v2 [35], Middlebury [36,37], Lu [38], and RGB-D-D [23] datasets.
We only train on the NYU v2 dataset and do not fine-tune the model on the others; our results on the remaining datasets are therefore zero-shot predictions that demonstrate the generalization ability of our model. NYU v2 contains 1449 pairs of RGB and depth images. The first 1000 images are used to train the model and the remaining 449 are used for evaluation. For the Middlebury dataset, we use the provided 30 RGB and depth image pairs, and for the Lu dataset we use the 6 pairs, following previous works [23,24,25,38] to report results consistent with other methods. For RGB-D-D, we use 405 RGB and depth image pairs following [23].
Implementation Details: We run all our experiments on one RTX 3090 GPU using PyTorch 2.0.1. The initial learning rate is set to 0.00025 and is halved at each milestone of the MultiStepLR scheduler. The milestones are set every 25 epochs, with the final one at epoch 150, out of a total of 200 epochs. A batch size of 1 is used to train the model. We use the Adam optimizer and report all our results with the Root Mean Square Error (RMSE) metric. We use the following loss to train our model:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \big| D_{gt}(i) - \hat{D}_{HR}(i) \big|, \tag{8}$$
where $N$ is the number of pixels, $D_{gt}$ is the ground truth depth map and $\hat{D}_{HR}$ is the predicted depth map. During training, we use randomly cropped patches of the HR image. The LR depth maps are simulated by bicubic downsampling, which is consistent with other approaches using the same datasets. Additionally, we evaluate our model on the "real-world manner" RGB-D-D dataset, where both the HR and LR depth maps are provided by the sensors at their native resolutions.
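For concreteness, a hedged sketch of this training configuration follows. The milestone list reflects our reading of the schedule described above, the loss follows Equation (8), and the one-layer placeholder model, crop size, and scale factor are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder network standing in for the full model of Section 3.
model = nn.Conv2d(1, 1, 3, padding=1)

# Adam with an initial learning rate of 2.5e-4, halved at each milestone
# (every 25 epochs, last milestone at epoch 150; 200 epochs total).
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 50, 75, 100, 125, 150], gamma=0.5)


def simulate_lr(depth_hr, scale):
    """Simulate an LR depth map from an HR patch by bicubic downsampling."""
    return F.interpolate(depth_hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)


def pixel_loss(pred, target):
    """Per-pixel absolute-difference loss of Equation (8)."""
    return (pred - target).abs().mean()


def rmse(pred, target):
    """Evaluation metric (RMSE) used to report all results."""
    return torch.sqrt(F.mse_loss(pred, target))


# One illustrative step on a randomly cropped 256x256 HR depth patch
# (x4 scale, batch size 1 as in the paper).
depth_hr = torch.rand(1, 1, 256, 256)
depth_lr = simulate_lr(depth_hr, scale=4)
depth_up = F.interpolate(depth_lr, size=depth_hr.shape[-2:],
                         mode="bicubic", align_corners=False)
pred = model(depth_up)
loss = pixel_loss(pred, depth_hr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in practice, stepped once per epoch rather than per batch
print(float(loss), float(rmse(pred, depth_hr)))
```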
5. Results
Our model achieves state-of-the-art (SOTA) results on all benchmark test datasets compared to the baselines, demonstrating its ability to super-resolve depth at various resolutions as well as its generalization capabilities across multiple datasets. Table 1, Table 2, Table 3, Table 4 and Table 5 show a quantitative comparison between our model and previous works; the evaluation is based on the RMSE metric, and the best performance is marked in bold. Figure 5 shows a qualitative comparison, on the NYU v2 dataset, between our model and SUFT [24]. The visualizations show how our proposed attention fusion helps alleviate problems such as bleeding and blurring that occur in previous SOTA models. This happens because IGAF iteratively refines features by leveraging structural guidance from RGB and selectively emphasizing depth-specific details, which minimizes the over-transfer of irrelevant RGB features, a key cause of blurring and bleeding in prior methods. The SAF blocks incrementally learn attention weights, ensuring sharper edges and reducing distortions.
6. Ablation Study
We run ablations on NYU v2 for the ×4 DSR scenario. We study the effects of addition and concatenation as fusion strategies by replacing the attention fusion in our model with these two naive approaches. In Table 6, we show that our carefully designed, empirically motivated fusion module outperforms both, as expected.
We also study the effects of different settings of the IGAF module. The tested settings are (1) skip connections placed after the WF modules to propagate deeper features, rather than between the FE and WF modules, (2) an additional IGAF module (four in total in the ablation model), (3) the SAF blocks without MLP layers, i.e., the element-wise additions are not weighted, (4) MLP layers consisting of only one dense layer instead of two, and lastly (5) removing the WF module. Table 7 shows the importance of each component empirically.
We note that relocating the skip connection is not a good choice, as the propagated shallower high-frequency features carry more spatial information. The additional IGAF module also does not improve performance, as it increases the number of model parameters and the larger model tends to overfit the training data. Keeping the weights of the addition improves performance because the two parts are combined dynamically after the model has learned which features of each modality are important. Reducing the MLP to a single layer weakens the approximation of the weights, which, together with the previous ablation, supports the use of a two-layer MLP. Lastly, without WF, we lack the ability to dynamically enlarge the feature-processing receptive fields provided by this module and thus lose the ability to capture multi-resolution features effectively.
7. Conclusions
Given the importance of depth perception across its various applications, the ability to estimate accurate, higher-resolution depth information is crucial. We proposed an incremental guided attention fusion model for depth super-resolution that uses structural guidance from the RGB modality to provide intermediate structure to the processed features in every layer of the model, which makes the resulting HR depth map more accurate than those of existing methods, as well as free of blurring effects and distortions. Our model's main component, the IGAF module, performs a cross-modal attention fusion that fuses the RGB and depth modalities while simultaneously focusing on the important information in the intermediate fused features. We achieve state-of-the-art performance against all evaluated baselines on four benchmark datasets in which the LR depth maps were downsampled from the HR ground truths. Specifically, we demonstrate the ability of our model to generate high-quality super-resolved depth maps by training only on the NYU v2 dataset, and its ability to generalize in a zero-shot setting on the RGB-D-D, Lu, and Middlebury datasets, which shows the robustness of our method. Additionally, on a fifth dataset, where the LR and HR depth maps were collected using different sensors to mimic a real-world scenario, we also demonstrate better results than all existing methods.
Author Contributions: Conceptualization, A.T., C.K., K.J.M., H.D., D.F. and R.M.-S.; methodology, A.T., K.J.M., H.D. and C.K.; software, A.T. and R.M.-S.; validation, A.T., K.J.M. and C.K.; formal analysis, C.K., K.J.M. and A.T.; investigation, A.T.; resources, C.K., K.J.M., H.D., R.M.-S. and D.F.; data processing, A.T.; writing—original draft preparation, A.T., C.K. and K.J.M.; writing—review and editing, A.T., C.K. and K.J.M.; visualization, A.T.; supervision, D.F. and R.M.-S.; project administration, D.F. and R.M.-S.; funding acquisition, D.F. and R.M.-S. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the University of Glasgow (application number 300220059, 16 December 2022).
Data Availability Statement: The data underlying the results presented in this paper are available via the open-source links cited in the paper.
Conflicts of Interest: The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1. Overview of the proposed multi-modal architecture for the guided depth super resolution estimation.
Figure 2. The proposed multi-modal architecture utilizes information from both an LR depth map and an HR RGB image. Firstly, each modality passes through a convolutional layer followed by a LeakyReLU activation. The model utilizes the IGAF modules to combine information from the two modalities by fusing the relevant information on each stream and ignoring information that is unrelated to the depth maps. Finally, after the third IGAF module, the depth maps are refined and added using a global skip connection from the original upsampled LR depth maps. The RGB modality is used to provide guidance to estimate an HR depth map given an LR one.
Figure 3. The IGAF module. The module is responsible for both feature extraction and modality fusion. Each modality passes through a feature extraction stage (the FWB) before the initial naive fusion by an element-wise multiplication. An SAF block follows, which fuses the result of the multiplication with the extracted features of the RGB stream, creating an initial structural guidance. The second SAF block incrementally fuses this extracted structural guidance with the depth stream. The output of each SAF block is generated by learning attention weights and subsequently performing a cross-multiplication operation between the two input sequences, resulting in fused and salient processed information.
Figure 4. Overview of the FWB module. The two sub-modules are kept separate rather than combined into one larger module because the propagation of shallower features through the skip connections, as seen in Figure 3, boosts the performance of the model. The FE module is a series of convolutional layers, a channel attention process, and two skip connections. The WF module uses linearly increasing dilation rates in its convolutional layers to extract multi-resolution features.
Figure 5. Qualitative comparison between our model and SUFT [24]. The visualizations shown are for one of the evaluated scale factors. Our model creates more complete depth maps, as seen in (c) for rows 1 and 2. In (c), row 3 shows that our model creates sharper edges with minimal bleeding. Also, in (c), row 4, the proposed model creates less smoothing with less bleeding. (Colormap chosen for better visualization. Better seen in full screen, with zoom-in options.)
Results on the NYU v2 data set.
Method | Bicubic | DG [39] | SVLRM [40] | DKN [41] | FDSR [23] | SUFT [24] | CTKT [26] | JIIF [25] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 8.16 | 1.56 | 1.74 | 1.62 | 1.61 | 1.14 | 1.49 | 1.37 | 1.12 |
×8 | 14.22 | 2.99 | 5.59 | 3.26 | 3.18 | 2.57 | 2.73 | 2.76 | 2.48 |
×16 | 22.32 | 5.24 | 7.23 | 6.51 | 5.86 | 5.08 | 5.11 | 5.27 | 5.00 |
Results on the RGB-D-D data set.
Method | Bicubic | DJFR [42] | PAC [43] | DKN [41] | FDKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.00 | 3.35 | 1.25 | 1.30 | 1.18 | 1.16 | 1.17 | 1.20 | 1.08 |
×8 | 3.23 | 5.57 | 1.98 | 1.96 | 1.91 | 1.82 | 1.79 | 1.77 | 1.69 |
×16 | 5.16 | 7.99 | 3.49 | 3.42 | 3.41 | 3.06 | 2.87 | 2.81 | 2.69 |
Results on the Lu data set.
Method | Bicubic | DMSG [44] | DG [39] | DJF [45] | DJFR [42] | PAC [43] | JIIF [25] | DKN [41] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.42 | 2.30 | 2.06 | 1.65 | 1.15 | 1.20 | 0.85 | 0.96 | 0.82 |
×8 | 4.54 | 4.17 | 4.19 | 3.96 | 3.57 | 2.33 | 1.73 | 2.16 | 1.68 |
×16 | 7.38 | 7.22 | 6.90 | 6.75 | 6.77 | 5.19 | 4.16 | 5.11 | 4.14 |
Results on the “real-world manner” RGB-D-D data set.
Method | Bicubic | DJF [45] | DJFR [42] | FDKN [41] | DKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
---|---|---|---|---|---|---|---|---|---|
“real-world manner” | 9.15 | 7.90 | 8.01 | 7.50 | 7.38 | 7.50 | 8.41 | 7.17 | 7.01 |
Results on the Middlebury data set.
Method | Bicubic | PAC [43] | DKN [41] | FDKN [41] | CUNet [46] | JIIF [25] | SUFT [24] | FDSR [23] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.28 | 1.32 | 1.23 | 1.08 | 1.10 | 1.09 | 1.20 | 1.13 | 1.01 |
×8 | 3.98 | 2.62 | 2.12 | 2.17 | 2.17 | 1.82 | 1.76 | 2.08 | 1.73
×16 | 6.37 | 4.58 | 4.24 | 4.50 | 4.33 | 3.31 | 3.29 | 4.39 | 3.24 |
Demonstrating the importance of the proposed fusion module.
Fusion Method | Addition | Concatenation | IGAF |
---|---|---|---|
×4 | 1.23 | 1.22 | 1.12 |
Ablation results on the NYU v2 data set.
Test | Relocated Skip | Extra IGAF | Without MLP | One-Layer MLP | Without WF | Full Model
---|---|---|---|---|---|---|
×4 | 1.14 | 1.14 | 1.17 | 1.15 | 1.14 | 1.12 |
References
1. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. Proceedings of the Robotics Research: The 15th International Symposium ISRR; Flagstaff, AZ, USA, 28 August–1 September 2011; Springer: Cham, Switzerland, 2017; pp. 235-252.
2. Stowers, J.; Hayes, M.; Bainbridge-Smith, A. Altitude control of a quadrotor helicopter using depth map from Microsoft Kinect sensor. Proceedings of the 2011 IEEE International Conference on Mechatronics; Istanbul, Turkey, 13–15 April 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 358-362.
3. Melchiorre, M.; Scimmi, L.S.; Pastorelli, S.P.; Mauro, S. Collison avoidance using point cloud data fusion from multiple depth sensors: A practical approach. Proceedings of the 2019 23rd International Conference on Mechatronics Technology (ICMT); Salerno, Italy, 23–26 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1-6.
4. Gao, X.; Uchiyama, Y.; Zhou, X.; Hara, T.; Asano, T.; Fujita, H. A fast and fully automatic method for cerebrovascular segmentation on time-of-flight (TOF) MRA image. J. Digit. Imaging; 2011; 24, pp. 609-625. [DOI: https://dx.doi.org/10.1007/s10278-010-9326-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20824304]
5. Penne, J.; Höller, K.; Stürmer, M.; Schrauder, T.; Schneider, A.; Engelbrecht, R.; Feußner, H.; Schmauss, B.; Hornegger, J. Time-of-flight 3-D endoscopy. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Marrakesh, Morocco, 7–11 October 2009; Springer: Cham, Switzerland, 2009; pp. 467-474.
6. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Ji, X. Guided depth map super-resolution: A survey. ACM Comput. Surv.; 2023; 55, pp. 1-36. [DOI: https://dx.doi.org/10.1145/3584860]
7. Yang, Q.; Yang, R.; Davis, J.; Nistér, D. Spatial-depth super resolution for range images. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
8. Riemens, A.; Gangwal, O.; Barenbrug, B.; Berretty, R.P. Multistep joint bilateral depth upsampling. Proceedings of the Visual Communications and Image Processing; San Jose, CA, USA, 20–22 January 2009; SPIE: Bellingham, WA, USA, 2009; Volume 7257, pp. 192-203.
9. Liu, M.Y.; Tuzel, O.; Taguchi, Y. Joint geodesic upsampling of depth images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Portland, OR, USA, 23–28 June 2013; pp. 169-176.
10. Lo, K.H.; Wang, Y.C.F.; Hua, K.L. Edge-preserving depth map upsampling by joint trilateral filter. IEEE Trans. Cybern.; 2017; 48, pp. 371-384. [DOI: https://dx.doi.org/10.1109/TCYB.2016.2637661] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28129196]
11. Sun, Z.; Han, B.; Li, J.; Zhang, J.; Gao, X. Weighted guided image filtering with steering kernel. IEEE Trans. Image Process.; 2019; 29, pp. 500-508. [DOI: https://dx.doi.org/10.1109/TIP.2019.2928631] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31329117]
12. Qiao, Y.; Jiao, L.; Li, W.; Richardt, C.; Cosker, D. Fast, High-Quality Hierarchical Depth-Map Super-Resolution. Proceedings of the 29th ACM International Conference on Multimedia; Virtual Event, 20–24 October 2021; pp. 4444-4453.
13. Diebel, J.; Thrun, S. An application of markov random fields to range sensing. Adv. Neural Inf. Process. Syst.; 2005; 18, pp. 291-298.
14. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image guided depth upsampling using anisotropic total generalized variation. Proceedings of the IEEE International Conference on Computer Vision; Sydney, Australia, 1–8 December 2013; pp. 993-1000.
15. Newcombe, R.A.; Fox, D.; Seitz, S.M. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 343-352.
16. Park, J.; Kim, H.; Tai, Y.W.; Brown, M.S.; Kweon, I.S. High-quality depth map upsampling and completion for RGB-D cameras. IEEE Trans. Image Process.; 2014; 23, pp. 5559-5572. [DOI: https://dx.doi.org/10.1109/TIP.2014.2361034] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25291793]
17. Riegler, G.; Rüther, M.; Bischof, H. Atgv-net: Accurate depth super-resolution. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14; Springer: Cham, Switzerland, 2016; pp. 268-284.
18. Jiang, Z.; Yue, H.; Lai, Y.K.; Yang, J.; Hou, Y.; Hou, C. Deep edge map guided depth super resolution. Signal Process. Image Commun.; 2021; 90, 116040. [DOI: https://dx.doi.org/10.1016/j.image.2020.116040]
19. Song, X.; Dai, Y.; Qin, X. Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network. Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision; Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part IV 13 Springer: Cham, Switzerland, 2017; pp. 360-376.
20. Song, X.; Dai, Y.; Zhou, D.; Liu, L.; Li, W.; Li, H.; Yang, R. Channel attention based iterative residual learning for depth map super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 14–19 June 2020; pp. 5631-5640.
21. Ye, X.; Sun, B.; Wang, Z.; Yang, J.; Xu, R.; Li, H.; Li, B. Depth super-resolution via deep controllable slicing network. Proceedings of the 28th ACM International Conference on Multimedia; Virtual Event, 12–16 October 2020; pp. 1809-1818.
22. Huang, L.; Zhang, J.; Zuo, Y.; Wu, Q. Pyramid-structured depth map super-resolution based on deep dense-residual network. IEEE Signal Process. Lett.; 2019; 26, pp. 1723-1727. [DOI: https://dx.doi.org/10.1109/LSP.2019.2944646]
23. He, L.; Zhu, H.; Li, F.; Bai, H.; Cong, R.; Zhang, C.; Lin, C.; Liu, M.; Zhao, Y. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Virtual, 19–25 June 2021; pp. 9229-9238.
24. Shi, W.; Ye, M.; Du, B. Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution. Proceedings of the 30th ACM International Conference on Multimedia; Lisboa, Portugal, 10–14 October 2022; pp. 3867-3876.
25. Tang, J.; Chen, X.; Zeng, G. Joint implicit image function for guided depth super-resolution. Proceedings of the 29th ACM International Conference on Multimedia; Virtual, 20–24 October 2021; pp. 4390-4399.
26. Sun, B.; Ye, X.; Li, B.; Li, H.; Wang, Z.; Xu, R. Learning scene structure guidance via cross-task knowledge transfer for single depth super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Virtual, 19–25 June 2021; pp. 7792-7801.
27. Tang, Q.; Cong, R.; Sheng, R.; He, L.; Zhang, D.; Zhao, Y.; Kwong, S. Bridgenet: A joint learning network of depth map super-resolution and monocular depth estimation. Proceedings of the 29th ACM International Conference on Multimedia; Virtual, 20–24 October 2021; pp. 2148-2157.
28. Zhao, Z.; Zhang, J.; Xu, S.; Lin, Z.; Pfister, H. Discrete cosine transform network for guided depth map super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 5697-5707.
29. Zhao, Z.; Zhang, J.; Gu, X.; Tan, C.; Xu, S.; Zhang, Y.; Timofte, R.; Van Gool, L. Spherical space feature decomposition for guided depth map super-resolution. arXiv; 2023; arXiv: 2303.08942
30. Xu, D.; Fan, X.; Gao, W. Multiscale Attention Fusion for Depth Map Super-Resolution Generative Adversarial Networks. Entropy; 2023; 25, 836. [DOI: https://dx.doi.org/10.3390/e25060836]
31. Wang, J.; Huang, Q. Depth Map Super-Resolution Reconstruction Based on Multi-Channel Progressive Attention Fusion Network. Appl. Sci.; 2023; 13, 8270. [DOI: https://dx.doi.org/10.3390/app13148270]
32. Song, X.; Zhou, D.; Li, W.; Dai, Y.; Liu, L.; Li, H.; Yang, R.; Zhang, L. WAFP-Net: Weighted Attention Fusion Based Progressive Residual Learning for Depth Map Super-Resolution. IEEE Trans. Multimed.; 2021; 24, pp. 4113-4127. [DOI: https://dx.doi.org/10.1109/TMM.2021.3118282]
33. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Chen, Z.; Ji, X. High-resolution depth maps imaging via attention-based hierarchical multi-modal fusion. IEEE Trans. Image Process.; 2021; 31, pp. 648-663. [DOI: https://dx.doi.org/10.1109/TIP.2021.3131041] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34878976]
34. Tragakis, A.; Kaul, C.; Murray-Smith, R.; Husmeier, D. The fully convolutional transformer for medical image segmentation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Waikoloa, HI, USA, 2–7 January 2023; pp. 3660-3669.
35. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision; Florence, Italy, 7–13 October 2012; Proceedings, Part V 12 Springer: Cham, Switzerland, 2012; pp. 746-760.
36. Hirschmuller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
37. Scharstein, D.; Pal, C. Learning conditional random fields for stereo. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
38. Lu, S.; Ren, X.; Liu, F. Depth enhancement via low-rank matrix completion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA, 23–28 June 2014; pp. 3390-3397.
39. Gu, S.; Zuo, W.; Guo, S.; Chen, Y.; Chen, C.; Zhang, L. Learning dynamic guidance for depth image enhancement. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 3769-3778.
40. Pan, J.; Dong, J.; Ren, J.S.; Lin, L.; Tang, J.; Yang, M.H. Spatially variant linear representation models for joint filtering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 1702-1711.
41. Kim, B.; Ponce, J.; Ham, B. Deformable kernel networks for guided depth map upsampling. arXiv; 2019; arXiv: 1903.11286
42. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 41, pp. 1909-1923. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2890623] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30605094]
43. Su, H.; Jampani, V.; Sun, D.; Gallo, O.; Learned-Miller, E.; Kautz, J. Pixel-adaptive convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 11166-11175.
44. Hui, T.W.; Loy, C.C.; Tang, X. Depth map super-resolution by deep multi-scale guidance. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14 Springer: Cham, Switzerland, 2016; pp. 353-369.
45. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep joint image filtering. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14 Springer: Cham, Switzerland, 2016; pp. 154-169.
46. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 43, pp. 3333-3348. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.2984244] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32248098]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental guided attention fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for ×4, ×8, and ×16 upsampling.
Details


1 School of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, UK;
2 School of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK;