1. Introduction
Accurate and useful visual perception is conventionally achieved using RGB and depth sensors. Depth sensors, due to their small form factor, low cost, and low power consumption, are very popular in many fields of research such as robotics [1,2,3], medical imaging [4,5], augmented reality, and consumer electronics. However, they typically have lower spatial resolution than conventional imaging modalities such as RGB, leading to information loss that can be overcome with accurate super-resolution techniques. To recover this high resolution, existing techniques leverage correlations between the sharp high-frequency texture edges of RGB images and the low-resolution edge discontinuities of depth images. Generic super-resolution solutions often prove inadequate for depth data because of their limited ability to incorporate the unique characteristics and complexities inherent in depth maps [6]. To this end, the task of depth super-resolution (DSR) is to recover a high-resolution depth map from its lower-resolution counterpart.
The depth super-resolution literature can be broadly categorized into three approaches: filtering, optimization, and learning-driven strategies. Filtering-driven DSR [7,8,9,10,11,12] relies on constructing filters from neighboring pixels in the LR depth map and the HR guidance image; a downside of this approach is the creation of artifacts and errors when the scenes to be super-resolved are complex. Optimization-driven DSR [13,14,15,16] converts super-resolution into an optimization problem in which a cost function between the LR depth map and the HR depth map is minimized; this approach depends on selecting an appropriate cost function and is highly sensitive to that choice. Finally, learning-driven approaches [17,18,19,20] have in recent years made use of deep learning techniques, which have quickly become the de facto solution of choice in the field of DSR.
In GDSR, most approaches rely on fusion after the feature extraction stage: the unique per-modality features are fused together to create an HR depth map. Both strong feature extractors and strong fusion modules are important for this task, as inadequate feature extractors fail to provide the distinct information on crucial features needed for sharp reconstructions. The RGB image is responsible for providing structure so that the result does not suffer from depth bleeding. This is not trivial, as over-transfer may occur and structures from the RGB image that are irrelevant to depth can be transferred to the depth map; for example, given a picture printed on the cover of a book, the network should identify that its textures are irrelevant to depth reconstruction. Equally important to the feature extractors are the fusion modules that combine the two branches and refine the available information to estimate clear depths.
Existing works face two major limitations: (1) weak feature extractors that fail to capture the distinct and complementary characteristics of the RGB and depth modalities, and (2) naive fusion strategies, such as simple concatenation or addition, which result in modality-specific artifacts, including over-transfer of irrelevant RGB features and insufficient depth-specific enhancement. These issues often lead to blurring, depth bleeding, and misaligned structures in the final depth maps. Our approach, IGAF (see Figure 1 for an overview), systematically addresses these challenges (the code is available online). Our main contributions are as follows:
We propose the incremental guided attention fusion (IGAF) model, which surpasses existing works on the DSR task across all tested benchmark datasets and resolutions.
We propose the IGAF module, a flexible and adaptive attention fusion strategy that effectively fuses multi-modal features by creating weights from both modalities and then applying a two-step cross fusion.
We propose the filtered wide-focus block (FWB), a strong feature extractor composed of two modules: the feature extractor (FE) and wide-focus (WF). The FE, with the help of channel attention, highlights relevant feature channels in the feature volumes, while the WF, using varying dilation rates in the convolution layers of each branch, creates multi-receptive-field spatial information that integrates global-resolution features with dynamic receptive fields to better highlight textures and edges. The combination of the two forms a general-purpose feature extractor specifically tailored towards DSR.
2. Literature Review
Depth Super-Resolution Architectures. DSR techniques are broadly categorized into those that use RGB or grayscale images as guidance and those that do not. Non-guided DSR techniques [17,21,22] try to solve the task using only an LR depth map. This simplifies data acquisition (syncing different modalities is not required, leading to smaller datasets) and the model itself, since no sensor fusion is needed for an additional stream. This simplicity, however, comes at the cost of over-smoothed edges, especially on the contours of objects, as well as blurring and distortion effects in the super-resolved depth maps.
GDSR techniques address over-smoothed edges by using structural and textural information from RGB or grayscale images. Additional techniques are needed to prevent the over-transfer of information from the guidance stream and to retain only the features that are relevant. Ref. [23] propose a fast model utilizing the high-frequency information of the guidance RGB stream using octave convolutions, but fuse the information from the two branches by a simple concatenation. Ref. [24], on the other hand, propose to fuse information between the two modalities through a symmetric uncertainty incorporated into their system. Ref. [25] use a joint implicit function representation to learn the interpolation weights and values for the HR depth simultaneously. Ref. [26] employ knowledge distillation such that the guidance stream is only needed during training, simplifying the model at test time. Ref. [27] utilize bridges to fuse information during multi-task learning; the two tasks in their system are depth super-resolution and monocular depth estimation. Additional novel techniques include [26,28,29].
Attention Feature Fusion in Depth Super-Resolution. Feature fusion techniques are crucial for multi-modal data processing. They range from a simple addition or concatenation of multiple features to complex feature processing modules. Ref. [30] place a hierarchical attention fusion module in the generator of a generative adversarial network for this task. Ref. [31] employ an attention fusion strategy to adaptively utilize information from both modalities by first enhancing the features and then using an attention mechanism to fuse the two branches. Ref. [32] also propose a two-step approach, in which a weighted attention fusion followed by high-frequency reconstruction generates the resulting high-resolution depth image. Ref. [20] use channel attention combined with reconstruction in their proposed module, whilst [33] build a fusion module consisting of a feature enhancement and a feature re-calibration step.
Existing works fail to effectively leverage both modalities to create fusion weights that accurately propagate relevant features. This limitation often results in the over-transfer of RGB features or insufficient depth-specific enhancement, leading to artifacts such as depth bleeding and texture misalignment. Our approach (see Figure 2) overcomes these drawbacks through a flexible and more powerful attention-based mechanism. By creating weights from one modality to iteratively guide the fusion with the other, we ensure that only the most relevant features are propagated. Unlike previous methods that rely on simple concatenation or addition for fusion, we introduce an incremental guided attention fusion (IGAF) module that performs cross-modal attention in iterative steps. This process eliminates the over-transfer of RGB features while emphasizing critical depth-specific information. Specifically, we first create a naive fusion of the RGB and depth modalities (an element-wise multiplication), then create structural guidance for the depth modality by learning a set of attention weights from the naive fusion for the RGB image (the first spatial attention fusion (SAF) block; see Figure 3). We then use this intermediate fusion as structural guidance for the depth image to create a better fusion output than existing methods (the second SAF block).
3. Methodology
3.1. Problem Statement
Consider a dataset $\mathcal{D} = \{ (I_{RGB}^{i}, D_{LR}^{i}, D_{HR}^{i}) \}_{i=1}^{M}$, where $I_{RGB}^{i}$ represents the RGB images, $D_{LR}^{i}$ the corresponding LR depth maps, and $D_{HR}^{i}$ the corresponding HR depth maps, for each RGB image $I_{RGB}^{i}$. Each LR depth map is $D_{LR} \in \mathbb{R}^{1 \times \frac{H}{s} \times \frac{W}{s}}$ and each HR depth map is $D_{HR} \in \mathbb{R}^{1 \times H \times W}$, where $H$ and $W$ are the spatial resolutions of the images. The 1 in $D_{LR}$ (and $D_{HR}$) and the 3 in $I_{RGB} \in \mathbb{R}^{3 \times H \times W}$ refer to the number of input channels, while $s$ is the scale factor between the HR and LR depth maps. The model estimates an HR depth map $\hat{D}_{HR} \in \mathbb{R}^{1 \times H \times W}$ by first upsampling $D_{LR}$ to $D_{up} \in \mathbb{R}^{1 \times H \times W}$ using bicubic interpolation such that the dimensions of $D_{up}$ and $D_{HR}$ match. A formal representation is:

$$\hat{D}_{HR} = f_{\theta}(I_{RGB}, D_{up}) + D_{up}, \tag{1}$$

where $f_{\theta}$ is the learned function that maps $I_{RGB}$ and $D_{up}$ to the predicted HR depth map $\hat{D}_{HR}$, and $\theta$ represents the learned parameters. The addition operation in Equation (1) represents the global residual connection shown in Figure 2.

3.2. Model Architecture
We follow the conventional architecture of a dual-stream model, as depicted in Figure 2. Our model takes two inputs, the RGB guidance image and the upsampled LR depth map. First, each modality is processed by a convolutional layer followed by a LeakyReLU activation. This is followed by three IGAF modules, which extract and fuse the multi-modal features from the two input modalities. After the fusion modules, the depth is refined through our refinement block, and a global skip connection adds the upsampled LR depth map to the final feature representation to produce the final prediction. The predicted depth map is calculated as
$$\hat{D}_{HR} = \mathrm{Refine}\Big( \mathrm{IGAF}_{3}\big( \mathrm{IGAF}_{2}\big( \mathrm{IGAF}_{1}\big( \phi(\mathrm{Conv}(I_{RGB})),\ \phi(\mathrm{Conv}(D_{up})) \big) \big) \big) \Big) + D_{up}, \tag{2}$$

where $\phi$ is the LeakyReLU activation and $\mathrm{Refine}(\cdot)$ is the depth refinement block. The depth refinement consists of 3 feature extractor modules (see Section 3.2.1) and a convolution-LeakyReLU-convolution stack of layers.

3.2.1. The IGAF Module
Each IGAF module processes two inputs and provides two outputs (Figure 3). For the last IGAF module, we only propagate the depth stream forward into the depth refinement block and ignore the second output.
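To make the overall data flow concrete, the following is a minimal PyTorch sketch of the dual-stream skeleton of Section 3.2 together with the two-input/two-output module interface described here. It is an illustration under our own assumptions (the channel width, kernel sizes, simplified refinement head, and the trivial `IGAFModule` stand-in are ours), not the released implementation; the real module is detailed in the rest of this section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IGAFModule(nn.Module):
    """Trivial stand-in for the real IGAF module: takes the RGB and depth
    feature streams and returns two updated streams (two inputs, two outputs)."""

    def __init__(self, c):
        super().__init__()
        self.mix = nn.Conv2d(2 * c, 2 * c, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        f_rgb_out, f_dep_out = self.mix(torch.cat([f_rgb, f_dep], dim=1)).chunk(2, dim=1)
        return f_rgb_out, f_dep_out


class IGAFNet(nn.Module):
    """Dual-stream skeleton: per-modality conv + LeakyReLU stem, three fusion
    modules, a depth refinement head, and a global residual from the upsampled
    LR depth map (Equation (1))."""

    def __init__(self, c=64, num_modules=3):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, c, 3, padding=1)
        self.depth_stem = nn.Conv2d(1, c, 3, padding=1)
        self.fusion = nn.ModuleList(IGAFModule(c) for _ in range(num_modules))
        # Simplified refinement head (the paper uses FE modules plus a
        # conv-LeakyReLU-conv stack).
        self.refine = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c, 1, 3, padding=1))

    def forward(self, rgb, depth_lr):
        # Bicubic upsampling so the LR depth matches the RGB resolution.
        depth_up = F.interpolate(depth_lr, size=rgb.shape[-2:],
                                 mode="bicubic", align_corners=False)
        f_rgb = F.leaky_relu(self.rgb_stem(rgb), 0.2)
        f_dep = F.leaky_relu(self.depth_stem(depth_up), 0.2)
        for block in self.fusion:
            f_rgb, f_dep = block(f_rgb, f_dep)
        # Only the depth stream of the last module is used; its RGB output is
        # ignored. The global skip connection adds the upsampled LR depth back.
        return self.refine(f_dep) + depth_up


# Usage: x16 super-resolution of a 16x16 depth map guided by a 256x256 RGB image.
net = IGAFNet()
pred = net(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 16, 16))  # (1, 1, 256, 256)
```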
At first, each stream passes through a two-piece feature extraction block, the FWB, consisting of a general feature extractor (FE) and a wide-focus (WF) block. The FE processes the input using a convolution–LeakyReLU–convolution stack of layers. Next, a channel attention module focuses only on the relevant channels while reducing the influence of the less important or noisy ones. Finally, we employ an element-wise addition between the input of the module and the output of the channel attention module, followed by another convolutional layer and a skip connection that is global within the module, which propagates the global structure of the depth forward through the model. For simplicity in the explanations and equations, we treat N in the figures as 1, although during training N = 10 was used. The design of the FE module was chosen empirically after multiple training runs, alternating between channel attention, spatial attention, and a combination of both. For the FE module, we have:
$$F_{\mathrm{FE}} = x + \mathrm{Conv}\big( x + \mathrm{CA}_{\sigma}(\mathrm{Conv}(\phi(\mathrm{Conv}(x)))) \big), \tag{3}$$
where $F_{\mathrm{FE}}$ represents the feature maps output by the FE module, $x$ is the feature map input, $\mathrm{CA}_{\sigma}$ denotes the channel attention operation, and $\sigma$ is the sigmoid activation used for its gating. The WF module is represented as:

$$F_{\mathrm{WF}} = \mathrm{Drop}\Big( \phi\Big( \mathrm{Conv}\Big( \textstyle\sum_{d} \mathrm{Drop}\big( \phi(\mathrm{Conv}_{d}(x)) \big) \Big) \Big) \Big), \tag{4}$$
where $F_{\mathrm{WF}}$ is the feature maps output by the WF module, $x$ is the feature map input, $\mathrm{Conv}_{d}$ is a convolution with dilation rate $d$, $\mathrm{Drop}$ is dropout, and $\phi$ is the LeakyReLU activation. See Figure 4. Wide-focus is an efficient feature extractor first introduced by [34] for medical image segmentation and has shown great promise in extracting multi-scale features from feature representations. It contains three branches, each with a different dilation rate for the convolution kernels, followed by an activation layer and a dropout layer to prevent overfitting. After the element-wise addition of the branches, another convolutional layer extracts features from the gradually increased receptive fields of the dilated convolution layers to aggregate the extracted multi-resolution features. This is again followed by an activation layer and a dropout layer to avoid overfitting. A WF block is used after every FE module to aggregate the multi-resolution hierarchical features extracted in each layer. After the FE module, the RGB stream uses a skip connection to propagate the extracted features to the next IGAF module, further propagating the global scene structure forward within the model. We observed from our experiments that not placing a skip connection after the WF module or at later stages is an effective strategy for learning better scene structure, as forwarding shallower features helps propagate high-frequency structure better through the model, which can be verified through our ablations in Section 6.
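As an illustration of the two sub-blocks, below is a hedged PyTorch sketch of our reading of the FE and WF descriptions above; it is not the authors' released code, and the kernel sizes, the channel-attention reduction ratio, and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureExtractor(nn.Module):
    """FE sketch: conv-LeakyReLU-conv, sigmoid-gated channel attention,
    addition with the block input, a final conv, and a block-level skip."""

    def __init__(self, c, reduction=4):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)
        # Squeeze-and-excitation style channel attention (assumed form).
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())

    def forward(self, x):
        z = self.conv2(F.leaky_relu(self.conv1(x), 0.2))
        z = self.ca(z) * z              # highlight relevant channels
        return x + self.conv3(x + z)    # addition with input + block-level skip


class WideFocus(nn.Module):
    """Wide-Focus sketch: three parallel convs with increasing dilation rates,
    summed element-wise and aggregated by a final conv; activation and dropout
    follow each stage."""

    def __init__(self, c, p_drop=0.1):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in (1, 2, 3))
        self.agg = nn.Conv2d(c, c, 3, padding=1)
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        y = sum(self.drop(F.leaky_relu(b(x), 0.2)) for b in self.branches)
        return self.drop(F.leaky_relu(self.agg(y), 0.2))
```

Stacking `FeatureExtractor` followed by `WideFocus` then corresponds to one FWB, with the RGB-stream skip connection taken from the `FeatureExtractor` output, as described above.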
The first fusion in the IGAF module (see Figure 3) is an element-wise multiplication of both modalities. Its result (a) creates intermediate feature weights and (b) is used in an element-wise weighted addition between the two intermediate features of the SAF block. Similarly, the extracted features from the RGB stream (a) create intermediate feature weights and (b) form the second component of the weighted addition. The addition can be seen as adding features from a joint representation of the RGB and LR depth images weighted by their common features. This helps the model focus both on high-level semantic structures in the image through the depth features and on high-frequency features from the RGB images, while the weighting itself is learned via backpropagation. For each component, weights are extracted and applied in a crosswise fashion, i.e., weights from one component are applied to the other component, resulting in a spatial attention fusion block. This allows the model to learn, across both modalities, which features should have limited influence on the output, resulting in a smoother output depth map. The weights are learnable and created via two-layer MLPs. A formalized expression of the SAF block is:
$$\mathbf{w}_{1} = \sigma\big(W_{12}(W_{11} x_{1} + b_{11}) + b_{12}\big), \tag{5}$$

$$\mathbf{w}_{2} = \sigma\big(W_{22}(W_{21} x_{2} + b_{21}) + b_{22}\big), \tag{6}$$

$$F_{\mathrm{SAF}} = \mathbf{w}_{1} \odot x_{2} + \mathbf{w}_{2} \odot x_{1}, \tag{7}$$

where $x_{1}$ and $x_{2}$ are the two inputs of the SAF block, the $W$ terms represent the MLP weights, the $b$ terms the biases, $\odot$ denotes element-wise multiplication, and $\sigma$ is the sigmoid activation. Equations (5) and (6) show the two MLP layers used to create the weights, which are applied cross-wise to the inputs, as seen in Equation (7). The output of the first SAF block passes through a convolutional layer for joint feature processing and is then used as input for the second SAF block. This convolutional layer extracts shared features from the two fused modalities. The second SAF block works in a similar manner to the first, but now fuses the joint features from the two modalities with the depth features. In summary, the first SAF block fuses together the extracted features from the RGB stream and the naive feature fusion obtained by the element-wise multiplication, while the second SAF block fuses together the result of the first SAF block, after the convolutional layer, and the output of the FWB of the depth stream. This fusion is incremental in nature, as we iteratively combine RGB and depth features in multiple steps to create a cross-modal fusion of attributes, leading to simultaneously processing both structure and depth.
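To make the cross-wise weighting and the incremental fusion concrete, the following is a hedged PyTorch sketch of one possible SAF block and of the fusion path inside an IGAF module. It reflects our own assumptions rather than the released implementation: the two-layer MLPs are realized as 1x1 convolutions with a LeakyReLU between them, and the channel width is arbitrary.

```python
import torch
import torch.nn as nn


class SAF(nn.Module):
    """Spatial attention fusion sketch: a two-layer MLP (1x1 convs) per input
    produces sigmoid weights that are applied cross-wise (Equations (5)-(7))."""

    def __init__(self, c):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Conv2d(c, c, 1), nn.LeakyReLU(0.2),
                                 nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.w_a, self.w_b = mlp(), mlp()

    def forward(self, a, b):
        # Weights derived from one input modulate the other input.
        return self.w_a(a) * b + self.w_b(b) * a


class IGAFFusion(nn.Module):
    """Incremental fusion path: naive element-wise product of the two streams,
    SAF with the RGB features, a joint conv, then SAF with the depth features."""

    def __init__(self, c):
        super().__init__()
        self.saf1, self.saf2 = SAF(c), SAF(c)
        self.joint = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        naive = f_rgb * f_dep                         # initial naive fusion
        guided = self.saf1(naive, f_rgb)              # structural guidance from RGB
        return self.saf2(self.joint(guided), f_dep)   # fuse with the depth stream


# Usage on dummy 64-channel feature maps.
fuse = IGAFFusion(64)
out = fuse(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```

Combining this fusion path with the FWB feature extraction of both streams, and returning both updated streams, would yield one full IGAF module as sketched earlier.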
4. Experiments
We test our model on four benchmark datasets commonly used to compare proposed models for the DSR task: the NYU v2 [35], Middlebury [36,37], Lu [38], and RGB-D-D [23] datasets.
We only train on the NYU v2 dataset and do not fine-tune the model on the others; our results on the remaining datasets are therefore zero-shot predictions that demonstrate the generalization ability of our model. NYU v2 contains 1449 pairs of RGB and depth images. The first 1000 images are used to train the model and the remaining 449 are used for evaluation. For the Middlebury dataset, we use the provided 30 RGB and depth image pairs, and for the Lu dataset we use the 6 pairs, following previous works [23,24,25,38] to report results consistent with other methods. For RGB-D-D, we use 405 RGB and depth image pairs following [23].
Implementation Details: We run all our experiments on one RTX 3090 GPU using PyTorch 2.0.1. The initial learning rate is set to 0.00025 and is halved at each milestone of the MultiStepLR scheduler. The milestones are set every 25 epochs, with the final one at epoch 150, out of a total of 200 epochs. A batch size of 1 is used to train the model. We use the Adam optimizer and report all our results with the Root Mean Square Error (RMSE) metric. We use the following loss to train our model:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \big| D_{gt}(i) - \hat{D}_{HR}(i) \big|, \tag{8}$$
where $N$ is the number of pixels, $D_{gt}$ is the ground truth depth map and $\hat{D}_{HR}$ is the predicted depth map. During training, we use randomly cropped patches of the HR image. The LR depth maps are simulated by bicubic downsampling, which is consistent with other approaches using the same datasets. Additionally, we evaluate our model on the "real-world manner" RGB-D-D dataset, where both the HR and LR depth maps are provided by the sensors at their native resolutions.
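For concreteness, a hedged sketch of this training configuration follows. The milestone list reflects our reading of the schedule described above, the loss follows Equation (8), and the one-layer placeholder model, crop size, and scale factor are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder network standing in for the full model of Section 3.
model = nn.Conv2d(1, 1, 3, padding=1)

# Adam with an initial learning rate of 2.5e-4, halved at each milestone
# (every 25 epochs, last milestone at epoch 150; 200 epochs total).
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 50, 75, 100, 125, 150], gamma=0.5)


def simulate_lr(depth_hr, scale):
    """Simulate an LR depth map from an HR patch by bicubic downsampling."""
    return F.interpolate(depth_hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)


def pixel_loss(pred, target):
    """Per-pixel absolute-difference loss of Equation (8)."""
    return (pred - target).abs().mean()


def rmse(pred, target):
    """Evaluation metric (RMSE) used to report all results."""
    return torch.sqrt(F.mse_loss(pred, target))


# One illustrative step on a randomly cropped 256x256 HR depth patch
# (x4 scale, batch size 1 as in the paper).
depth_hr = torch.rand(1, 1, 256, 256)
depth_lr = simulate_lr(depth_hr, scale=4)
depth_up = F.interpolate(depth_lr, size=depth_hr.shape[-2:],
                         mode="bicubic", align_corners=False)
pred = model(depth_up)
loss = pixel_loss(pred, depth_hr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in practice, stepped once per epoch rather than per batch
print(float(loss), float(rmse(pred, depth_hr)))
```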
5. Results
Our model achieves state-of-the-art (SOTA) results on all benchmark test datasets compared to the baselines, demonstrating its ability to super-resolve depth at various resolutions as well as its generalization capabilities across multiple datasets. Table 1, Table 2, Table 3, Table 4 and Table 5 show a quantitative comparison between our model and previous works; the evaluation is based on the RMSE metric, and the best performance is marked in bold. Figure 5 shows a qualitative comparison, on the NYU v2 dataset, between our model and SUFT [24]. The visualizations show how our proposed attention fusion helps alleviate problems such as bleeding and blurring that occur in previous SOTA models. This happens because IGAF iteratively refines features by leveraging structural guidance from RGB and selectively emphasizing depth-specific details, which minimizes the over-transfer of irrelevant RGB features, a key cause of blurring and bleeding in prior methods. The SAF blocks incrementally learn attention weights, ensuring sharper edges and reducing distortions.
6. Ablation Study
We run ablations on NYU v2 for the ×4 DSR scenario. We study the effects of addition and concatenation as fusion strategies by replacing the attention fusion in our model with these two naive approaches. In Table 6, we show that our carefully designed, empirically motivated fusion module outperforms both, as expected.
We also study the effects of different settings of the IGAF module. The tested settings are (1) skip connections placed after the WF modules to propagate deeper features, rather than between the FE and WF modules, (2) an additional IGAF module (four in total in the ablation model), (3) the SAF blocks without MLP layers, i.e., the element-wise additions are not weighted, (4) MLP layers consisting of only one dense layer instead of two, and lastly (5) removing the WF module. Table 7 shows the importance of each component empirically.
We note that relocating the skip connection is not a good choice, as the propagated shallower high-frequency features carry more spatial information. The additional IGAF module also does not improve performance, as it increases the number of model parameters and the larger model tends to overfit the training data. Keeping the weights of the addition improves performance because the two parts are combined dynamically after the model has learned which features of each modality are important. Reducing the MLP to a single layer weakens the approximation of the weights, which, together with the previous ablation, supports the use of a two-layer MLP. Lastly, without WF, we lack the ability to dynamically enlarge the feature-processing receptive fields provided by this module and thus lose the ability to capture multi-resolution features effectively.
7. Conclusions
Given the importance of depth perception across its various applications, the ability to estimate accurate, higher-resolution depth information is crucial. We proposed an incremental guided attention fusion model for depth super-resolution that uses structural guidance from the RGB modality to provide intermediate structure to the processed features in every layer of the model, which makes the resulting HR depth map more accurate than those of existing methods, as well as free of blurring effects and distortions. Our model's main component, the IGAF module, performs a cross-modal attention fusion that fuses the RGB and depth modalities while simultaneously focusing on the important information in the intermediate fused features. We achieve state-of-the-art performance against all evaluated baselines on four benchmark datasets in which the LR depth maps were downsampled from the HR ground truths. Specifically, we demonstrate the ability of our model to generate high-quality super-resolved depth maps by training only on the NYU v2 dataset, and its ability to generalize in a zero-shot setting on the RGB-D-D, Lu, and Middlebury datasets, which shows the robustness of our method. Additionally, on a fifth dataset, where the LR and HR depth maps were collected using different sensors to mimic a real-world scenario, we also demonstrate better results than all existing methods.
Author Contributions: Conceptualization, A.T., C.K., K.J.M., H.D., D.F. and R.M.-S.; methodology, A.T., K.J.M., H.D. and C.K.; software, A.T. and R.M.-S.; validation, A.T., K.J.M. and C.K.; formal analysis, C.K., K.J.M. and A.T.; investigation, A.T.; resources, C.K., K.J.M., H.D., R.M.-S. and D.F.; data processing, A.T.; writing—original draft preparation, A.T., C.K. and K.J.M.; writing—review and editing, A.T., C.K. and K.J.M.; visualization, A.T.; supervision, D.F. and R.M.-S.; project administration, D.F. and R.M.-S.; funding acquisition, D.F. and R.M.-S. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the University of Glasgow (application number 300220059, 16 December 2022).
Data Availability Statement: The data underlying the results presented in this paper are available via the open-source links cited in the paper.
Conflicts of Interest: The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1. Overview of the proposed multi-modal architecture for the guided depth super resolution estimation.
Figure 2. The proposed multi-modal architecture utilizes information from both an LR depth map and an HR RGB image. Firstly, each modality passes through a convolutional layer followed by a LeakyReLU activation. The model utilizes the IGAF modules to combine information from the two modalities by fusing the relevant information on each stream and ignoring information that is unrelated to the depth maps. Finally, after the third IGAF module, the depth maps are refined and added using a global skip connection from the original upsampled LR depth maps. The RGB modality is used to provide guidance to estimate an HR depth map given an LR one.
Figure 3. The IGAF module. The module is responsible for both feature extraction and modality fusion. Each modality passes through a feature extraction stage (the FWB) before the initial naive fusion by an element-wise multiplication. An SAF block follows, which fuses the result of the multiplication with the extracted features of the RGB stream, creating an initial structural guidance. The second SAF block incrementally fuses this extracted structural guidance with the depth stream. The output of each SAF block is generated by learning attention weights and subsequently performing a cross-multiplication operation between the two input sequences, resulting in fused and salient processed information.
Figure 4. Overview of the FWB module. The two sub-modules are kept separate rather than combined into one larger module because the propagation of shallower features through the skip connections, as seen in Figure 3, boosts the performance of the model. The FE module is a series of convolutional layers, a channel attention process, and two skip connections. The WF module uses linearly increasing dilation rates in its convolutional layers to extract multi-resolution features.
Figure 5. Qualitative comparison between our model and SUFT [24]. The visualizations shown are for one of the evaluated scale factors. Our model creates more complete depth maps, as seen in (c) for rows 1 and 2. In (c), row 3 shows that our model creates sharper edges with minimal bleeding. Also, in (c), row 4, the proposed model creates less smoothing with less bleeding. (Colormap chosen for better visualization. Better seen in full screen, with zoom-in options.)
Results on the NYU v2 data set.
Method | Bicubic | DG [39] | SVLRM [40] | DKN [41] | FDSR [23] | SUFT [24] | CTKT [26] | JIIF [25] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 8.16 | 1.56 | 1.74 | 1.62 | 1.61 | 1.14 | 1.49 | 1.37 | 1.12 |
×8 | 14.22 | 2.99 | 5.59 | 3.26 | 3.18 | 2.57 | 2.73 | 2.76 | 2.48 |
×16 | 22.32 | 5.24 | 7.23 | 6.51 | 5.86 | 5.08 | 5.11 | 5.27 | 5.00 |
Results on the RGB-D-D data set.
Method | Bicubic | DJFR [42] | PAC [43] | DKN [41] | FDKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.00 | 3.35 | 1.25 | 1.30 | 1.18 | 1.16 | 1.17 | 1.20 | 1.08 |
×8 | 3.23 | 5.57 | 1.98 | 1.96 | 1.91 | 1.82 | 1.79 | 1.77 | 1.69 |
×16 | 5.16 | 7.99 | 3.49 | 3.42 | 3.41 | 3.06 | 2.87 | 2.81 | 2.69 |
Results on the Lu data set.
Method | Bicubic | DMSG [44] | DG [39] | DJF [45] | DJFR [42] | PAC [43] | JIIF [25] | DKN [41] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.42 | 2.30 | 2.06 | 1.65 | 1.15 | 1.20 | 0.85 | 0.96 | 0.82 |
×8 | 4.54 | 4.17 | 4.19 | 3.96 | 3.57 | 2.33 | 1.73 | 2.16 | 1.68 |
×16 | 7.38 | 7.22 | 6.90 | 6.75 | 6.77 | 5.19 | 4.16 | 5.11 | 4.14 |
Results on the “real-world manner” RGB-D-D data set.
Method | Bicubic | DJF [45] | DJFR [42] | FDKN [41] | DKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
---|---|---|---|---|---|---|---|---|---|
“real-world manner” | 9.15 | 7.90 | 8.01 | 7.50 | 7.38 | 7.50 | 8.41 | 7.17 | 7.01 |
Results on the Middlebury data set.
Method | Bicubic | PAC [43] | DKN [41] | FDKN [41] | CUNet [46] | JIIF [25] | SUFT [24] | FDSR [23] | IGAF
---|---|---|---|---|---|---|---|---|---|
×4 | 2.28 | 1.32 | 1.23 | 1.08 | 1.10 | 1.09 | 1.20 | 1.13 | 1.01 |
×8 | 3.98 | 2.62 | 2.12 | 2.17 | 2.17 | 1.82 | 1.76 | 2.08 | 1.73
×16 | 6.37 | 4.58 | 4.24 | 4.50 | 4.33 | 3.31 | 3.29 | 4.39 | 3.24 |
Demonstrating the importance of the proposed fusion module.
Fusion Method | Addition | Concatenation | IGAF |
---|---|---|---|
×4 | 1.23 | 1.22 | 1.12 |
Ablation results on the NYU v2 data set.
Test | Relocated Skip | Extra IGAF | Without MLP | One-Layer MLP | Without WF | Full Model
---|---|---|---|---|---|---|
×4 | 1.14 | 1.14 | 1.17 | 1.15 | 1.14 | 1.12 |
References
1. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. Proceedings of the Robotics Research: The 15th International Symposium ISRR; Flagstaff, AZ, USA, 28 August–1 September 2011; Springer: Cham, Switzerland, 2017; pp. 235-252.
2. Stowers, J.; Hayes, M.; Bainbridge-Smith, A. Altitude control of a quadrotor helicopter using depth map from Microsoft Kinect sensor. Proceedings of the 2011 IEEE International Conference on Mechatronics; Istanbul, Turkey, 13–15 April 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 358-362.
3. Melchiorre, M.; Scimmi, L.S.; Pastorelli, S.P.; Mauro, S. Collison avoidance using point cloud data fusion from multiple depth sensors: A practical approach. Proceedings of the 2019 23rd International Conference on Mechatronics Technology (ICMT); Salerno, Italy, 23–26 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1-6.
4. Gao, X.; Uchiyama, Y.; Zhou, X.; Hara, T.; Asano, T.; Fujita, H. A fast and fully automatic method for cerebrovascular segmentation on time-of-flight (TOF) MRA image. J. Digit. Imaging; 2011; 24, pp. 609-625. [DOI: https://dx.doi.org/10.1007/s10278-010-9326-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20824304]
5. Penne, J.; Höller, K.; Stürmer, M.; Schrauder, T.; Schneider, A.; Engelbrecht, R.; Feußner, H.; Schmauss, B.; Hornegger, J. Time-of-flight 3-D endoscopy. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Marrakesh, Morocco, 7–11 October 2009; Springer: Cham, Switzerland, 2009; pp. 467-474.
6. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Ji, X. Guided depth map super-resolution: A survey. ACM Comput. Surv.; 2023; 55, pp. 1-36. [DOI: https://dx.doi.org/10.1145/3584860]
7. Yang, Q.; Yang, R.; Davis, J.; Nistér, D. Spatial-depth super resolution for range images. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
8. Riemens, A.; Gangwal, O.; Barenbrug, B.; Berretty, R.P. Multistep joint bilateral depth upsampling. Proceedings of the Visual Communications and Image Processing; San Jose, CA, USA, 20–22 January 2009; SPIE: Bellingham, WA, USA, 2009; Volume 7257, pp. 192-203.
9. Liu, M.Y.; Tuzel, O.; Taguchi, Y. Joint geodesic upsampling of depth images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Portland, OR, USA, 23–28 June 2013; pp. 169-176.
10. Lo, K.H.; Wang, Y.C.F.; Hua, K.L. Edge-preserving depth map upsampling by joint trilateral filter. IEEE Trans. Cybern.; 2017; 48, pp. 371-384. [DOI: https://dx.doi.org/10.1109/TCYB.2016.2637661] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28129196]
11. Sun, Z.; Han, B.; Li, J.; Zhang, J.; Gao, X. Weighted guided image filtering with steering kernel. IEEE Trans. Image Process.; 2019; 29, pp. 500-508. [DOI: https://dx.doi.org/10.1109/TIP.2019.2928631] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31329117]
12. Qiao, Y.; Jiao, L.; Li, W.; Richardt, C.; Cosker, D. Fast, High-Quality Hierarchical Depth-Map Super-Resolution. Proceedings of the 29th ACM International Conference on Multimedia; Virtual Event, 20–24 October 2021; pp. 4444-4453.
13. Diebel, J.; Thrun, S. An application of markov random fields to range sensing. Adv. Neural Inf. Process. Syst.; 2005; 18, pp. 291-298.
14. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image guided depth upsampling using anisotropic total generalized variation. Proceedings of the IEEE International Conference on Computer Vision; Sydney, Australia, 1–8 December 2013; pp. 993-1000.
15. Newcombe, R.A.; Fox, D.; Seitz, S.M. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 343-352.
16. Park, J.; Kim, H.; Tai, Y.W.; Brown, M.S.; Kweon, I.S. High-quality depth map upsampling and completion for RGB-D cameras. IEEE Trans. Image Process.; 2014; 23, pp. 5559-5572. [DOI: https://dx.doi.org/10.1109/TIP.2014.2361034] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25291793]
17. Riegler, G.; Rüther, M.; Bischof, H. Atgv-net: Accurate depth super-resolution. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14; Springer: Cham, Switzerland, 2016; pp. 268-284.
18. Jiang, Z.; Yue, H.; Lai, Y.K.; Yang, J.; Hou, Y.; Hou, C. Deep edge map guided depth super resolution. Signal Process. Image Commun.; 2021; 90, 116040. [DOI: https://dx.doi.org/10.1016/j.image.2020.116040]
19. Song, X.; Dai, Y.; Qin, X. Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network. Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision; Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part IV 13 Springer: Cham, Switzerland, 2017; pp. 360-376.
20. Song, X.; Dai, Y.; Zhou, D.; Liu, L.; Li, W.; Li, H.; Yang, R. Channel attention based iterative residual learning for depth map super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 14–19 June 2020; pp. 5631-5640.
21. Ye, X.; Sun, B.; Wang, Z.; Yang, J.; Xu, R.; Li, H.; Li, B. Depth super-resolution via deep controllable slicing network. Proceedings of the 28th ACM International Conference on Multimedia; Virtual Event, 12–16 October 2020; pp. 1809-1818.
22. Huang, L.; Zhang, J.; Zuo, Y.; Wu, Q. Pyramid-structured depth map super-resolution based on deep dense-residual network. IEEE Signal Process. Lett.; 2019; 26, pp. 1723-1727. [DOI: https://dx.doi.org/10.1109/LSP.2019.2944646]
23. He, L.; Zhu, H.; Li, F.; Bai, H.; Cong, R.; Zhang, C.; Lin, C.; Liu, M.; Zhao, Y. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Virtual, 19–25 June 2021; pp. 9229-9238.
24. Shi, W.; Ye, M.; Du, B. Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution. Proceedings of the 30th ACM International Conference on Multimedia; Lisboa, Portugal, 10–14 October 2022; pp. 3867-3876.
25. Tang, J.; Chen, X.; Zeng, G. Joint implicit image function for guided depth super-resolution. Proceedings of the 29th ACM International Conference on Multimedia; Virtual, 20–24 October 2021; pp. 4390-4399.
26. Sun, B.; Ye, X.; Li, B.; Li, H.; Wang, Z.; Xu, R. Learning scene structure guidance via cross-task knowledge transfer for single depth super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Virtual, 19–25 June 2021; pp. 7792-7801.
27. Tang, Q.; Cong, R.; Sheng, R.; He, L.; Zhang, D.; Zhao, Y.; Kwong, S. Bridgenet: A joint learning network of depth map super-resolution and monocular depth estimation. Proceedings of the 29th ACM International Conference on Multimedia; Virtual, 20–24 October 2021; pp. 2148-2157.
28. Zhao, Z.; Zhang, J.; Xu, S.; Lin, Z.; Pfister, H. Discrete cosine transform network for guided depth map super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 5697-5707.
29. Zhao, Z.; Zhang, J.; Gu, X.; Tan, C.; Xu, S.; Zhang, Y.; Timofte, R.; Van Gool, L. Spherical space feature decomposition for guided depth map super-resolution. arXiv; 2023; arXiv: 2303.08942
30. Xu, D.; Fan, X.; Gao, W. Multiscale Attention Fusion for Depth Map Super-Resolution Generative Adversarial Networks. Entropy; 2023; 25, 836. [DOI: https://dx.doi.org/10.3390/e25060836]
31. Wang, J.; Huang, Q. Depth Map Super-Resolution Reconstruction Based on Multi-Channel Progressive Attention Fusion Network. Appl. Sci.; 2023; 13, 8270. [DOI: https://dx.doi.org/10.3390/app13148270]
32. Song, X.; Zhou, D.; Li, W.; Dai, Y.; Liu, L.; Li, H.; Yang, R.; Zhang, L. WAFP-Net: Weighted Attention Fusion Based Progressive Residual Learning for Depth Map Super-Resolution. IEEE Trans. Multimed.; 2021; 24, pp. 4113-4127. [DOI: https://dx.doi.org/10.1109/TMM.2021.3118282]
33. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Chen, Z.; Ji, X. High-resolution depth maps imaging via attention-based hierarchical multi-modal fusion. IEEE Trans. Image Process.; 2021; 31, pp. 648-663. [DOI: https://dx.doi.org/10.1109/TIP.2021.3131041] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34878976]
34. Tragakis, A.; Kaul, C.; Murray-Smith, R.; Husmeier, D. The fully convolutional transformer for medical image segmentation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Waikoloa, HI, USA, 2–7 January 2023; pp. 3660-3669.
35. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision; Florence, Italy, 7–13 October 2012; Proceedings, Part V 12 Springer: Cham, Switzerland, 2012; pp. 746-760.
36. Hirschmuller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
37. Scharstein, D.; Pal, C. Learning conditional random fields for stereo. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition; Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1-8.
38. Lu, S.; Ren, X.; Liu, F. Depth enhancement via low-rank matrix completion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA, 23–28 June 2014; pp. 3390-3397.
39. Gu, S.; Zuo, W.; Guo, S.; Chen, Y.; Chen, C.; Zhang, L. Learning dynamic guidance for depth image enhancement. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 3769-3778.
40. Pan, J.; Dong, J.; Ren, J.S.; Lin, L.; Tang, J.; Yang, M.H. Spatially variant linear representation models for joint filtering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 1702-1711.
41. Kim, B.; Ponce, J.; Ham, B. Deformable kernel networks for guided depth map upsampling. arXiv; 2019; arXiv: 1903.11286
42. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 41, pp. 1909-1923. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2890623] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30605094]
43. Su, H.; Jampani, V.; Sun, D.; Gallo, O.; Learned-Miller, E.; Kautz, J. Pixel-adaptive convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 11166-11175.
44. Hui, T.W.; Loy, C.C.; Tang, X. Depth map super-resolution by deep multi-scale guidance. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14 Springer: Cham, Switzerland, 2016; pp. 353-369.
45. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep joint image filtering. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14 Springer: Cham, Switzerland, 2016; pp. 154-169.
46. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 43, pp. 3333-3348. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.2984244] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32248098]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental guided attention fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for ×4, ×8, and ×16 upsampling.
Details


1 School of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, UK;
2 School of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK;