Abstract
Neural radiance fields (NeRF), which encode a scene into a neural representation, have demonstrated impressive novel view synthesis quality on single objects and small regions of space. However, when faced with urban outdoor environments, NeRF is limited by the capacity of a single MLP and by insufficient input views, leading to incorrect geometry that hinders the production of realistic renderings. In this paper, we present MVSRegNeRF, an extension of neural radiance fields targeted at large-scale autonomous driving scenarios. We employ a traditional PatchMatch-based multi-view stereo (MVS) method to generate dense depth maps, which we use to regularize the geometry optimization of NeRF. We also integrate multi-resolution hash encodings into our neural scene representation to accelerate training. Thanks to the relatively precise geometry constraints of our approach, we achieve high-quality novel view synthesis on real-world large-scale street scenes. Our experiments on the KITTI-360 dataset demonstrate that MVSRegNeRF outperforms state-of-the-art methods on the novel view appearance synthesis task.
Introduction
Inferring the 3D structure of a scene from 2D information is one of the core tasks of computer vision. This task has a wide range of applications, including gaming, film, surveying and mapping, positioning, navigation, autonomous driving, virtual/augmented reality, and industrial manufacturing. Novel view synthesis is particularly intriguing as it enables the creation of a high-fidelity digital 3D world using only a readily available and inexpensive mobile device.
Recently, with the success of deep learning, coordinate-based neural networks have become a powerful tool for representing a scene as an implicit function. Advances in neural rendering such as Neural Radiance Fields (NeRF) have made it possible to generate highly realistic images of a scene from any viewpoint given a set of posed images [1]. NeRF has demonstrated remarkable dominance in the field of photorealistic novel view synthesis by encoding the volumetric density and radiance of a scene within the weights of a coordinate-based multilayer perceptron (MLP), which can be optimized with differentiable volume rendering using 2D images as supervision. Prior research has focused mainly on reconstructing single objects or small bounded indoor scenes from a large number of images captured from multiple perspectives. However, when faced with real-world autonomous driving scenarios, NeRF often struggles to produce high-quality results due to limited MLP capacity and insufficient viewpoint coverage. This can result in blurry renderings or artifacts in the synthesized views.
Fig. 1 [Images not available. See PDF.]
Images of color and depth rendered by NeRF of training views. Although NeRF can forcibly fit the color values of the training images, the depth map still has many artifacts, harming the performance of novel view synthesis
The 3D reconstruction of large-scale scenes such as street views is of significant importance in the field of autonomous driving and can support a variety of downstream tasks, including closed-loop robotic simulation [2], semantic segmentation generation, and space occupancy analysis [3]. At the same time, novel view synthesis of street scenes has the potential to enable realistic scene roaming in virtual reality and has promising applications in fields such as maps and video games. Compared with single objects or small-scale indoor scenes, the reconstruction of large-scale street scenes poses greater challenges for NeRF. Street scenes often feature a large number of objects with complex geometry and color, which increases the difficulty of NeRF optimization during training. Additionally, autonomous driving datasets (e.g., KITTI-360 [4]) usually rely on only a few forward-facing perspective cameras, so the camera viewpoints do not converge on a central region of the scene and most parts of the scene are observed by only a few cameras. Since simultaneously reconstructing the geometry and color of a scene from 2D images alone is an ill-posed problem, NeRF can forcibly fit the color values of the training images when viewpoint coverage is insufficient, but it cannot recover the correct scene geometry, as shown in Fig. 1. As a result, NeRF produces incorrect renderings of novel views.
Recently, several methods have improved the performance of NeRF in few-shot scenarios where only a limited number of views are available. Incorporating geometric priors such as depth into NeRF has proven effective in previous work. Specifically, depth-supervised NeRF [5] exploits sparse 3D points generated by Structure-from-Motion (SfM) to regularize the geometry learned by NeRF and enables better novel view synthesis results on the DTU and LLFF datasets. Dense depth priors NeRF [6] further proposes to apply depth completion to the sparse SfM depth, generating dense depth maps alongside a per-pixel uncertainty estimate of those depths, thereby enabling data-efficient novel view synthesis on challenging indoor scenes. Meanwhile, urban radiance fields [7] leverage sparse LiDAR information to improve performance on street view data.
Although the methods mentioned above have improved the performance of NeRF in certain scenarios, they still face many limitations in autonomous driving scenarios. The SfM used by depth-supervised NeRF can only generate sparse point clouds; its advantage lies in providing effective supervision under extremely sparse view settings (e.g., 3 views), especially in single-object scenes. However, the street scenes in autonomous driving are large and contain many low-texture areas, resulting in very sparse SfM reconstructions and therefore limited improvement in rendering quality. Although dense depth priors NeRF can provide dense supervision on indoor datasets such as ScanNet, its depth completion network requires expensive pre-training on large-scale RGB-D datasets. When moving from indoor to outdoor scenes, the depth completion network pre-trained on ScanNet suffers from a severe domain gap, producing completely inaccurate depth predictions that may even harm NeRF training. Re-training the depth completion network on autonomous driving scenarios might solve this problem, but many current autonomous driving datasets lack ground-truth depth, and such training is usually costly. Urban radiance fields require expensive LiDAR sensors to provide additional point cloud data, limiting their practicality, whereas this paper focuses on using only RGB images without auxiliary data.
To address these issues, we aim to explore a better geometric prior that performs well in autonomous driving scenarios. In this paper, we introduce MVSRegNeRF (multi-view stereo-regulated NeRF), an add-on method that can be incorporated into any NeRF variant (e.g., InstantNGP [8]) to constrain the geometry optimization using relatively accurate dense depth maps obtained by traditional multi-view stereo methods, which have shown advantages when dealing with large-scale scenes. Since the multiple cameras in an autonomous driving setup are typically placed at fixed relative positions, the MVS method can provide more reliable depth estimates than NeRF under these settings. Compared with other baseline methods that also adopt depth priors, our proposed MVS depth priors require neither pre-training nor additional LiDAR sensor data, yet they achieve superior results. Extensive experiments demonstrate that MVSRegNeRF enables NeRF to reconstruct the correct geometry on the KITTI-360 dataset, thereby improving the rendering of novel views. Our method achieves state-of-the-art quality on the KITTI-360 novel view appearance synthesis leaderboard.
Related work
The most relevant related work can be divided into four categories:
Neural radiance fields (NeRF),
NeRF for large-scale scenes,
NeRF with geometric priors,
Multi-view stereo (MVS).
Neural radiance fields (NeRF)
NeRF represents a scene using a multi-layer perceptron (MLP) that maps positions and ray directions to volumetric densities and radiances. This representation is trained with a loss that measures photometric consistency against a set of RGB images with known poses. After training, NeRF can synthesize arbitrary novel views with volumetric rendering. NeRF has gained significant attention in recent years due to its state-of-the-art performance in novel view synthesis and has inspired many subsequent studies [9–11]. The original NeRF model mainly focuses on single objects or small-scale scenes and requires the training images to cover the scene from enough perspectives; otherwise, the reconstruction may arrive at degenerate representations. Another drawback of NeRF is the long training time. Several works have explored training NeRF with limited input images [12–15], accelerating the training of NeRF [16–20], improving the robustness and quality of NeRF [21, 22], and extending NeRF to SDF-based high-precision surface reconstruction [23, 24]. NeRF has become one of the most compelling research directions in 3D vision and has many practical applications.
NeRF for large-scale scenes
Although most NeRF-related work targets indoor or bounded areas, several works study NeRF in outdoor scenarios. NeRF++ [25] investigates the parameterization of unbounded scenes; we take the same approach to model the sky. NeRF-W [11] introduces several extensions to NeRF and successfully models inconsistent appearance variations with an additional appearance encoding. Block-NeRF [26] recreates the entire city of San Francisco by dividing the scene into individual NeRF blocks; however, its dataset uses 12 cameras to provide sufficient camera coverage, whereas KITTI-360 uses only two forward-facing cameras. Urban radiance fields [7] also investigate neural scene representations for street views, but they use a LiDAR sensor to provide 3D point cloud supervision, while we use only 2D RGB images. Panoptic neural fields (PNF) [27] decompose a scene into a set of objects (things) and background (stuff) to model real-world dynamic scenes; this approach surpasses the original NeRF and mip-NeRF on the novel view synthesis task of the KITTI-360 dataset. Panoptic NeRF [28] leverages ground-truth 3D bounding primitives to train a semantic NeRF on the KITTI-360 dataset and mainly focuses on 3D-to-2D semantic label transfer rather than novel view synthesis. Compared to these methods, our MVSRegNeRF achieves state-of-the-art quality on both novel view appearance synthesis tasks (50% and 90% drop rate) of KITTI-360.
NeRF with geometric priors
A few recent works have tried to incorporate geometric priors such as depth information into the training of neural implicit representations. NerfingMVS [29] trains a monocular depth prediction network to guide NeRF sampling. Dense depth priors NeRF [6] employs depth completion on the SfM depth and additionally uses a depth loss to supervise the geometry recovered by NeRF. These two methods require a large RGB-D dataset (e.g., ScanNet) to pre-train a depth prediction network, while our method does not require any pre-training. Depth-supervised NeRF [5] directly uses sparse depth information from SfM in the NeRF optimization. This method performs well on few-shot LLFF data, but sparse depth cannot provide sufficient geometric constraints when migrated to large-scale scenes.
Apart from novel view synthesis, existing works show that geometric priors are also helpful for improving neural surface reconstruction. Geo-NeuS [30] uses sparse point clouds from SfM to directly supervise the signed distance function (SDF). NeuRIS [31] demonstrates the power of monocular normal priors with off-the-shelf normal predictors on the reconstruction of indoor scenes. MonoSDF [32] uses monocular depth and normal priors simultaneously to improve the reconstruction quality of indoor scenes.
Fig. 2 [Images not available. See PDF.]
Overview of our optimization pipeline. We run multi-view stereo on the training images to estimate dense depth maps, which are used to guide the optimization of the radiance field. More specifically, for a single ray generated from an input view, a fixed number of points are sampled along the ray. We feed the coordinates $\mathbf{x}$ of these points, along with the viewing direction $\mathbf{d}$, into the radiance field MLP to produce the color (omitted here for simplicity) and the volume density $\sigma$. Then we simultaneously generate the pixel color and the ray's expected termination depth by volume rendering. Since this rendering process is differentiable, the radiance field MLP can be optimized with color and depth supervision
Multi-view stereo (MVS)
Multi-view stereo (MVS) is a classical method for recovering a dense scene representation from multi-view images. These methods usually exploit inter-image photometric consistency to estimate dense depth maps for each view. After that, all the depth maps are integrated into dense point clouds with photometric and geometric consistency checks used to filter out depth outliers. Finally, surface reconstruction methods such as Delaunay triangulation [33] and Poisson surface reconstruction [34] are used to obtain the surface in the form of triangular mesh.
In the field of 3D surface reconstruction, MVS methods have received less attention than recent neural implicit surface reconstruction methods such as NeuS [23], because the MVS pipeline is complex and prone to accumulating errors across its multiple stages, resulting in artifacts and missing parts in the reconstruction. However, NeuS is even harder to optimize than NeRF, requires dense views, and performs well only on single-object scenes. In situations like autonomous driving, where viewpoint coverage is insufficient, NeuS cannot recover reasonable scene geometry, whereas MVS methods can still estimate an incomplete but relatively accurate depth map for each view.
Classical MVS methods such as COLMAP MVS typically leverage pair-wise matching or patch matching to estimate the depth and normal of each pixel. In recent years, supervised multi-view stereo methods have flourished thanks to the swift advancement of deep learning and have shown impressive results across various benchmarks. MVSNet [35], a trailblazing end-to-end deep learning MVS architecture, adopts the concept of the cost volume from stereo matching: it warps the features of the source images into the reference view using differentiable homography warping and then employs 3D convolutional neural networks (CNNs) to regularize the cost volume and predict the depth map of the reference view. Follow-up works further improve memory efficiency and reconstruction accuracy. However, these learning-based approaches require expensive pre-training on datasets with 3D ground truth, which is difficult to acquire; they are therefore only effective on specific datasets and cannot be easily migrated to others. In this work, we thus choose a traditional PatchMatch-based MVS method to provide readily available geometric priors.
Method
To make NeRF model the geometry of large-scale street scenes more accurately during training, we use depth maps obtained from a traditional PatchMatch-based multi-view stereo framework to constrain the training process of NeRF. First, we provide a brief overview of NeRF. Then, we describe how we add a depth constraint from the MVS method to NeRF via a depth-supervised loss.
Fig. 3 [Images not available. See PDF.]
A visualization of multi-layer perceptron used in NeRF. All layers are standard fully-connected layers. The positional encoding of the location and viewing direction are passed to the neural network as shown, and then, the volume density and RGB value are predicted. Note that the volume density depends only on the 3D coordinates
Neural radiance fields revisited
A radiance field is a continuous function $f$ that maps a 3D coordinate $\mathbf{x} \in \mathbb{R}^3$ and a viewing direction $\mathbf{d} \in \mathbb{S}^2$ to a volume density $\sigma$ and a color value $\mathbf{c}$. NeRF parameterizes this function using a multi-layer perceptron (MLP) as shown in Fig. 3. Formally,
$$(\sigma, \mathbf{c}) = F_{\Theta}\big(\gamma(\mathbf{x}), \gamma(\mathbf{d})\big), \quad (1)$$
where $F_{\Theta}$ denotes an MLP with parameters $\Theta$ used to parameterize the mapping function, and $\gamma$ is the predefined positional encoding applied to $\mathbf{x}$ and $\mathbf{d}$ that maps the inputs to a higher-dimensional space:
$$\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big), \quad (2)$$
where the hyper-parameter $L$ indicates the highest frequency used in the mapping and can be used to control the smoothness of the scene representation. The principle behind this is that deep networks are biased toward learning lower-frequency functions [36]. It has been shown that mapping the inputs to a higher-dimensional space with such a Fourier positional encoding enables the network to learn high-frequency details, which is essential for rendering photo-realistic images and reconstructing complex geometry.
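As a concrete illustration of Eq. (2), the following NumPy sketch (our own code, not the authors' implementation) encodes an input with $L$ frequency bands; the example value $L = 10$ and the function name are assumptions for the example, and the ordering of the sine/cosine terms is immaterial.

```python
import numpy as np

def positional_encoding(p, L):
    """Fourier positional encoding gamma(p) of Eq. (2).

    p: array of shape (..., D) holding coordinate or direction components,
       assumed to be normalized to roughly [-1, 1].
    L: number of frequency bands (the highest frequency is 2^(L-1)).
    Returns an array of shape (..., 2 * L * D).
    """
    freqs = 2.0 ** np.arange(L) * np.pi           # 2^0 * pi, ..., 2^(L-1) * pi
    angles = p[..., None] * freqs                 # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

# Example: encode a 3D point with L = 10 (NeRF's default for positions)
x = np.array([0.1, -0.4, 0.7])
print(positional_encoding(x, L=10).shape)  # (60,)
```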
Given a neural radiance field, a pixel is rendered by casting a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ from the camera projection center $\mathbf{o}$ through the pixel along direction $\mathbf{d}$. Then, a sampling strategy is used to sample a fixed number of 3D points on this ray. We estimate the color and volume density at each sample point by feeding the 3D coordinate and viewing direction into the radiance field MLP $F_{\Theta}$. Finally, the color value of this pixel is calculated using alpha compositing for given near and far bounds $t_n$ and $t_f$:
$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}\big(\mathbf{r}(t), \mathbf{d}\big)\,\mathrm{d}t, \qquad T(t) = \exp\!\Big(-\!\int_{t_n}^{t} \sigma\big(\mathbf{r}(s)\big)\,\mathrm{d}s\Big), \quad (3)$$
where $t$ is the sampling distance from the ray origin and $T(t)$ denotes the accumulated transmittance along the ray. Here, $\sigma$ and $\mathbf{c}$ indicate the density and color predicted by $F_{\Theta}$, respectively. In practice, NeRF approximates these integrals using quadrature:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\alpha_i\,\mathbf{c}_i, \qquad \alpha_i = 1 - \exp(-\sigma_i\,\delta_i), \qquad T_i = \prod_{j=1}^{i-1}\big(1 - \alpha_j\big), \quad (4)$$
where $\alpha_i$ is the opacity of the $i$-th ray segment, $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples, and $T_i$ is the accumulated transmittance. Equation (3) is the key component of NeRF known as the volume rendering equation [37], which establishes a bridge between 2D and 3D and makes it possible to optimize a 3D representation using 2D images as supervision. During training, the parameters $\Theta$ can be optimized by minimizing the residual between synthesized and ground-truth observed images, typically using the mean-squared error (MSE):
$$\mathcal{L}_{\mathrm{color}} = \sum_{\mathbf{r}} \big\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2. \quad (5)$$
Here, $C(\mathbf{r})$ is the ground-truth pixel color of a single ray sampled from the input images. With this 2D supervision, the volume density and color of the scene are optimized to explain the multi-view training images. After convergence, high-fidelity images can be rendered from arbitrary novel views with the same volume rendering strategy.
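For clarity, the PyTorch sketch below (an illustration under assumed tensor shapes and function names, not the authors' implementation) implements the quadrature of Eq. (4) and the photometric loss of Eq. (5).

```python
import torch

def render_rays(sigmas, colors, t_vals):
    """Discretized volume rendering of Eq. (4).

    sigmas: (num_rays, num_samples)    densities predicted by the MLP
    colors: (num_rays, num_samples, 3) radiance predicted by the MLP
    t_vals: (num_rays, num_samples)    sample distances along each ray
    Returns the composited pixel colors (num_rays, 3) and the per-sample
    weights T_i * alpha_i, which are reused for depth rendering.
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                       # delta_i
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alphas = 1.0 - torch.exp(-sigmas * deltas)                    # alpha_i
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j), exclusive cumprod
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alphas                                      # T_i * alpha_i
    rgb = (weights[..., None] * colors).sum(dim=1)                # Eq. (4)
    return rgb, weights

def color_loss(rgb_pred, rgb_gt):
    """Mean-squared photometric loss corresponding to Eq. (5)."""
    return ((rgb_pred - rgb_gt) ** 2).sum(dim=-1).mean()
```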
Depth from volume rendering of NeRF
Equation (3) defines how a pixel's color is rendered from NeRF. Previous work demonstrates that the above volume rendering scheme can generate not only pixel colors but also geometric properties such as depth. Specifically, the expected depth is calculated as:
$$\hat{D}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\alpha_i\,t_i. \quad (6)$$
The intuitive explanation of Eq. (6) is that $w_i = T_i\,\alpha_i$ can be viewed as the normalized weight of the $i$-th sampling point on the ray. Ideally, the weight of the sampling point on the surface of the object should be 1 and the weights of the other points should be 0; in this case, the weighted sum of the sampling distances $t_i$ is exactly the depth of the pixel from which the ray is emitted. When the volume density of NeRF is incorrect, the rendered depth map is obviously also inaccurate. The idea of our method is that if a relatively accurate depth map is used to supervise the depth obtained by volume rendering, back-propagation can correct the volume density of NeRF, thereby improving the quality of novel view synthesis.
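Continuing the sketch above, the expected depth of Eq. (6) reuses the per-sample weights returned by `render_rays`; the function below is again our own illustration rather than the paper's code.

```python
def render_depth(weights, t_vals):
    """Expected ray termination depth of Eq. (6): sum_i T_i * alpha_i * t_i.

    weights: (num_rays, num_samples) the T_i * alpha_i terms from render_rays.
    t_vals:  (num_rays, num_samples) sample distances along each ray.
    Ideally the weights peak at the surface, so the weighted sum of the
    distances approximates the pixel's depth.
    """
    return (weights * t_vals).sum(dim=1)
```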
Depth from multi-view stereo
Traditional MVS methods use the photometric consistency between multiple views to estimate depth maps at an accurate scale. Specifically, for a reference image $X^{\mathrm{ref}}$ and a set of source images $\{X^m\}_{m=1}^{M}$, MVS methods predict the depth $d_l$ and normal $\mathbf{n}_l$ for each pixel in $X^{\mathrm{ref}}$, where $l$ denotes the pixel index. Concurrently, since a pixel of the reference image may be occluded by other objects and hence invisible in a source image, an occlusion indicator variable $Z_l^m \in \{0, 1\}$ must be computed for each pixel of the reference image with respect to every source image: $Z_l^m = 0$ signifies that pixel $l$ is invisible in source image $X^m$, whereas $Z_l^m = 1$ indicates that the pixel is visible. The joint estimation of $d_l$, $\mathbf{n}_l$ and $Z_l^m$ based on photometric and geometric consistencies between matched patches in $X^{\mathrm{ref}}$ and $\{X^m\}$ can then be formulated as a probabilistic graphical model, typically solved with the generalized expectation-maximization (GEM) algorithm. The optimization problem can be defined as follows:
$$P\big(Z, d, \mathbf{n} \mid X\big) \;\propto\; \prod_{l=1}^{L}\prod_{m=1}^{M} P\big(Z_l^m \mid Z_{l-1}^m, Z_{l+1}^m\big)\, P\big(X_l^m \mid Z_l^m, d_l, \mathbf{n}_l\big), \quad (7)$$
$$P\big(X_l^m \mid Z_l^m = 1, d_l, \mathbf{n}_l\big) \;\propto\; \exp\!\Big(-\tfrac{\big(1 - \rho_l^m(d_l, \mathbf{n}_l)\big)^2}{2\sigma_\rho^2}\Big)\,\exp\!\Big(-\eta\,\min\big(\psi_l^m(d_l, \mathbf{n}_l),\, \psi_{\max}\big)\Big), \quad (8)$$
where $\rho_l^m$ is the photometric consistency between the reference image and the source image, $\psi_l^m$ is the forward-backward reprojection error used to evaluate geometric consistency, and $\sigma_\rho$ and $\eta$ are constants weighting the two terms. $\psi_l^m$ is capped by a predefined $\psi_{\max}$ to reduce the impact of occlusion. We will not delve into the specific details of the optimization algorithm here, as our primary objective is to leverage the results of multi-view stereo (MVS) to guide the training of neural radiance fields.
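To make the geometric consistency term concrete, the following NumPy sketch (an illustration only, not the MVS framework's actual implementation) computes the capped forward-backward reprojection error $\psi$ for one reference pixel, assuming known intrinsics `K_ref`, `K_src` and a relative pose `(R, t)` mapping reference-camera coordinates to source-camera coordinates.

```python
import numpy as np

def backproject(pixel, depth, K):
    """Lift a pixel (u, v) with depth d to a 3D point in camera coordinates."""
    u, v = pixel
    return np.linalg.inv(K) @ np.array([u, v, 1.0]) * depth

def project(point, K):
    """Project a 3D camera-space point to pixel coordinates (u, v)."""
    p = K @ point
    return p[:2] / p[2]

def reprojection_error(pixel, depth_ref, depth_src, K_ref, K_src, R, t, psi_max=2.0):
    """Capped forward-backward reprojection error psi of one reference pixel.

    depth_src is the source view's depth map (H x W); psi_max mirrors the
    occlusion-robust truncation described around Eq. (8).
    """
    # Forward: reference pixel -> 3D point -> source image
    X_ref = backproject(pixel, depth_ref, K_ref)
    X_src = R @ X_ref + t
    u, v = np.round(project(X_src, K_src)).astype(int)
    H, W = depth_src.shape
    if not (0 <= u < W and 0 <= v < H):
        return psi_max  # projects outside the source view
    # Backward: source pixel with its own depth -> 3D point -> reference image
    X_src2 = backproject((u, v), depth_src[v, u], K_src)
    X_ref2 = R.T @ (X_src2 - t)
    psi = np.linalg.norm(np.asarray(pixel, dtype=float) - project(X_ref2, K_ref))
    return min(psi, psi_max)
```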
Optimization with depth constraint
We find that the traditional MVS methods mentioned in Sect. 3.3 perform well on large-scale scene datasets such as KITTI-360. We therefore adopt COLMAP MVS [38], an open-source framework that jointly estimates depth and normal information with photometric and geometric priors, to compute a depth map $D_{\mathrm{mvs}}$ for each input RGB image as a dense depth prior. Formally, we apply a Huber loss to the per-ray depth $\hat{D}(\mathbf{r})$ computed by Eq. (6):
$$\mathcal{L}_{\mathrm{depth}}(\mathbf{r}) = \begin{cases} \tfrac{1}{2}\big(\hat{D}(\mathbf{r}) - D_{\mathrm{mvs}}(\mathbf{r})\big)^2, & \text{if } \big|\hat{D}(\mathbf{r}) - D_{\mathrm{mvs}}(\mathbf{r})\big| \le \delta, \\[4pt] \delta\,\big(\big|\hat{D}(\mathbf{r}) - D_{\mathrm{mvs}}(\mathbf{r})\big| - \tfrac{1}{2}\delta\big), & \text{otherwise.} \end{cases} \quad (9)$$
Here, $\delta$ is a hyperparameter specifying the threshold at which to switch between the delta-scaled L1 loss and the L2 loss.

Although MVS methods filter out obvious outliers in the post-processing stage through multi-view consistency checks, not every depth value in the predicted depth maps is reliable. To mitigate the impact of erroneous depth values, we only use depth values within the range $[d_{\min}, d_{\max}]$ as supervision, where $d_{\min}$ and $d_{\max}$ are predefined hyperparameters; the depth predictions for areas that are too close or too far are inaccurate. Concurrently, we propose a simple yet effective depth loss clamp strategy: we set all loss values greater than 0.5 to 0 and clamp all loss values greater than 0.1 to 0.1. Formally:
$$\mathcal{L}_{\mathrm{depth}}^{\mathrm{clamp}}(\mathbf{r}) = \begin{cases} 0, & \text{if } \mathcal{L}_{\mathrm{depth}}(\mathbf{r}) > 0.5, \\[4pt] \min\big(\mathcal{L}_{\mathrm{depth}}(\mathbf{r}),\, 0.1\big), & \text{otherwise.} \end{cases} \quad (10)$$
This strategy eliminates the impact of incorrect depth values while preserving the supervision of relatively accurate ones. We demonstrate the effectiveness of this strategy in our ablation experiments. During a training batch, we sample a fixed number of rays and compute the overall loss:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \Big( \big\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 + \lambda\, \mathcal{L}_{\mathrm{depth}}^{\mathrm{clamp}}(\mathbf{r}) \Big), \quad (11)$$
where $\mathcal{R}$ stands for the set of input rays and $\lambda$ is a hyperparameter balancing color and depth supervision.
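The following PyTorch sketch (our own illustration; the values of $\delta$, $d_{\min}$, $d_{\max}$ and $\lambda$ are assumptions, not the paper's exact settings) combines Eqs. (9)–(11).

```python
import torch
import torch.nn.functional as F

def clamped_depth_loss(depth_pred, depth_mvs, delta=0.1, d_min=1.0, d_max=80.0):
    """Huber depth loss of Eq. (9) with the range filter and the clamp of Eq. (10).

    depth_pred: (num_rays,) depths rendered via Eq. (6)
    depth_mvs:  (num_rays,) MVS depth priors for the same rays
    """
    # Only supervise rays whose MVS depth lies inside the trusted range [d_min, d_max]
    valid = ((depth_mvs > d_min) & (depth_mvs < d_max)).float()
    loss = F.huber_loss(depth_pred, depth_mvs, delta=delta, reduction="none")
    # Depth loss clamp: drop losses above 0.5, cap the remaining ones at 0.1
    loss = torch.where(loss > 0.5, torch.zeros_like(loss), loss)
    loss = torch.clamp(loss, max=0.1)
    return (loss * valid).sum() / valid.sum().clamp(min=1.0)

def total_loss(rgb_pred, rgb_gt, depth_pred, depth_mvs, lam=0.1):
    """Overall objective of Eq. (11); lam is an assumed balancing weight."""
    color = ((rgb_pred - rgb_gt) ** 2).sum(dim=-1).mean()  # photometric term, Eq. (5)
    return color + lam * clamped_depth_loss(depth_pred, depth_mvs)
```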
Fig. 4 [Images not available. See PDF.]
Crop of rendered RGB for test views. Our results have better visual quality in detail
Experiments and results
We evaluate our method on the KITTI-360 dataset by submitting the rendering results to the official novel view appearance synthesis leaderboard. We then perform qualitative and quantitative comparisons against state-of-the-art methods and present ablation experiments to demonstrate the effectiveness of the MVS depth priors. The online leaderboard results can be viewed at https://www.cvlibs.net/datasets/kitti-360/leaderboard_nvs.php?task=nvs_rgb.
Datasets
KITTI-360 is a large-scale dataset containing over 320k images and 100k laser scans over a driving distance of 73.7 km. It provides two novel view appearance synthesis tasks, with 50% and 90% drop rates, respectively. For the task with a 50% drop rate, 5 static scenes with a driving distance of 50 m are selected, and the interval between consecutive frames is 0.8 m. For the task with a 90% drop rate, 10 static scenes with a driving distance of 50 m are selected, and the interval between consecutive frames is 4.0 m, making this task harder due to extremely sparse input views. All methods on the evaluation table are ranked according to the peak signal-to-noise ratio (PSNR).
Implementation details
We adopt the same network architecture as InstantNGP, a variant of NeRF that uses multiresolution hash encoding to significantly accelerate training while yielding impressive quality on the novel view synthesis task. The aabb_scale parameter is set to 16 to ensure that the entire scene lies within the bounding box. The multi-resolution hash grids are set with 16 levels, a base resolution of 16, and a maximum resolution of 2048, the same as the default settings of InstantNGP. For each scene, we train the model for 35k iterations, which takes only about 5 min on a single NVIDIA RTX 3080 GPU.
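For reference, a tiny-cuda-nn-style encoding configuration matching these settings might look as follows; the key names follow the instant-ngp config format and are our assumption, not a file taken from the paper.

```python
import math

n_levels, base_resolution, max_resolution = 16, 16, 2048

hash_encoding = {
    "otype": "HashGrid",
    "n_levels": n_levels,
    "n_features_per_level": 2,          # assumed default
    "log2_hashmap_size": 19,            # assumed default
    "base_resolution": base_resolution,
    # Growth factor chosen so the finest level reaches max_resolution
    "per_level_scale": math.exp(
        math.log(max_resolution / base_resolution) / (n_levels - 1)
    ),
}
```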
Table 1. Quantitative results on KITTI-360 (50% drop rate)
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| PCL [4] | 12.81 | 0.576 | 0.549 |
| PBNR [39] | 19.91 | 0.811 | 0.191 |
| FVS [40] | 20.00 | 0.790 | 0.193 |
| NeRF [1] | 21.18 | 0.779 | 0.343 |
| mip-NeRF [9] | 21.54 | 0.778 | 0.365 |
| PNF [27] | 22.07 | 0.820 | 0.221 |
| Ours | 22.48 | 0.829 | 0.256 |
The bold values identify the best results in the table
Table 2. Quantitative results on KITTI-360 (90% drop rate)
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| NeRF | 15.74 | 0.648 | 0.590 |
| Ours | 17.20 | 0.702 | 0.424 |
The bold values identify the best results in the table
Fig. 5 [Images not available. See PDF.]
Rendered RGB for test views from KITTI-360. Our results have better visual quality in general
Fig. 6 [Images not available. See PDF.]
Rendered RGB for test views from KITTI-360 (90% drop rate). Our results have significantly better visual quality
Comparison on KITTI-360 official leaderboard
We compare our method with the following published methods:
Non-NeRF models: PCL [4], PBNR [39], and FVS [40].
NeRF-based models: NeRF [1], mip-NeRF [9], and PNF [27]. All results are taken from the KITTI-360 official leaderboard (50% drop rate).
Qualitative comparisons under the 50% drop rate setting are shown in Fig. 5. Our method is visually better than the other methods, with finer details, especially when dealing with complex geometry such as bushes and trees. Figure 4 shows a more detailed comparison, from which it is evident that our method produces higher-quality synthesized images.
The advantages of our approach become even more apparent in the more challenging 90% drop rate setting. As can be seen in Fig. 6, the images synthesized by NeRF are blurred and incorrect. Our MVSRegNeRF produces clearer and sharper images with more details, thanks to the more accurate geometry learned from robust depth priors.
More rendering results on the official test set are available on the leaderboard website given at the beginning of Sect. 4.
Ablation study
To evaluate the effectiveness of our proposed MVS depth priors and of the depth loss clamp strategy described in Sect. 3.4, we conduct ablation experiments. Due to the submission policy of the KITTI-360 website, we manually select 4 scenes from the KITTI-360 training data with an average distance of 0.8 m between frames (50% drop rate) for the ablation experiments, following Panoptic NeRF.
Since our method is based on InstantNGP, we evaluate three different settings: (1) InstantNGP with the default settings; (2) ours without the depth loss clamp; (3) ours: InstantNGP with MVS depth priors and the depth loss clamp.
As shown in Table 3, integrating depth priors significantly improves the rendering quality because it reduces the ambiguity caused by the lack of texture and allows more accurate scene geometry to be fitted. Omitting the depth loss clamp strategy leads to some performance degradation due to erroneous depth estimates, although the drop is not particularly large.
Figure 7 demonstrates how MVS improves the accuracy of the reconstructed scene geometry. Evidently, the depth maps rendered by our method are more reasonable and smoother, leading to better novel view synthesis quality.
Comparison with other depth priors
We also compare against depth supervision signals proposed by previous works. The experiments show that our MVS depth priors offer a significant advantage over existing alternatives. Specifically, we compare three different depth priors:
SGM-depth: dense depth map estimated by a classical binocular stereo matching method [43] from a rectified stereo image pair, proposed by Panoptic NeRF [28].
SfM-depth: sparse depth map obtained by running Structure-from-Motion on the training images, proposed by Depth-supervised NeRF [5].
Dense depth: dense depth map obtained by depth completion on the SfM depth with a pre-trained depth completion network, proposed by Dense depth priors NeRF [6].
It is observed that SGM may slightly enhance rendering quality compared to the baseline, but the improvement is very limited. The reason is that SGM relies on only two images for depth calculation and fails to effectively exploit multi-view information, which leads to inaccurate depth predictions.
SfM performs better than SGM but still falls short of our method. This is because SfM provides relatively sparse depth supervision, whereas our MVS depth offers dense depth supervision, leading to superior performance. SfM's advantage lies in its ability to provide relatively effective supervision even under extremely sparse view conditions (e.g., 3 views). In autonomous driving, although the number of viewpoints is also limited, it is sufficient for MVS methods to obtain reasonably accurate results; in such scenarios, our proposed MVS supervision is therefore more effective than SfM supervision. It can be inferred that as the number of viewpoints increases, the improvement brought by SfM supervision diminishes, while MVS supervision begins to demonstrate its superiority.
Table 3. Ablation studies of each component of our method over 4 scenes. (PSNR)
| Method | Scene1538 | Scene1728 | Scene1908 | Scene3353 | Mean |
|---|---|---|---|---|---|
| InstantNGP | 23.99 | 24.45 | 22.81 | 26.57 | 24.46 |
| Ours w/o depth loss clamp | 24.53 | 25.05 | 23.11 | 26.90 | 24.90 |
| Ours | 24.62 | 25.33 | 23.16 | 26.92 | 25.01 |
The bold values identify the best results in the table
Fig. 7 [Images not available. See PDF.]
Rendered color and depth comparisons with InstantNGP. Our results have higher visual quality, with more accurate and smoother depth maps, implying correct reconstruction of the scene geometry
As for the dense depth prior, using its pre-trained depth completion model leads to a significant performance decline. This is because the depth completion network is pre-trained on the indoor dataset ScanNet; when transferred to the outdoor dataset KITTI-360, the mismatch in scene scale and image resolution creates a severe domain gap, and the network fails to predict correct depth. As shown in Fig. 9, the completed depth maps are incorrect and damage the reconstruction when used for NeRF supervision.
Figure 8 shows a more detailed comparison of rendered color and depth with different depth priors. Obviously, our MVS depth supervision produces more accurate and smoother depth maps, suggesting that it has enhanced NeRF’s ability to learn geometry. This leads to better results in novel view synthesis.
Discussion
Why use traditional multi-view stereo?
Compared with deep learning-based monocular depth prediction networks, MVS uses multi-view constraints to achieve higher accuracy while avoiding the need for scale estimation.
Compared with SfM and SGM, the MVS method provides more accurate and denser depth supervision, thereby achieving better performance.
At the same time, the PatchMatch MVS method does not require any pre-training on large-scale RGB-D datasets, which makes it easier to use than supervised deep MVS methods and depth completion methods, especially on a brand-new dataset.
In general, our proposed MVS supervision approach strikes a good balance between usability and accuracy, enabling easy adaptation to any dataset and baseline.
Table 4. Quantitative comparison with other depth priors. (PSNR)
| Depth priors | Scene1538 | Scene1728 | Scene1908 | Scene3353 | Mean |
|---|---|---|---|---|---|
| RGB-Only | 23.99 | 24.45 | 22.81 | 26.57 | 24.46 |
| SGM | 23.79 | 24.60 | 23.04 | 26.75 | 24.55 |
| SfM | 24.53 | 25.00 | 23.20 | 26.79 | 24.88 |
| Dense depth | 22.83 | 20.79 | 19.13 | 25.37 | 22.03 |
| Ours | 24.62 | 25.33 | 23.16 | 26.92 | 25.01 |
Fig. 8 [Images not available. See PDF.]
Rendered color and depth comparisons with other depth priors. Our results have higher visual quality, with more accurate and smoother depth maps
Fig. 9 [Images not available. See PDF.]
Dense depth map predicted by depth completion network. The domain gap causes the depth completion network pre-trained on the ScanNet dataset to make incorrect predictions in outdoor scenes
Why adopt InstantNGP as the baseline model?
During our experiments, we found that InstantNGP not only speeds up the training process by several orders of magnitude but also achieves better reconstruction results than NeRF. The information of each point in space is stored in its surrounding hash grid cells, which reduces the burden on the MLP. At the same time, InstantNGP dynamically skips the empty parts of space during training, which is especially useful for sparse large-scale scenes.
Limitations
The performance of our model depends on the quality of the MVS depth maps, and MVS may fail in some scenarios; more effective filtering strategies for handling inaccurate depth priors are thus an important direction for future work. Our method can only reconstruct static scenes, which limits its use in real-time autonomous driving; fortunately, thanks to the development of dynamic NeRF, this issue may be resolved in the future. In addition, our method mainly explores MVS supervision in outdoor scenes, and its performance in other types of scenes (e.g., indoor, single object) is worth further study. It also has the potential to improve neural surface reconstruction methods such as NeuS, which is an interesting direction for future research.
Conclusion
In this work, we propose a PatchMatch-based multi-view stereo-guided optimization framework for neural radiance fields, which integrates depth priors into the training process of NeRF. This enables NeRF to exploit prior knowledge to reconstruct a large-scale scene with fine details. Our model is the first to use traditional MVS methods to directly supervise the training process of NeRF. Extensive experiments demonstrate that our method outperforms other NeRF methods with geometric priors in multiple aspects, achieving state-of-the-art performance on the novel view synthesis task of KITTI-360.
Author contributions
FB performed the experiments and wrote the main manuscript text. SX and RY performed verification of experiments and other research outputs. LM provided the financial support for the project leading to this publication. All authors reviewed the manuscript.
Declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV. (2020)
2. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on robot learning. PMLR; pp. 1–16 (2017)
3. Mescheder, L.M., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3D reconstruction in function space. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2018, 4455–4465 (2019)
4. Liao, Y., Xie, J., Geiger, A.: KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. Pattern Analysis and Machine Intelligence (PAMI) (2022)
5. Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: Fewer views and faster training for free. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 12872–12881 (2022)
6. Roessle, B., Barron, J.T., Mildenhall, B., Srinivasan, P.P., Nießner, M.: Dense depth priors for neural radiance fields from sparse input views. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 12882–12891 (2022)
7. Rematas, K., Liu, A., Srinivasan, P.P., Barron, J.T., Tagliasacchi, A., Funkhouser, T.A., et al.: Urban radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 12922–12932 (2022)
8. Müller, T; Evans, A; Schied, C; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG); 2022; 41, pp. 1-15. [DOI: https://dx.doi.org/10.1145/3528223.3530127]
9. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 5835–5844 (2021a)
10. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 5460–5469 (2021b)
11. Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: Neural radiance fields for unconstrained photo collections. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 7206–7215 (2021)
12. Jain, A., Tancik, M., Abbeel, P.: Putting nerf on a diet: Semantically consistent few-shot view synthesis. In: IEEE/CVF International Conference on Computer Vision (ICCV) 2021, 5865–5874 (2021)
13. Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., et al.: Mvsnerf: fast generalizable radiance field reconstruction from multi-view stereo. In: IEEE/CVF International Conference on Computer Vision (ICCV) 2021, 14104–14113 (2021)
14. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 4576–4585 (2021)
15. Rebain, D., Matthews, M.J., Yi, K.M., Lagun, D., Tagliasacchi, A.: Lolnerf: Learn from one look. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 1548–1557 (2022)
16. Reiser, C., Peng, S., Liao, Y., Geiger, A.: Kilonerf: Speeding up neural radiance fields with thousands of tiny MLPS. In: IEEE/CVF International Conference on Computer Vision (ICCV) 2021, 14315–14325 (2021)
17. Hu, T., Liu, S., Chen, Y., Shen, T., Jia, J.: Efficientnerf - efficient neural radiance fields. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, 12892–12901 (2022)
18. Sun, C., Sun, M., Chen, H.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: CVPR. (2022a)
19. Sun, C., Sun, M., Chen, H.T.: Improved direct voxel grid optimization for radiance fields reconstruction. ArXiv (2022b) arxiv:2206.05085
20. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In: European Conference on Computer Vision (ECCV). (2022)
21. Mildenhall, B., Hedman, P., Martin-Brualla, R., Srinivasan, P.P., Barron, J.T.: Nerf in the dark: high dynamic range view synthesis from noisy raw images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 16169–16178 (2022)
22. Ma, L., Li, X., Liao, J., Zhang, Q., Wang, X., Wang, J., et al.: Deblur-nerf: neural radiance fields from blurry images. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, 12851–12860 (2022)
23. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In: Neural Inf. Process. Syst. (2021)
24. Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021)
25. Zhang, K., Riegler, G., Snavely, N., Koltun, V.: Nerf++: Analyzing and improving neural radiance fields (2020). arXiv preprint arXiv:2010.07492
26. Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., et al.: Block-nerf: Scalable large scene neural view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8248–8258 (2022)
27. Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L.J., et al.: Panoptic neural fields: A semantic object-aware neural scene representation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, 12861–12871 (2022)
28. Fu, X., Zhang, S.W., Chen, T., Lu, Y., Zhu, L., Zhou, X., et al.: Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation. In: 2022 International Conference on 3D Vision (3DV), 1–11 (2022a)
29. Wei, Y., Liu, S., Rao, Y., Zhao, W., Lu, J., Zhou, J.: Nerfingmvs: guided optimization of neural radiance fields for indoor multi-view stereo. In: IEEE/CVF International Conference on Computer Vision (ICCV) 2021, 5590–5599 (2021)
30. Fu, Q; Xu, Q; Ong, YS; Tao, W. Geo-neus: geometry-consistent neural implicit surfaces learning for multi-view reconstruction. Adv. Neural. Inf. Process. Syst.; 2022; 35, pp. 3403-3416.
31. Wang, J.C., Wang, P., Long, X., Theobalt, C., Komura, T., Liu, L., et al.: Neuris: Neural reconstruction of indoor scenes using normal priors. In: European Conference on Computer Vision. (2022), https://api.semanticscholar.org/CorpusID:250088904
32. Yu, Z., Peng, S., Niemeyer, M., Sattler, T., Geiger, A.: Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. In: Koyejo, S, Mohamed, S, Agarwal, A, Belgrave, D, Cho, K, Oh, A, editors. Advances in Neural Information Processing Systems; vol. 35. Curran Associates, Inc. p. 25018–25032 (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/9f0b1220028dfa2ee82ca0a0e0fc52d1-Paper-Conference.pdf
33. Labatut, P., Pons, J.P., Keriven, R.: Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In: 2007 IEEE 11th International Conference on Computer Vision, 1–8 (2007) https://api.semanticscholar.org/CorpusID:940870
34. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the fourth Eurographics symposium on Geometry processing; vol. 7, p. 0 (2006)
35. Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. In: European Conference on Computer Vision. (2018), https://api.semanticscholar.org/CorpusID:4712004
36. Tancik, M., Srinivasan, P.P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., et al.: Fourier features let networks learn high frequency functions in low dimensional domains. ArXiv Arxiv:2006.10739. https://api.semanticscholar.org/CorpusID:219791950 (2020)
37. Levoy, M. Efficient ray tracing of volume data. ACM Trans. Graph.; 1990; 9, pp. 245-261. [DOI: https://dx.doi.org/10.1145/78964.78965]
38. Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV). (2016)
39. Kopanas, G; Philip, J; Leimkühler, T; Drettakis, G. Point-based neural rendering with per-view optimization. Comput. Graph. Forum; 2021; 40, pp. 29-43. [DOI: https://dx.doi.org/10.1111/cgf.14339]
40. Riegler, G., Koltun, V.: Free view synthesis. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16. Springer; p. 623–640 (2020)
41. Wang, Z; Bovik, AC; Sheikh, HR; Simoncelli, EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.; 2004; 13,
42. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. p. 586–595 (2018)
43. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell.; 2007; 30,