
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

3D object detection plays a pivotal role in accurate environmental perception, particularly in complex traffic scenarios where single-modal detection methods often fail to meet precision requirements, making multi-modal fusion approaches necessary to enhance detection performance. However, existing camera-LiDAR intermediate fusion methods suffer from insufficient interaction between local and global features and limited fine-grained feature extraction, which leads to inadequate small-object detection and unstable performance in complex scenes. To address these issues, a multi-modal 3D object detection algorithm with pointwise and voxelwise fusion (MPVF) is proposed, which strengthens multi-modal feature interaction and optimizes the feature extraction strategy to improve detection precision and robustness. First, the pointwise and voxelwise fusion (PVWF) module is proposed to combine local features from the pointwise fusion (PWF) module with global features from the voxelwise fusion (VWF) module, enhancing cross-modal feature interaction, improving small-object detection, and boosting model performance in complex scenes. Second, an expressive feature extraction module, improved ResNet-101 and feature pyramid (IRFP), is developed, comprising the improved ResNet-101 (IR) and feature pyramid (FP) modules. The IR module uses a group convolution strategy to inject high-level semantic features into the PWF and VWF modules, improving extraction efficiency. The FP module, placed at an intermediate stage, captures fine-grained features at multiple resolutions, enhancing the model's precision and robustness. Finally, evaluation on the KITTI dataset demonstrates a mean Average Precision (mAP) of 69.24%, a 2.75% improvement over GraphAlign++. Detection accuracy for cars, pedestrians, and cyclists reaches 85.12%, 48.61%, and 70.12%, respectively, with the proposed method excelling in pedestrian and cyclist detection.
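The core PVWF idea described above, fusing each point's local (pointwise) feature with the global (voxelwise) feature of the voxel it falls in, can be sketched as below. The shapes, function name, concatenation-based fusion, and point-to-voxel mapping are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

def pvwf_fuse(point_feats, voxel_feats, point_to_voxel):
    """Hypothetical sketch of pointwise/voxelwise fusion: gather the
    global feature of each point's voxel and concatenate it with the
    point's local feature.

    point_feats:    (N, Cl) local features, one per point
    voxel_feats:    (V, Cg) global features, one per voxel
    point_to_voxel: (N,) index of the voxel containing each point
    returns:        (N, Cl + Cg) fused per-point features
    """
    gathered = voxel_feats[point_to_voxel]                  # (N, Cg)
    return np.concatenate([point_feats, gathered], axis=1)  # (N, Cl + Cg)

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 4))    # 5 points, 4-dim local features
vox = rng.normal(size=(2, 8))    # 2 voxels, 8-dim global features
idx = np.array([0, 0, 1, 1, 0])  # which voxel each point belongs to
fused = pvwf_fuse(pts, vox, idx)
print(fused.shape)               # (5, 12)
```

In an end-to-end detector this fusion would operate on learned image and LiDAR features rather than random arrays, and the fusion operator itself (concatenation here) would typically be followed by learned layers; the sketch only shows the local-global gather-and-join step.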

Details

Title
MPVF: Multi-Modal 3D Object Detection Algorithm with Pointwise and Voxelwise Fusion
Author
Shi, Peicheng 1; Wu, Wenchao 1; Yang, Aixi 2

1 School of Mechanical and Automotive Engineering, Anhui Polytechnic University, Wuhu 241000, China; [email protected]
2 Polytechnic Institute, Zhejiang University, Hangzhou 310015, China; [email protected]
First page
172
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
1999-4893
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3181338145