
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

3D object detection plays a pivotal role in accurate environmental perception, particularly in complex traffic scenarios where single-modal detection methods often fail to meet precision requirements, making multi-modal fusion approaches necessary to enhance detection performance. However, existing camera-LiDAR intermediate fusion methods suffer from insufficient interaction between local and global features and limited fine-grained feature extraction, which leads to inadequate small-object detection and unstable performance in complex scenes. To address these issues, a multi-modal 3D object detection algorithm with pointwise and voxelwise fusion (MPVF) is proposed, which strengthens multi-modal feature interaction and optimizes the feature extraction strategy to improve detection precision and robustness. First, the pointwise and voxelwise fusion (PVWF) module is proposed to combine local features from the pointwise fusion (PWF) module with global features from the voxelwise fusion (VWF) module, enhancing cross-modal feature interaction, improving small-object detection, and boosting model performance in complex scenes. Second, an expressive feature extraction module, improved ResNet-101 and feature pyramid (IRFP), is developed, comprising the improved ResNet-101 (IR) and feature pyramid (FP) modules. The IR module uses a group convolution strategy to inject high-level semantic features into the PWF and VWF modules, improving extraction efficiency. The FP module, placed at an intermediate stage, captures fine-grained features at multiple resolutions, enhancing the model's precision and robustness. Finally, evaluation on the KITTI dataset demonstrates a mean Average Precision (mAP) of 69.24%, a 2.75% improvement over GraphAlign++. Detection accuracy for cars, pedestrians, and cyclists reaches 85.12%, 48.61%, and 70.12%, respectively, with the proposed method excelling in pedestrian and cyclist detection.
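The core PVWF idea described above, fusing each point's local (pointwise) feature with the global (voxelwise) feature of the voxel it falls in, can be sketched as below. The shapes, function name, concatenation-based fusion, and point-to-voxel mapping are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

def pvwf_fuse(point_feats, voxel_feats, point_to_voxel):
    """Hypothetical sketch of pointwise/voxelwise fusion: gather the
    global feature of each point's voxel and concatenate it with the
    point's local feature.

    point_feats:    (N, Cl) local features, one per point
    voxel_feats:    (V, Cg) global features, one per voxel
    point_to_voxel: (N,) index of the voxel containing each point
    returns:        (N, Cl + Cg) fused per-point features
    """
    gathered = voxel_feats[point_to_voxel]                  # (N, Cg)
    return np.concatenate([point_feats, gathered], axis=1)  # (N, Cl + Cg)

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 4))    # 5 points, 4-dim local features
vox = rng.normal(size=(2, 8))    # 2 voxels, 8-dim global features
idx = np.array([0, 0, 1, 1, 0])  # which voxel each point belongs to
fused = pvwf_fuse(pts, vox, idx)
print(fused.shape)               # (5, 12)
```

In an end-to-end detector this fusion would operate on learned image and LiDAR features rather than random arrays, and the fusion operator itself (concatenation here) would typically be followed by learned layers; the sketch only shows the local-global gather-and-join step.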

Details

Title
MPVF: Multi-Modal 3D Object Detection Algorithm with Pointwise and Voxelwise Fusion
Author
Shi, Peicheng 1; Wu, Wenchao 1; Yang, Aixi 2

1 School of Mechanical and Automotive Engineering, Anhui Polytechnic University, Wuhu 241000, China; [email protected]
2 Polytechnic Institute, Zhejiang University, Hangzhou 310015, China; [email protected]
First page
172
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
1999-4893
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3181338145