Abstract:
Multi-modal 3D object detection is a pivotal task in autonomous driving and environment perception. It localizes and recognizes objects in 3D space by integrating data from multiple sensors, such as LiDAR (light detection and ranging) point clouds and camera images. Existing methods struggle with large variations in object scale and extract multi-scale features insufficiently. To address these challenges, this paper proposes a multi-modal 3D object detection method based on dynamic scaling convolution. The method uses a dual-branch multi-modal fusion architecture consisting of a VoxelNet-based 3D backbone and a novel dynamic scaling network. The two branches jointly process LiDAR point clouds and image-derived virtual point clouds to achieve cross-modal feature alignment and adaptive fusion. The core of the dynamic scaling network is a dynamic scaling module, which integrates 3D and 2D dynamic scaling convolution operators tailored for multi-modal feature processing. This module adaptively adjusts both the sampling locations and the number of valid sampling points within the receptive fields of the 3D and 2D branches, allowing it to capture the structural and semantic variations of multi-scale objects. Extensive experiments on the KITTI and nuScenes benchmark datasets show that the proposed method outperforms more than 10 prevailing methods in terms of AP (average precision). Specifically, it achieves mAP_3D (3D mean average precision) scores of 86.87%, 46.68%, and 68.39% for the car, pedestrian, and cyclist categories on the KITTI test set, and a mAP (mean average precision) of 71.8% on the nuScenes test set. These results indicate that the proposed method delivers strong detection performance and generalizes well across multi-modal 3D object detection benchmarks.