Abstract:
Multi-modal 3D object detection stands as a pivotal task in autonomous driving and environmental perception. It achieves accurate localization and recognition of objects in 3D space by integrating data from multiple sensors such as LiDAR point clouds and camera images. To address prevalent challenges in existing methods, including significant scale variations and suboptimal feature extraction, this paper proposes a multi-modal 3D object detection approach based on dynamic scaling convolution. The method designs a dual-branch multi-modal fusion model, composed of a VoxelNet branch and a novel dynamic scaling network, which jointly process LiDAR point clouds and image-derived virtual point clouds to achieve cross-modal feature alignment and adaptive fusion. The core of the dynamic scaling network is a dynamic scaling module that incorporates 3D and 2D dynamic scaling convolutions tailored to multi-modal data. This module adaptively adjusts the sampling locations and the number of valid sampling points within the receptive field across both 3D and 2D dimensions, capturing the structural and semantic variations of multi-scale objects. Extensive experiments on two benchmark datasets demonstrate that the proposed method outperforms more than ten prevailing approaches in terms of average precision. Specifically, it achieves 3D detection average precision (AP) scores of 86.87%, 46.68%, and 68.39% for the car, pedestrian, and cyclist categories, respectively, on the KITTI test set, and a mean average precision (mAP) of 71.8% on the nuScenes test set. These results validate the detection accuracy and generalization capability of the proposed method in multi-modal 3D object detection tasks.
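To make the notion of a 2D dynamic scaling convolution concrete, the sketch below approximates the behavior described above (adaptive sampling locations plus a variable number of valid sampling points) with a modulated deformable convolution built on torchvision's deform_conv2d. The class name DynamicScalingConv2d, its parameters, and the offset/mask predictor are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DynamicScalingConv2d(nn.Module):
    """Illustrative sketch of a 2D dynamic scaling convolution.

    Assumption: the adaptive sampling described in the abstract can be
    approximated by learned per-position offsets (sampling locations) and
    sigmoid modulation masks (effective number of valid sampling points).
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.padding = kernel_size // 2
        # Predicts 2 offsets (x, y) plus 1 modulation scalar per kernel tap.
        self.offset_mask = nn.Conv2d(
            in_channels, 3 * kernel_size * kernel_size,
            kernel_size, padding=self.padding,
        )
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size)
        )
        self.bias = nn.Parameter(torch.zeros(out_channels))
        nn.init.kaiming_uniform_(self.weight, a=1)
        # Zero init keeps the layer close to a regular convolution at start.
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k2 = self.kernel_size * self.kernel_size
        out = self.offset_mask(x)
        offset, mask = out[:, : 2 * k2], out[:, 2 * k2:]
        # The gated mask suppresses uninformative taps, varying how many
        # sampling points effectively contribute at each location.
        mask = torch.sigmoid(mask)
        return deform_conv2d(
            x, offset, self.weight, self.bias,
            padding=self.padding, mask=mask,
        )


if __name__ == "__main__":
    layer = DynamicScalingConv2d(64, 128)
    feats = torch.randn(2, 64, 32, 32)   # e.g. an image-branch feature map
    print(layer(feats).shape)            # torch.Size([2, 128, 32, 32])
```

A 3D counterpart would follow the same pattern on voxelized point-cloud features, predicting offsets and masks over a 3D kernel; the paper's actual formulation may differ from this deformable-convolution approximation.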