Multi-Modal 3D Object Detection Based on Dynamic Scaling Convolution

纪文宇; 苗壮; 张启阳; 崔小璐; 李唯暄; 李阳

doi:10.7544/issn1000-1239.202550676

Abstract: Multi-modal 3D object detection is a core task in autonomous driving and environment perception. It localizes and recognizes objects in 3D space by fusing data from multiple sensors, such as LiDAR (light detection and ranging) point clouds and camera images. Existing methods commonly struggle with large variations in object scale and with insufficient multi-scale feature extraction. To address these problems, this paper proposes a multi-modal 3D object detection method based on dynamic scaling convolution. The method uses a dual-branch multi-modal fusion architecture consisting of a VoxelNet-based 3D backbone and a newly designed dynamic scaling network. The two branches jointly process LiDAR point clouds and image-derived virtual point clouds to achieve cross-modal feature alignment and adaptive fusion. The core of the dynamic scaling network is a dynamic scaling module, which contains 3D and 2D dynamic scaling convolutions suited to multi-modal data. In both the 3D and 2D branches, this module adaptively adjusts the sampling locations and the number of valid sampling points within the receptive field, and thereby captures the structural and semantic differences among objects of different scales. Comparisons on two datasets against more than 10 mainstream methods show that the proposed method achieves higher AP (average precision) than all of them. Specifically, on the KITTI test set it achieves AP_3D (3D object detection average precision) of 86.87%, 46.68%, and 68.39% for the car, pedestrian, and cyclist categories, respectively, and on the nuScenes test set it achieves an mAP (mean average precision) of 71.8%. These results support the detection performance and generalization ability of the proposed method on multi-modal 3D object detection.
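The abstract does not specify how the dynamic scaling convolution is computed. As a rough illustration of what "adjusting the sampling locations and the number of valid sampling points" can mean for a 2D kernel, the sketch below evaluates one 3x3 output value where each tap has its own offset and its own mask value, in the spirit of modulated deformable convolution. It is an assumption-laden sketch, not the authors' operator: the names `bilinear_sample` and `dynamic_conv_at` are hypothetical, and the offsets and masks, which a network would normally predict, are passed in by hand here.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D grid at a fractional (y, x); points outside the grid read as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                wy = 1.0 - abs(y - yi)  # vertical interpolation weight
                wx = 1.0 - abs(x - xi)  # horizontal interpolation weight
                value += wy * wx * feat[yi][xi]
    return value


def dynamic_conv_at(feat, weights, offsets, mask, cy, cx):
    """One output value of a 3x3 convolution with per-tap offsets and masks.

    offsets[k] = (off_y, off_x) shifts tap k away from its regular grid
    position (assumed to be predicted elsewhere); mask[k] in [0, 1] scales
    the tap's contribution, so mask[k] = 0 removes that sampling point.
    """
    taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(taps):
        off_y, off_x = offsets[k]
        sample = bilinear_sample(feat, cy + dy + off_y, cx + dx + off_x)
        out += weights[k] * mask[k] * sample
    return out


# Toy 5x5 feature map and a kernel of ones (illustrative values only).
feat = [[float(5 * y + x) for x in range(5)] for y in range(5)]
ones = [1.0] * 9
taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

regular = [(0.0, 0.0)] * 9                          # ordinary 3x3 window
widened = [(float(dy), float(dx)) for dy, dx in taps]  # taps pushed outward: 5x5 extent
no_corners = [0.0 if dy != 0 and dx != 0 else 1.0 for dy, dx in taps]

print(dynamic_conv_at(feat, ones, regular, ones, 2, 2))
print(dynamic_conv_at(feat, ones, widened, no_corners, 2, 2))
```

Enlarging the offsets widens the receptive field for large objects, while zeroing mask entries reduces the number of points that actually contribute, which is one plausible reading of the two adaptive quantities named in the abstract.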
