Abstract:
Multi-modal 3D object detection stands as a pivotal task in autonomous driving and environmental perception. It achieves accurate localization and recognition of objects in 3D space by integrating data from multiple sensors such as LiDAR point clouds and camera images. To address prevalent challenges in existing methods, including significant scale variations and suboptimal feature extraction, this paper proposes a multi-modal 3D object detection approach based on dynamic scaling convolution. The method designs a dual-branch multi-modal fusion model, composed of a VoxelNet branch and a novel dynamic scaling network, which jointly process LiDAR point clouds and image-derived virtual point clouds to achieve cross-modal feature alignment and adaptive fusion. The core of the dynamic scaling network is a dynamic scaling module that incorporates 3D and 2D dynamic scaling convolutions tailored to multi-modal data. This module adaptively adjusts the sampling locations and the number of valid sampling points within the receptive field across both 3D and 2D dimensions, capturing the structural and semantic variations of multi-scale objects. Extensive experiments on two benchmark datasets demonstrate that the proposed method outperforms more than ten prevailing approaches in terms of average precision. Specifically, it achieves 3D detection average precision (AP) scores of 86.87%, 46.68%, and 68.39% for the car, pedestrian, and cyclist categories, respectively, on the KITTI test set, and a mean average precision (mAP) of 71.8% on the nuScenes test set. These results validate the detection accuracy and generalization capability of the proposed method in multi-modal 3D object detection tasks.
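To make the notion of a 2D dynamic scaling convolution concrete, the sketch below approximates the behavior described above (adaptive sampling locations plus a variable number of valid sampling points) with a modulated deformable convolution built on torchvision's deform_conv2d. The class name DynamicScalingConv2d, its parameters, and the offset/mask predictor are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DynamicScalingConv2d(nn.Module):
    """Illustrative sketch of a 2D dynamic scaling convolution.

    Assumption: the adaptive sampling described in the abstract can be
    approximated by learned per-position offsets (sampling locations) and
    sigmoid modulation masks (effective number of valid sampling points).
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.padding = kernel_size // 2
        # Predicts 2 offsets (x, y) plus 1 modulation scalar per kernel tap.
        self.offset_mask = nn.Conv2d(
            in_channels, 3 * kernel_size * kernel_size,
            kernel_size, padding=self.padding,
        )
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size)
        )
        self.bias = nn.Parameter(torch.zeros(out_channels))
        nn.init.kaiming_uniform_(self.weight, a=1)
        # Zero init keeps the layer close to a regular convolution at start.
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k2 = self.kernel_size * self.kernel_size
        out = self.offset_mask(x)
        offset, mask = out[:, : 2 * k2], out[:, 2 * k2:]
        # The gated mask suppresses uninformative taps, varying how many
        # sampling points effectively contribute at each location.
        mask = torch.sigmoid(mask)
        return deform_conv2d(
            x, offset, self.weight, self.bias,
            padding=self.padding, mask=mask,
        )


if __name__ == "__main__":
    layer = DynamicScalingConv2d(64, 128)
    feats = torch.randn(2, 64, 32, 32)   # e.g. an image-branch feature map
    print(layer(feats).shape)            # torch.Size([2, 128, 32, 32])
```

A 3D counterpart would follow the same pattern on voxelized point-cloud features, predicting offsets and masks over a 3D kernel; the paper's actual formulation may differ from this deformable-convolution approximation.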