ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (7): 1531-1538. doi: 10.7544/issn1000-1239.2020.20190478

• Graphics and Images •



  • Published: 2020-07-01

Task-Adaptive End-to-End Networks for Stereo Matching

Li Tong1, Ma Wei1, Xu Shibiao2, Zhang Xiaopeng2   

  • 1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124; 2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190
  • Online: 2020-07-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61771026, 61671451) and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).

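Abstract (translated from Chinese): To address the high redundancy of feature extraction modules in existing deep networks for stereo matching, and the limited receptive field of the 3D convolution modules used for disparity computation, an improved end-to-end deep network is proposed. Compared with existing networks, its feature extraction module follows the characteristics of stereo matching and has a more concise structure; separated 3D convolution is introduced to realize 3D convolution with large kernels and thereby enlarge the receptive field. The proposed network is evaluated on the SceneFlow dataset in terms of matching accuracy and computational cost. Experimental results show that the proposed network reaches state-of-the-art accuracy. Compared with existing modules of the same type, the proposed feature extraction module reduces the number of parameters by 90% and the training time by about 25% while maintaining accuracy. Compared with ordinary 3D convolution, the proposed separated 3D convolution enlarges the kernel to cover the entire disparity dimension and, combined with group normalization (GN), lowers the end-point-error (EPE) by 12% relative to the baseline method.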


Abstract: Estimating depth or disparity from stereo image pairs via stereo matching is a classical research topic in computer vision. With the development of deep learning, many end-to-end deep networks have been proposed for stereo matching. These networks generally extract features with convolutional neural network (CNN) structures that were originally designed for other tasks and are therefore redundant for stereo matching. In addition, the 3D convolutions in these networks are too costly to extend to the large receptive fields that benefit disparity estimation. To overcome these problems, we propose a deep network structure based on the properties of stereo matching. The network uses a concise and effective feature extraction module, and it introduces a separated 3D convolution that avoids the parameter explosion caused by enlarging convolution kernels. We validate the network on the SceneFlow dataset in terms of both accuracy and computational cost. Results show that the proposed network achieves state-of-the-art accuracy. Compared with existing feature extraction structures, our module reduces the number of parameters by 90% and the training time by about 25% while achieving comparable accuracy. Our separated 3D convolution extends the kernel to cover the entire disparity dimension and, together with group normalization (GN), achieves an end-point-error (EPE) 12% lower in relative terms than the baseline method.

Key words: stereo matching, disparity estimation, feature extraction, 3D convolution, end-to-end network
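
To make the "parameter explosion" argument in the abstract concrete, the following is a minimal sketch that counts the weights in a dense 3D convolution and in one possible separated form. The abstract does not state how the kernel is separated, so the factorization used here (a 1D pass along the disparity axis followed by a 2D spatial pass) is an assumption, and the function names, channel count, and kernel sizes are hypothetical values chosen only for illustration.

```python
def conv3d_params(kd, kh, kw, c_in, c_out):
    """Weight count of a dense 3D convolution with a kd x kh x kw kernel (no bias)."""
    return kd * kh * kw * c_in * c_out


def separated_conv3d_params(kd, kh, kw, c_in, c_out):
    """Weight count of an ASSUMED separated form (not taken from the paper):
    a kd x 1 x 1 convolution along disparity, then a 1 x kh x kw spatial convolution."""
    along_disparity = conv3d_params(kd, 1, 1, c_in, c_out)
    spatial = conv3d_params(1, kh, kw, c_out, c_out)
    return along_disparity + spatial


# Hypothetical sizes: 32 channels, 3 x 3 spatial kernel, 48 disparity levels.
channels = 32
disparity_levels = 48

dense_small = conv3d_params(3, 3, 3, channels, channels)
dense_large = conv3d_params(disparity_levels, 3, 3, channels, channels)
separated_large = separated_conv3d_params(disparity_levels, 3, 3, channels, channels)

print("dense 3x3x3:      ", dense_small)
print("dense 48x3x3:     ", dense_large)
print("separated 48x3x3: ", separated_large)
```

Under these assumed sizes, the dense kernel's weight count grows with the product of its three extents, whereas the separated form grows with their sum, which is why a kernel spanning the whole disparity dimension stays affordable.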