
Neural Architecture Search on Temporal Convolutions for Complex Action Recognition

Ren Pengzhen, Liang Xiaodan, Chang Xiaojun, Xiao Yun

Citation: Ren Pengzhen, Liang Xiaodan, Chang Xiaojun, Xiao Yun. Neural Architecture Search on Temporal Convolutions for Complex Action Recognition[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440048


Funding: National-level — China Postdoctoral Science Foundation, 73rd batch General Program (2023M734009)


  • Abstract: In complex action recognition in videos, the structural design of the model plays a crucial role in its final performance. However, manually designed network architectures rely heavily on the knowledge and experience of researchers. Neural architecture search (NAS) has therefore received widespread attention in image processing for its automated network design. NAS has already made great progress in the image domain: some methods reduce the number of GPU-days required for automated model design to single digits, and the searched architectures show strong competitive potential. This encourages us to extend automated architecture design to the video domain, which faces two serious challenges: (1) how to capture as much of the long-range contextual temporal association in video as possible; and (2) how to curb the computational surge caused by 3D convolution. To address these challenges, we propose a novel model, Neural Architecture Search on Temporal Convolutions for Complex Action Recognition (NAS-TC). NAS-TC is a two-stage framework. In the first stage, a classic CNN serves as the backbone network to perform the computationally intensive feature-extraction task. In the second stage, we propose a neural-architecture-searched temporal convolutional layer to carry out the relatively lightweight design and extraction of long-range temporal information. This gives our method a more reasonable parameter allocation and allows it to handle minute-long videos. On three complex action recognition benchmark datasets, the proposed method achieves an average performance gain of 2.3% mAP over comparable methods while reducing the number of parameters by 28.5%.
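The two-stage framework described in the abstract can be illustrated with a rough, runnable sketch. This is not the paper's implementation: the backbone is replaced by stand-in random per-frame features, the weights are random, and the dilation schedule is only illustrative (in NAS-TC the temporal-layer configuration would be chosen by the architecture search).

```python
import numpy as np

def temporal_conv(features, kernel, dilation):
    """Dilated 1D convolution along the time axis with 'same' padding.

    features: (T, D) per-frame features from a 2D CNN backbone (stage 1).
    kernel:   (K, D, D) temporal convolution weights.
    returns:  (T, D) temporally aggregated features.
    """
    T, D = features.shape
    K = kernel.shape[0]
    pad = (K - 1) * dilation // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros((T, D))
    for t in range(T):
        for k in range(K):
            out[t] += padded[t + k * dilation] @ kernel[k]
    return out

def nas_tc_head(frame_features, dilations=(1, 2, 4)):
    """Stage-2 sketch: a stack of dilated temporal conv layers over frame
    features. Increasing dilations widen the temporal receptive field,
    which is how long-range context can be captured without 3D convolution.
    The dilations here are illustrative, not the searched architecture."""
    rng = np.random.default_rng(0)
    x = frame_features
    for d in dilations:
        w = rng.standard_normal((3, x.shape[1], x.shape[1])) * 0.1
        x = np.maximum(temporal_conv(x, w, d), 0.0)  # ReLU
    return x.mean(axis=0)  # video-level descriptor for classification

# Stage-1 stand-in: pretend a frozen 2D CNN extracted 64-dim features
# for 30 sampled frames of a long video.
frames = np.random.default_rng(1).standard_normal((30, 64))
video_descriptor = nas_tc_head(frames)
print(video_descriptor.shape)  # (64,)
```

The point of the split is parameter allocation: the heavy 2D backbone runs once per frame, while the searched temporal layers operating on compact feature vectors stay cheap, so even minute-long videos only grow the (light) temporal stage.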
Publication history
  • Received: 2024-01-28
  • Published online: 2025-03-02
