
Neural Architecture Search on Temporal Convolutions for Complex Action Recognition

Ren Pengzhen, Liang Xiaodan, Chang Xiaojun, Xiao Yun

Citation: Ren Pengzhen, Liang Xiaodan, Chang Xiaojun, Xiao Yun. Neural Architecture Search on Temporal Convolutions for Complex Action Recognition[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440048


Funding: National-level — China Postdoctoral Science Foundation, 73rd batch General Program (2023M734009)


  • Abstract: In complex action recognition in videos, the structural design of the model plays a crucial role in its final performance. However, manually designed network architectures rely heavily on the knowledge and experience of researchers. Neural architecture search (NAS) has therefore received widespread attention in image processing for its automated network design. NAS has already made great progress in the image domain: some methods reduce the number of GPU-days required for automated model design to single digits, and the searched architectures show strong competitive potential. This encourages us to extend automated architecture design to the video domain, which faces two serious challenges: (1) how to capture as much of the long-range contextual temporal association in video as possible; and (2) how to curb the computational surge caused by 3D convolution. To address these challenges, we propose a novel model, Neural Architecture Search on Temporal Convolutions for Complex Action Recognition (NAS-TC). NAS-TC is a two-stage framework. In the first stage, a classic CNN serves as the backbone network to perform the computationally intensive feature-extraction task. In the second stage, we propose a neural-architecture-searched temporal convolutional layer to carry out the relatively lightweight design and extraction of long-range temporal information. This gives our method a more reasonable parameter allocation and allows it to handle minute-long videos. On three complex action recognition benchmark datasets, the proposed method achieves an average performance gain of 2.3% mAP over comparable methods while reducing the number of parameters by 28.5%.
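The two-stage framework described in the abstract can be illustrated with a rough, runnable sketch. This is not the paper's implementation: the backbone is replaced by stand-in random per-frame features, the weights are random, and the dilation schedule is only illustrative (in NAS-TC the temporal-layer configuration would be chosen by the architecture search).

```python
import numpy as np

def temporal_conv(features, kernel, dilation):
    """Dilated 1D convolution along the time axis with 'same' padding.

    features: (T, D) per-frame features from a 2D CNN backbone (stage 1).
    kernel:   (K, D, D) temporal convolution weights.
    returns:  (T, D) temporally aggregated features.
    """
    T, D = features.shape
    K = kernel.shape[0]
    pad = (K - 1) * dilation // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros((T, D))
    for t in range(T):
        for k in range(K):
            out[t] += padded[t + k * dilation] @ kernel[k]
    return out

def nas_tc_head(frame_features, dilations=(1, 2, 4)):
    """Stage-2 sketch: a stack of dilated temporal conv layers over frame
    features. Increasing dilations widen the temporal receptive field,
    which is how long-range context can be captured without 3D convolution.
    The dilations here are illustrative, not the searched architecture."""
    rng = np.random.default_rng(0)
    x = frame_features
    for d in dilations:
        w = rng.standard_normal((3, x.shape[1], x.shape[1])) * 0.1
        x = np.maximum(temporal_conv(x, w, d), 0.0)  # ReLU
    return x.mean(axis=0)  # video-level descriptor for classification

# Stage-1 stand-in: pretend a frozen 2D CNN extracted 64-dim features
# for 30 sampled frames of a long video.
frames = np.random.default_rng(1).standard_normal((30, 64))
video_descriptor = nas_tc_head(frames)
print(video_descriptor.shape)  # (64,)
```

The point of the split is parameter allocation: the heavy 2D backbone runs once per frame, while the searched temporal layers operating on compact feature vectors stay cheap, so even minute-long videos only grow the (light) temporal stage.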
Publication history
  • Received: 2024-01-28
  • Published online: 2025-03-02
