Abstract:
In complex action recognition in videos, the structural design of the model plays a crucial role in its final performance. However, manually designed network structures rely heavily on the knowledge and experience of researchers. Neural architecture search (NAS) has therefore received widespread attention in the image-processing field because it automates network structure design. NAS has achieved tremendous progress in the image domain: some methods reduce the number of graphics processing unit (GPU) days required for automated model design to single digits, and the searched model structures are strongly competitive. This encourages us to extend automated model structure design to the video domain, which raises two serious challenges: 1) how to capture long-range contextual temporal associations in video as fully as possible; 2) how to limit the computational surge caused by 3D convolution. To address these challenges, we propose a novel Neural Architecture Search on Temporal Convolutions for complex action recognition (NAS-TC). NAS-TC is a two-stage framework. In the first stage, a classic convolutional neural network (CNN) serves as the backbone to perform the computationally intensive feature-extraction task. In the second stage, we propose a neural-architecture-searched temporal convolutional layer (the NAS-TC layer) to perform relatively lightweight long-range temporal model design and information extraction. This gives our method a more reasonable parameter allocation and allows it to handle minute-level videos. On three complex action recognition benchmark datasets, the proposed method achieves an average performance gain of 2.3% mAP over comparable methods while reducing the number of parameters by 28.5%.
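The two-stage split described above (a heavy per-frame backbone, then a lightweight temporal operator over the resulting feature sequence) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the backbone is stood in for by a fixed random projection, and `dilated_temporal_conv` is one hand-written dilated 1D convolution of the kind a temporal NAS cell might select; all function names and shapes are assumptions.

```python
import numpy as np

def extract_frame_features(frames, feat_dim=64, seed=0):
    """Stage 1 stand-in: a frozen 2D-CNN backbone would map each frame to a
    feature vector; here a fixed random projection plays that role."""
    rng = np.random.default_rng(seed)
    t, h, w, c = frames.shape
    proj = rng.standard_normal((h * w * c, feat_dim)) / np.sqrt(h * w * c)
    return frames.reshape(t, -1) @ proj  # (T, feat_dim)

def dilated_temporal_conv(features, weights, dilation=1):
    """Stage 2 stand-in: one dilated 1D convolution over the time axis,
    widening the temporal receptive field at low parameter cost."""
    t, _ = features.shape
    k = weights.shape[0]                    # weights: (kernel, d_in, d_out)
    pad = (k - 1) * dilation // 2           # "same" padding in time
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros((t, weights.shape[2]))
    for i in range(t):                      # each output time step
        for j in range(k):                  # each dilated kernel tap
            out[i] += padded[i + j * dilation] @ weights[j]
    return out

# Toy "video": 16 frames of 8x8 RGB, processed by the two stages in order.
frames = np.random.default_rng(1).standard_normal((16, 8, 8, 3))
feats = extract_frame_features(frames)                    # (16, 64)
w = np.random.default_rng(2).standard_normal((3, 64, 32)) * 0.1
temporal = dilated_temporal_conv(feats, w, dilation=2)    # (16, 32)
```

With dilation 2 and kernel size 3, each output step already covers a 5-frame window; stacking a few such layers reaches minute-level context far more cheaply than 3D convolution over raw frames.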