Abstract:
Video feature learning with deep neural networks has become an active research topic in video semantic analysis, including video object detection, motion recognition, and video event detection. The topographic information of a video image plays an important role in describing the relationship between the image and its content. In addition, optimizing features with respect to the characteristics of the video sequence helps improve the discriminability of the video feature representation. This paper proposes a video feature learning approach based on pre-training a convolutional neural network with a new topographic sparse encoder. The method has two stages: semi-supervised video image feature learning and supervised video sequence feature optimization. In the first stage, a new topographic sparse encoder is used to pre-train the network so that the feature representation of a video image reflects the image's topographic information, and logistic regression is then used to fine-tune the network parameters with video concept labels. In the second stage, a fully connected layer is constructed to learn a feature representation of the video sequence, and a logistic regression constraint is used to adjust the network parameters so that discriminative video sequence features are obtained. Comparative experiments with related methods are carried out on typical video datasets. The results show that the proposed method yields more discriminative video features and effectively improves the accuracy of video semantic concept detection.
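
The abstract does not state the form of the new topographic sparse encoder. As a rough illustration of what a topographic sparsity term generally looks like, the sketch below computes a generic group-sparsity penalty over 3x3 wrap-around neighbourhoods of hidden units laid out on a 2D grid, in the style of topographic sparse coding. The function name `topographic_penalty`, the grid layout, the neighbourhood size, and the smoothing constant `eps` are all assumptions made for illustration and are not the formulation proposed in the paper.

```python
import numpy as np

def topographic_penalty(h, grid_shape, eps=1e-6):
    """Generic topographic group-sparsity penalty (illustrative assumption).

    h          : array of shape (batch, n_hidden) with hidden activations.
    grid_shape : (rows, cols) with rows * cols == n_hidden; hidden units
                 are assumed to be arranged on this 2D grid.
    Returns the batch mean of sum_g sqrt(sum_{j in N(g)} h_j^2 + eps),
    where N(g) is the 3x3 wrap-around neighbourhood of grid position g.
    """
    rows, cols = grid_shape
    assert h.shape[1] == rows * cols, "grid must match the number of hidden units"
    sq = (np.asarray(h, dtype=float) ** 2).reshape(-1, rows, cols)
    pooled = np.zeros_like(sq)
    # Sum squared activations over each 3x3 neighbourhood (toroidal wrap).
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            pooled += np.roll(sq, shift=(dr, dc), axis=(1, 2))
    # Square root per neighbourhood, summed over the grid, averaged over the batch.
    return float(np.sqrt(pooled + eps).sum(axis=(1, 2)).mean())

# Two active units that are neighbours on the grid ...
clustered = np.zeros((1, 36))
clustered[0, [0, 1]] = 1.0          # grid positions (0, 0) and (0, 1)
# ... versus two active units that are far apart.
scattered = np.zeros((1, 36))
scattered[0, [0, 21]] = 1.0         # grid positions (0, 0) and (3, 3)

penalty_clustered = topographic_penalty(clustered, (6, 6))
penalty_scattered = topographic_penalty(scattered, (6, 6))
```

Under this penalty, activations that cluster on the grid cost less than the same activations spread apart, which is what encourages neighbouring hidden units to respond to similar inputs. In a pre-training setup of the kind described above, a term like this would typically be added to a reconstruction loss with a weighting coefficient.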