Video Semantic Analysis Based on a CNN Pre-Trained with Topographic Sparse Coding

程晓阳; 詹永照; 毛启容; 詹智财

doi:10.7544/issn1000-1239.2018.20170579

Video Semantic Analysis Based on Topographic Sparse Pre-Training CNN

Abstract

Abstract: Deep learning of video features has become a research hotspot in video semantic analysis tasks such as video object detection, action recognition, and video event detection. The topographic information of a video image plays an important role in describing the relationships within the image content, and optimizing feature learning with labeled videos while taking the characteristics of video sequences into account helps improve the discriminability of the video feature representation. Based on these considerations, this paper proposes a video feature learning method based on a CNN pre-trained with topographic sparse coding and applies it to video semantic analysis. The method divides video feature learning into two stages: semi-supervised video image feature learning and supervised optimization of video sequence features. (1) In the semi-supervised stage, a new topographic sparse encoder is constructed and used to pre-train the parameters of each network layer, so that the feature representation of a video image reflects the image's topographic information; the network parameters are then fine-tuned by logistic regression on labeled video concept categories at the fully connected layer for image feature learning. (2) In the supervised stage, a fully connected layer for video feature learning is constructed, which combines the key-frame features of labeled video sequences; a logistic regression constraint is established to fine-tune the network parameters so that the learned video features are more discriminative between categories. Video semantic concept detection experiments comparing related methods were carried out on typical video datasets. The results show that the proposed method yields more discriminative video feature representations and effectively improves the detection rate of video semantic concepts.
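The abstract does not give the formulation of the new topographic sparse encoder. As a rough illustration of the general idea only, the sketch below implements the classical topographic sparse coding cost: a tied-weight linear autoencoder whose hidden units sit on a 2-D grid, with a group-wise penalty over overlapping grid neighborhoods. This is an assumption-laden stand-in, not the authors' encoder; the function names, the grid layout, and the values of `lam`, `eps`, and `radius` are all hypothetical.

```python
import numpy as np


def grouping_matrix(side, radius=1):
    """Build a 0/1 matrix V of shape (side*side, side*side).

    Hidden units are laid out on a side x side grid with wrap-around.
    Row g marks the units within `radius` grid steps of unit g, so each
    row describes one overlapping neighborhood (hypothetical layout).
    """
    n = side * side
    V = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            g = r * side + c
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    V[g, ((r + dr) % side) * side + (c + dc) % side] = 1.0
    return V


def topographic_sparse_cost(W, X, V, lam=0.1, eps=1e-4):
    """Reconstruction error plus a topographic sparsity penalty.

    W: (n_hidden, n_input) weights, X: (n_input, n_samples) inputs,
    V: (n_groups, n_hidden) grouping matrix.
    """
    H = W @ X                      # hidden activations
    X_hat = W.T @ H                # reconstruction with tied weights
    m = X.shape[1]
    recon = 0.5 * np.sum((X_hat - X) ** 2) / m
    # Smoothed group-wise L2 norm: squared activations are pooled over each
    # neighborhood, so neighboring units tend to switch on and off together.
    penalty = lam * np.sum(np.sqrt(V @ (H ** 2) + eps)) / m
    return recon + penalty


# Toy usage with random data (illustrative values only).
rng = np.random.default_rng(0)
V = grouping_matrix(side=4, radius=1)      # 16 hidden units, 3x3 neighborhoods
W = 0.1 * rng.standard_normal((16, 64))    # 64-dimensional inputs, e.g. 8x8 patches
X = rng.standard_normal((64, 10))
cost = topographic_sparse_cost(W, X, V)
```

Minimizing a cost of this kind with any gradient-based optimizer would give layer weights that could serve as an initialization before supervised fine-tuning, which is the role the abstract assigns to pre-training. Pooling over overlapping neighborhoods is what distinguishes this penalty from a plain L1 sparsity term, which treats every hidden unit independently.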
