ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2019, Vol. 56, Issue (5): 1071-1081. doi: 10.7544/issn1000-1239.2019.20180463

• Artificial Intelligence •

Audio-Visual Correlated Multimodal Concept Detection

Dian Yujie, Jin Qin

  1. (School of Information, Renmin University of China, Beijing 100872) (dianyujie-blair@ruc.edu.cn)
  • Online: 2019-05-01
  • Supported by: National Natural Science Foundation of China (61772535); National Key Research and Development Program of China (2016YFB1001202)


Abstract: With the wide dissemination of online video sharing applications, a massive number of videos is generated online every day. Faced with such massive collections, people require more refined retrieval services. How to organize and manage the videos on the Internet by appropriate semantic concepts, so that users can retrieve videos more efficiently and accurately, has become one of the most challenging topics in video analysis. In many scenarios, sound and visual information must appear simultaneously to determine a video event. Therefore, this paper proposes a multimodal concept detection task based on audio-visual information. Firstly, a multimodal concept is defined as a noun-verb pair, in which the noun and the verb represent the visual and the audio information respectively; the two are semantically correlated and together describe the event denoted by the concept. Secondly, this paper performs end-to-end multimodal concept detection using convolutional neural networks: the audio-visual correlation is used as the training objective of a joint learning network. Experimental results show that, on the multimodal concept detection task, the joint network trained on audio-visual correlation outperforms a single visual or audio network. Thirdly, the joint network learns fine-grained feature representations. In the Huawei video concept detection task, visual features extracted from the joint network outperform features extracted from an ImageNet pre-trained network on some specific concepts; in the ESC-50 audio classification task, acoustic features from the joint network exceed those from a VGG network pre-trained on YouTube-8M by about 5.7%.
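The late-fusion idea described in the abstract — embed each modality with its own network, concatenate the embeddings, and score noun-verb concepts jointly — can be sketched as below. This is a minimal illustrative stand-in, not the paper's architecture: the concept list, feature dimensions, and random linear projections (in place of the real convolutional branches) are all assumptions.

```python
import numpy as np

# Hypothetical noun-verb concepts (noun = visual evidence, verb = acoustic
# evidence), following the paper's definition of a multimodal concept.
CONCEPTS = [("baby", "crying"), ("dog", "barking"), ("car", "honking")]

rng = np.random.default_rng(0)

# Stand-ins for the visual and audio CNN branches: each branch is reduced
# here to one random linear projection followed by a tanh nonlinearity.
W_vis = rng.standard_normal((128, 64))  # 128-d visual feature -> 64-d embedding
W_aud = rng.standard_normal((96, 64))   # 96-d audio feature   -> 64-d embedding

# Joint classifier over the fused (concatenated) 128-d embedding.
W_joint = rng.standard_normal((128, len(CONCEPTS)))

def detect(visual_feat, audio_feat):
    """Late-fusion forward pass: embed each modality, concatenate the
    embeddings, and score every noun-verb concept with a sigmoid."""
    v = np.tanh(visual_feat @ W_vis)
    a = np.tanh(audio_feat @ W_aud)
    fused = np.concatenate([v, a])          # joint audio-visual representation
    logits = fused @ W_joint
    return 1.0 / (1.0 + np.exp(-logits))    # independent per-concept scores

scores = detect(rng.standard_normal(128), rng.standard_normal(96))
print({c: round(float(s), 3) for c, s in zip(CONCEPTS, scores)})
```

In the actual system the two branches are convolutional networks trained end to end with the audio-visual correlation objective; the sketch only shows the fusion and per-concept scoring step.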

Key words: multimodal information, semantic concepts, video concept detection, video representation, video semantic understanding
