ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2019, Vol. 56, Issue (5): 1071-1081. doi: 10.7544/issn1000-1239.2019.20180463

• Artificial Intelligence •

Audio-Visual Correlated Multimodal Concept Detection

Dian Yujie, Jin Qin

  1. (School of Information, Renmin University of China, Beijing 100872) (dianyujie-blair@ruc.edu.cn)
  • Online: 2019-05-01
  • Supported by: National Natural Science Foundation of China (61772535); National Key Research and Development Program of China (2016YFB1001202)


Abstract: With the wide dissemination of online video sharing applications, a massive number of videos is generated online every day. Faced with such massive collections, people require more refined retrieval services. How to organize and manage the videos on the Internet by appropriate semantic concepts, so that users can retrieve videos more efficiently and accurately, has become one of the most challenging topics in video analysis. In many scenarios, sound and visual information must appear simultaneously to determine a video event. Therefore, this paper proposes a multimodal concept detection task based on audio-visual information. Firstly, a multimodal concept is defined as a noun-verb pair, in which the noun and the verb represent the visual and the audio information respectively; the two are semantically correlated and together describe the event denoted by the concept. Secondly, this paper performs end-to-end multimodal concept detection using convolutional neural networks: the audio-visual correlation is used as the training objective of a joint learning network. Experimental results show that, on the multimodal concept detection task, the joint network trained on audio-visual correlation outperforms a single visual or audio network. Thirdly, the joint network learns fine-grained feature representations. In the Huawei video concept detection task, visual features extracted from the joint network outperform features extracted from an ImageNet pre-trained network on some specific concepts; in the ESC-50 audio classification task, acoustic features from the joint network exceed those from a VGG network pre-trained on YouTube-8M by about 5.7%.
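The late-fusion idea described in the abstract — embed each modality with its own network, concatenate the embeddings, and score noun-verb concepts jointly — can be sketched as below. This is a minimal illustrative stand-in, not the paper's architecture: the concept list, feature dimensions, and random linear projections (in place of the real convolutional branches) are all assumptions.

```python
import numpy as np

# Hypothetical noun-verb concepts (noun = visual evidence, verb = acoustic
# evidence), following the paper's definition of a multimodal concept.
CONCEPTS = [("baby", "crying"), ("dog", "barking"), ("car", "honking")]

rng = np.random.default_rng(0)

# Stand-ins for the visual and audio CNN branches: each branch is reduced
# here to one random linear projection followed by a tanh nonlinearity.
W_vis = rng.standard_normal((128, 64))  # 128-d visual feature -> 64-d embedding
W_aud = rng.standard_normal((96, 64))   # 96-d audio feature   -> 64-d embedding

# Joint classifier over the fused (concatenated) 128-d embedding.
W_joint = rng.standard_normal((128, len(CONCEPTS)))

def detect(visual_feat, audio_feat):
    """Late-fusion forward pass: embed each modality, concatenate the
    embeddings, and score every noun-verb concept with a sigmoid."""
    v = np.tanh(visual_feat @ W_vis)
    a = np.tanh(audio_feat @ W_aud)
    fused = np.concatenate([v, a])          # joint audio-visual representation
    logits = fused @ W_joint
    return 1.0 / (1.0 + np.exp(-logits))    # independent per-concept scores

scores = detect(rng.standard_normal(128), rng.standard_normal(96))
print({c: round(float(s), 3) for c, s in zip(CONCEPTS, scores)})
```

In the actual system the two branches are convolutional networks trained end to end with the audio-visual correlation objective; the sketch only shows the fusion and per-concept scoring step.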

Key words: multimodal information, semantic concepts, video concept detection, video representation, video semantic understanding
