
Antagonistic Video Generation Method Based on Multimodal Input

Yu Haitao, Yang Xiaoshan, Xu Changsheng

Citation: Yu Haitao, Yang Xiaoshan, Xu Changsheng. Antagonistic Video Generation Method Based on Multimodal Input[J]. Journal of Computer Research and Development, 2020, 57(7): 1522-1530. DOI: 10.7544/issn1000-1239.2020.20190479. CSTR: 32373.14.issn1000-1239.2020.20190479


  • CLC number: TP391


Funds: This work was supported by the National Key Research and Development Program of China (2018AAA0100604), the National Natural Science Foundation of China (61702511, 61720106006, 61728210, 61751211, U1836220, U1705262, 61872424), and the Research Program of National Laboratory of Pattern Recognition (Z-2018007).
  • Abstract: Video generation is an important and challenging task in the fields of computer vision and multimedia. Existing video generation methods based on generative adversarial networks (GANs) usually lack an effective scheme to control the coherence of the generated video. Realizing artificial intelligence algorithms that can automatically generate realistic video is an important indicator of a more complete understanding of visual appearance and motion. A new multimodal conditional video generation model is proposed in this paper. The model takes pictures and text as input, obtains the motion information of the video through a text feature encoding network and a motion feature decoding network, and generates a coherent motion video sequence by combining the input images. In addition, the method predicts video frames by applying affine transformations to the input images, which makes the generation model more controllable and the generated results more robust. Experimental results on the SBMG (single-digit bouncing MNIST gifs), TBMG (two-digit bouncing MNIST gifs), and KTH (Kungliga Tekniska Högskolan human actions) datasets show that the proposed method performs better in both target clarity and video coherence than existing methods. In addition, qualitative evaluation and quantitative evaluation with the SSIM (structural similarity index) and PSNR (peak signal-to-noise ratio) metrics demonstrate that the proposed multimodal video frame generation network plays a key role in the generation process.
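The abstract states that video frames are predicted by applying affine transformations to the input image. A minimal sketch of that warping step, assuming the motion decoder emits one 2×3 affine matrix per time step — the function name, parameter shapes, and nearest-neighbour sampling here are illustrative assumptions, not details from the paper:

```python
import numpy as np

def predict_frame(image, theta):
    """Warp `image` by a 2x3 affine matrix `theta` using nearest-neighbour
    sampling; `theta` stands in for one time step of motion-decoder output."""
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel(), np.ones(h * w)])  # homogeneous, 3 x N
    src = theta @ coords                                         # source coords, 2 x N
    sy = np.clip(np.rint(src[0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(src[1]).astype(int), 0, w - 1)
    return image[sy, sx].reshape(h, w)

# The identity transform must reproduce the input frame exactly.
frame = np.arange(16.0).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
warped = predict_frame(frame, identity)
```

Predicting a warp of the input rather than raw pixels is what makes the output controllable: a nonzero translation column in `theta` moves the content rigidly across the canvas, which matches the kind of motion found in bouncing-MNIST sequences.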
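The quantitative evaluation in the abstract relies on PSNR and SSIM between generated and ground-truth frames. A self-contained sketch of both measures — note the SSIM here is a simplified whole-frame version, whereas the standard metric averages the same statistic over local (typically 11×11 Gaussian) windows:

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means frames match better."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Structural similarity computed over the whole frame at once
    (the standard SSIM averages this statistic over local windows)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den
```

Identical frames give infinite PSNR and SSIM of 1; both scores degrade monotonically with the mean-squared (PSNR) or structural (SSIM) discrepancy, which is why the two are reported together.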

Publication history
  • Published: 2020-06-30
