ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (7): 1522-1530. doi: 10.7544/issn1000-1239.2020.20190479


Adversarial Video Generation Method Based on Multimodal Input

Yu Haitao1, Yang Xiaoshan2, Xu Changsheng1,2   

  1(School of Computer and Information, Hefei University of Technology, Hefei 230031); 2(National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190)
  • Online:2020-07-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018AAA0100604), the National Natural Science Foundation of China (61702511, 61720106006, 61728210, 61751211, U1836220, U1705262, 61872424), and the Research Program of National Laboratory of Pattern Recognition (Z-2018007).

Abstract: Video generation is an important and challenging task in the fields of computer vision and multimedia. Existing video generation methods based on generative adversarial networks (GANs) usually lack an effective scheme for controlling the coherence of the generated video. The ability of an algorithm to automatically generate realistic video is an important indicator of a more complete understanding of visual appearance and motion. This paper proposes a new multimodal conditional video generation model. The model takes an image and a text description as input, obtains the motion information of the video through a text-feature encoding network and a motion-feature decoding network, and combines this with the input image to generate video with coherent motion. In addition, the method predicts video frames by applying affine transformations to the input image, which makes the generation model more controllable and its results more robust. Experimental results on the SBMG (single-digit bouncing MNIST GIFs), TBMG (two-digit bouncing MNIST GIFs), and KTH (Kungliga Tekniska Högskolan human actions) datasets show that the proposed method outperforms existing methods in both target clarity and video coherence. Moreover, qualitative evaluation and quantitative evaluation with the SSIM (structural similarity index) and PSNR (peak signal-to-noise ratio) metrics demonstrate that the proposed multimodal video-frame generation network plays a key role in the generation process.
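The abstract reports quantitative evaluation with PSNR, a standard per-frame fidelity metric for generated video. As a minimal illustration of how such a score is computed (the paper provides no code; the function and variable names below are my own, and the sketch assumes 8-bit grayscale frames):

```python
import math

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two grayscale frames (higher is better)."""
    pairs = [(r, g) for row_r, row_g in zip(reference, generated)
             for r, g in zip(row_r, row_g)]
    mse = sum((r - g) ** 2 for r, g in pairs) / len(pairs)  # mean squared error
    if mse == 0:
        return float("inf")  # identical frames: no noise at all
    return 10.0 * math.log10(max_val ** 2 / mse)

# Identical frames score infinitely high; a one-pixel perturbation lowers the score.
clean = [[128] * 64 for _ in range(64)]
noisy = [row[:] for row in clean]
noisy[0][0] = 138  # perturb a single pixel by 10 intensity levels
print(psnr(clean, clean))            # inf
print(round(psnr(clean, noisy), 2))  # ≈ 64.25
```

In practice, a per-frame PSNR like this is averaged over all frames of the generated clip and compared against the ground-truth video, which is how the tables in papers on SBMG/TBMG/KTH are typically produced.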

Key words: deep learning, video generation, video prediction, convolutional neural network, generative adversarial network (GAN)
