Abstract:
With the wide adoption of online video sharing applications, massive numbers of videos are uploaded every day. Faced with videos at this scale, users need more refined retrieval services, and how to organize and manage such videos on the Internet so that users can retrieve them efficiently and accurately has become one of the most challenging topics in video analysis. In many scenarios, audio and visual information must appear together to determine a video event. This paper therefore proposes a multimodal concept detection task based on audio-visual information. First, a multimodal concept is defined as a noun-verb pair, in which the noun and verb carry the visual and audio information respectively; the audio and visual information within a multimodal concept are correlated. Second, this paper performs end-to-end multimodal concept detection with convolutional neural networks. Specifically, the audio-visual correlation is exploited to train a joint learning network. Experimental results show that the joint network trained with audio-visual correlation outperforms single-modality visual or audio networks. Third, the joint network learns fine-grained features. On the Huawei video concept detection task, visual features extracted from the joint network outperform features extracted from an ImageNet pre-trained network on some specific concepts. On the ESC-50 audio classification task, acoustic features from the joint network outperform those from a VGG network pre-trained on YouTube-8M by about 5.7%.
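The abstract does not give implementation details, so the following PyTorch sketch only illustrates one plausible form of a two-branch joint audio-visual network: the branch architectures, embedding size, loss weighting, and the cosine-similarity correlation term are all assumptions for illustration, not the paper's stated design.

```python
import torch
import torch.nn as nn

class JointAVNet(nn.Module):
    """Illustrative two-branch audio-visual network (all details assumed)."""
    def __init__(self, num_concepts, embed_dim=128):
        super().__init__()
        # Visual branch: small CNN over RGB frames (placeholder depth/widths).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio branch: CNN over single-channel log-mel spectrograms.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Joint classifier over concatenated embeddings predicts the
        # noun-verb multimodal concept label.
        self.classifier = nn.Linear(2 * embed_dim, num_concepts)

    def forward(self, frames, spectrograms):
        v = self.visual(frames)
        a = self.audio(spectrograms)
        logits = self.classifier(torch.cat([v, a], dim=1))
        return logits, v, a


def correlation_loss(v, a):
    # One plausible correlation term: pull matched audio-visual embedding
    # pairs together via cosine similarity (an assumption, not the paper's
    # stated objective).
    v = nn.functional.normalize(v, dim=1)
    a = nn.functional.normalize(a, dim=1)
    return 1.0 - (v * a).sum(dim=1).mean()


# Training-step sketch: concept classification loss plus the correlation term.
model = JointAVNet(num_concepts=100)
frames = torch.randn(4, 3, 224, 224)      # batch of video frames
spectrograms = torch.randn(4, 1, 64, 96)  # batch of log-mel spectrograms
labels = torch.randint(0, 100, (4,))

logits, v, a = model(frames, spectrograms)
loss = nn.functional.cross_entropy(logits, labels) + 0.5 * correlation_loss(v, a)
loss.backward()
```

In this sketch the correlation term ties the two modality embeddings together during training, which also lets either branch be used on its own afterwards as a feature extractor, consistent with the transfer experiments the abstract describes.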