Citation: Zhu Rongjiang, Shi Yuheng, Yang Shuo, Wang Ziyi, Wu Xinxiao. Open-Vocabulary Multi-Label Action Recognition Guided by LLM Knowledge[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440522
Open-vocabulary multi-label action recognition aims to identify, within a single video, multiple human actions that were not seen during training. Compared with traditional action recognition, this task is more practical because it closely mirrors real-world scenarios and has broader application prospects, but it poses a significant challenge: effectively generalizing the model to unseen action categories. To address this issue, this paper proposes an open-vocabulary multi-label action recognition method guided by the knowledge of large language models. The method extracts the rich co-occurrence knowledge of action categories implicit in large language models and incorporates it into the prompt learning of visual-language models, facilitating information transfer between base classes and novel classes and thereby improving recognition performance on novel classes. In the experiments, we set two ratios of base action classes to novel action classes, 3:1 and 1:1, denoted "75% seen" and "50% seen" respectively. Experimental results on the AVA and MovieNet datasets show that, compared with existing methods, under the "75% seen" setting our method improves the mAP for novel action recognition by 1.95% and 1.21% on AVA and MovieNet, respectively; under the more challenging "50% seen" setting, it improves the mAP for novel action recognition by 2.59% and 1.06% on the two datasets, respectively.
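To make the knowledge-transfer idea in the abstract more concrete, the following is a minimal sketch, assuming the LLM knowledge is distilled into a class-by-class co-occurrence matrix and that prompt learning uses CoOp-style learnable context vectors; the matrix contents, mixing coefficient, class counts, and the function propagate_prompts are illustrative assumptions, not the paper's actual design.

```python
# Minimal, illustrative sketch (not the authors' implementation): it assumes
# the LLM-derived knowledge takes the form of a class-by-class co-occurrence
# matrix, and shows one way such a matrix could let CoOp-style prompt context
# flow from base classes to novel classes. All sizes and names are hypothetical.
import torch
import torch.nn.functional as F

NUM_BASE, NUM_NOVEL, CTX_DIM = 60, 20, 512   # hypothetical class counts / width
NUM_CLASSES = NUM_BASE + NUM_NOVEL

# Pairwise co-occurrence scores between action classes, e.g. parsed from LLM
# answers to "how likely are actions A and B to appear in the same clip?".
# Random values stand in for those scores here.
cooccurrence = torch.rand(NUM_CLASSES, NUM_CLASSES)
cooccurrence.fill_diagonal_(1.0)

# Learnable per-class prompt context vectors (CoOp-style prompt learning);
# during training, only base-class labels would provide supervision.
prompt_ctx = torch.nn.Parameter(0.02 * torch.randn(NUM_CLASSES, CTX_DIM))

def propagate_prompts(ctx: torch.Tensor, cooc: torch.Tensor) -> torch.Tensor:
    """Mix each class's context with the contexts of co-occurring classes,
    so novel classes inherit information learned on related base classes."""
    weights = F.softmax(cooc, dim=-1)         # row-normalize the class graph
    return 0.5 * ctx + 0.5 * weights @ ctx    # one propagation step

# The propagated contexts would then be prepended to class-name token
# embeddings and passed through a frozen CLIP text encoder to score frames.
mixed_ctx = propagate_prompts(prompt_ctx, cooccurrence)
print(mixed_ctx.shape)                        # torch.Size([80, 512])
```

The key design point this sketch illustrates is that novel-class prompts are never trained directly; they receive information only through the co-occurrence graph, which is what allows recognition of unseen actions to benefit from base-class training.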