Abstract:
Open-vocabulary multi-label action recognition aims to identify human actions in videos, including action categories not seen during training. Compared to traditional action recognition, this task is more practical, as it closely mirrors real-world scenarios and has broader application prospects; however, effectively generalizing models to unseen action categories remains a significant challenge. To address this issue, this paper proposes an open-vocabulary multi-label action recognition method enhanced by the knowledge of large language models. The method extracts the rich co-occurrence knowledge of action categories implicit in large language models and incorporates it into prompt learning for visual-language models, facilitating information transfer between base classes and novel classes and thereby improving recognition performance on novel classes. In experiments, we set two ratios of base to novel action classes, namely 3:1 and 1:1, denoted "75% seen" and "50% seen" respectively. Experimental results on the AVA and MovieNet datasets show that, compared to existing methods, in the "75% seen" setting our method improves the mAP for novel action recognition by 1.95% and 1.21% on AVA and MovieNet, respectively. In the more challenging "50% seen" setting, our method improves the mAP for novel action recognition by 2.59% and 1.06% on the two datasets, respectively.