Abstract:
Open-vocabulary multi-label action recognition aims to identify human actions in videos, including action categories not seen during training. Compared with traditional action recognition, this task is more practical, as it closely mirrors real-world scenarios and has broader application prospects; however, it poses a significant challenge in generalizing models to unseen action categories. To address this challenge, we propose an open-vocabulary multi-label action recognition method enhanced by the knowledge embedded in large language models. Our method extracts the rich co-occurrence knowledge of action categories implicit in large language models and incorporates it into prompt learning for vision-language models, facilitating information transfer between base and novel action classes and thereby improving recognition of novel classes. In our experiments, we use two ratios of base to novel action classes, 3:1 and 1:1, denoted "75% seen" and "50% seen", respectively. Experimental results on the AVA and MovieNet datasets show that, compared with existing methods, our method improves the mAP for novel action recognition by 1.95% and 1.21% in the "75% seen" setting on AVA and MovieNet, respectively. In the more challenging "50% seen" setting, our method improves the mAP for novel action recognition by 2.59% and 1.06% on the two datasets, respectively.