Abstract:
Open-vocabulary multi-label action recognition aims to identify human actions in videos, including action categories not seen during training. Compared with traditional action recognition, this task is more practical, as it closely mirrors real-world scenarios and has broader application prospects; however, it poses a significant challenge in generalizing models to unseen action categories. To address this challenge, we propose an open-vocabulary multi-label action recognition method enhanced by the knowledge embedded in large language models. Our method extracts the rich co-occurrence knowledge of action categories implicit in large language models and incorporates it into prompt learning for vision-language models, facilitating information transfer between base and novel action classes and thereby improving recognition of novel classes. In our experiments, we use two ratios of base to novel action classes, 3:1 and 1:1, denoted "75% seen" and "50% seen", respectively. Experimental results on the AVA and MovieNet datasets show that, compared with existing methods, our method improves the mAP for novel action recognition by 1.95% and 1.21% in the "75% seen" setting on AVA and MovieNet, respectively. In the more challenging "50% seen" setting, our method improves the mAP for novel action recognition by 2.59% and 1.06% on the two datasets, respectively.