    Open-Vocabulary Multi-Label Action Recognition Guided by LLM Knowledge

      Abstract: Open-vocabulary multi-label action recognition aims to identify multiple human actions in a video, including action categories that were not seen during the training phase. Compared with traditional action recognition, this task better matches real-world scenarios and has broader application prospects. However, it is highly challenging, because the model must generalize effectively to unseen action categories. To address this issue, this paper proposes an open-vocabulary multi-label action recognition method guided by the knowledge of large language models. The method mines the rich co-occurrence knowledge of action categories embedded in large language models and injects this knowledge into the prompt learning of vision-language models, enabling information transfer between base classes and novel classes and thereby improving the recognition of novel classes. In the experiments, the ratio of base action classes to novel action classes is set to 3:1 and 1:1, denoted as "75% seen" and "50% seen", respectively. Experimental results on the AVA and MovieNet datasets show that, compared with existing methods, the proposed method improves the mAP for novel action categories by 1.95% on AVA and 1.21% on MovieNet in the "75% seen" setting, and by 2.59% and 1.06%, respectively, in the more challenging "50% seen" setting.
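
      The abstract describes the key idea only at a high level. Purely as an illustration, the sketch below shows one way LLM-derived co-occurrence knowledge could be folded into prompt learning so that prompts for novel classes are derived from learnable prompts for base classes. All names (CoOccurrencePromptLearner, co_matrix, multilabel_logits) and design choices (row-normalized mixing, CLIP-style cosine scoring) are assumptions made for this sketch, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the paper's code): transfer prompt information
# from base to novel action classes via an LLM-derived co-occurrence matrix,
# in the spirit of the method described in the abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoOccurrencePromptLearner(nn.Module):
    """Learn prompt vectors for base classes and derive novel-class prompts by
    mixing base prompts with weights taken from an action co-occurrence matrix
    (assumed to be extracted offline by querying a large language model)."""

    def __init__(self, num_base, num_novel, embed_dim, co_matrix):
        super().__init__()
        # Learnable prompt context for base classes only (seen during training).
        self.base_prompts = nn.Parameter(torch.randn(num_base, embed_dim) * 0.02)
        # co_matrix[i, j]: LLM-estimated co-occurrence of novel class i with base class j.
        # Row-normalize so each novel prompt is a convex combination of base prompts.
        weights = co_matrix / co_matrix.sum(dim=1, keepdim=True).clamp(min=1e-6)
        self.register_buffer("novel_weights", weights)  # (num_novel, num_base)

    def forward(self):
        # Novel-class prompts are transferred from base-class prompts, so no
        # novel-class annotations are needed during training.
        novel_prompts = self.novel_weights @ self.base_prompts
        return torch.cat([self.base_prompts, novel_prompts], dim=0)


def multilabel_logits(video_feat, class_prompts, temperature=0.07):
    """CLIP-style cosine similarity between a video feature and per-class prompt
    features; each class is scored independently (multi-label), e.g. with BCE."""
    v = F.normalize(video_feat, dim=-1)      # (batch, embed_dim)
    t = F.normalize(class_prompts, dim=-1)   # (num_classes, embed_dim)
    return v @ t.t() / temperature           # (batch, num_classes)


if __name__ == "__main__":
    num_base, num_novel, dim = 6, 2, 32
    # Toy stand-in for an LLM-derived co-occurrence matrix (novel x base).
    co_matrix = torch.rand(num_novel, num_base)
    learner = CoOccurrencePromptLearner(num_base, num_novel, dim, co_matrix)
    video_feat = torch.randn(4, dim)          # stand-in for a visual encoder output
    logits = multilabel_logits(video_feat, learner())
    print(logits.shape)  # torch.Size([4, 8]) -> scores for base + novel classes
```

      In this sketch only the base-class prompts receive gradients, so a novel class benefits solely through its estimated co-occurrence with base classes; the actual method may use a different prompt structure or propagation scheme.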
