

    Enhancing Keyframe Selection Method via LLM-Generated Pseudo-Labels for Long-Form Video Question Answering


       

      Abstract: Keyframe selection is an important technique for long-form video question answering: it locates key content within highly redundant footage and establishes an interpretable reasoning path. However, existing keyframe selection methods lack semantic sensitivity during end-to-end training, which introduces substantial irrelevant-frame noise and degrades both the accuracy and the interpretability of the model. To address this, we propose a pseudo-labels guided keyframe selection (PGKS) method. PGKS first uses a large language model (LLM) to semantically fuse each question with its answer into a global descriptive text, then applies a multimodal alignment model to compute semantic similarity scores between this description and the sampled video frames, thereby constructing frame-level pseudo-labels. Guiding the computation of frame-level similarity scores with these pseudo-labels effectively suppresses irrelevant-frame noise and enhances both the accuracy and the interpretability of keyframe selection. Because the LLM is used offline, decoupled from the training of the backbone network, the approach offers a low-cost way to exploit LLMs in resource-constrained scenarios. Furthermore, the selection mechanism balances the differentiability required for end-to-end training against the precision of hard (non-differentiable) ranking, and a sliding time window mechanism further improves the model's understanding of temporal relationships in video. Experimental results demonstrate that PGKS achieves an accuracy of 62.67% on the long-form video question answering dataset NExT-QA, outperforming existing methods of comparable model size and improving by 8.51% over the baseline without pseudo-label guidance.
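
The pipeline described above is compact enough to sketch. The snippet below illustrates only the pseudo-label construction step: a CLIP-style alignment model scores each sampled frame against the LLM-generated global description, and the normalized scores serve as frame-level pseudo-labels. This is a minimal sketch under assumptions, not the authors' implementation: the abstract does not name the alignment model, and the checkpoint, the min-max normalization, and the helper `frame_pseudo_labels` are illustrative choices.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical alignment model; the paper's actual multimodal model is not
# specified here, so a public CLIP checkpoint stands in for it.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_pseudo_labels(description: str, frames: list[Image.Image]) -> torch.Tensor:
    """Score each sampled frame against the fused Q&A description.

    Returns one pseudo-label per frame, min-max normalized to [0, 1]
    (the normalization is an assumption, not taken from the paper).
    """
    inputs = processor(text=[description], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, 1): temperature-scaled
    # cosine similarity between each frame and the single description.
    scores = out.logits_per_image.squeeze(-1)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
```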

       
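The abstract states that PGKS balances the differentiability needed for end-to-end training against the precision of hard ranking, but does not spell out the mechanism. A common way to obtain exactly that trade-off is a straight-through estimator, shown below purely as a hypothetical stand-in for whatever PGKS actually uses: the forward pass applies an exact, non-differentiable top-k mask, while gradients flow through a soft relaxation.

```python
import torch

def straight_through_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard top-k selection forward, soft gradients backward (illustrative)."""
    soft = torch.softmax(scores, dim=-1)                    # differentiable surrogate
    hard = torch.zeros_like(soft)
    hard.scatter_(-1, scores.topk(k, dim=-1).indices, 1.0)  # exact top-k mask
    # Forward value equals `hard`; the backward pass sees only `soft`.
    return hard + soft - soft.detach()
```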

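Likewise, the sliding time window mechanism is named but not specified. One plausible reading, assumed here rather than taken from the paper, is to smooth each frame's relevance score over a local temporal neighborhood so that selection favors temporally coherent segments instead of isolated frames; the window size and mean pooling below are illustrative.

```python
import torch
import torch.nn.functional as F

def windowed_scores(frame_scores: torch.Tensor, window: int = 5) -> torch.Tensor:
    """Average per-frame relevance over a sliding temporal window (window must be odd)."""
    x = frame_scores.view(1, 1, -1)                 # (batch=1, channels=1, time)
    pad = window // 2
    x = F.pad(x, (pad, pad), mode="replicate")      # preserve length after pooling
    return F.avg_pool1d(x, kernel_size=window, stride=1).view(-1)
```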
