

    Enhancing Keyframe Selection via LLM-Generated Pseudo-Labels for Long-form Video Question Answering


       

Abstract: Keyframe selection is an important technique for long-form video question answering: it localizes key content within highly redundant video and establishes an interpretable reasoning path. However, existing keyframe selection methods suffer from insufficient semantic sensitivity during end-to-end training, which introduces a large amount of irrelevant-frame noise and degrades the model's accuracy and interpretability. To address this, we propose a Pseudo-Labels Guided Keyframe Selection (PGKS) model. PGKS first uses a large language model to semantically fuse the question and answer into a global descriptive text, and then employs a multimodal alignment model to compute semantic similarity scores between this description and the sampled video frames, thereby constructing frame-level pseudo-labels. Guiding the computation of frame-level similarity scores with these pseudo-labels effectively suppresses irrelevant-frame noise and enhances both the accuracy and interpretability of keyframe selection. Furthermore, our selection method balances the differentiability required for end-to-end training against the precision of hard ranking, and introduces a sliding time-window mechanism to further improve the model's understanding of temporal relationships in videos. Experimental results show that PGKS achieves an accuracy of 62.67% on the long-form video question answering dataset NExT-QA, outperforming existing methods of comparable model size and improving by 8.51% over a baseline without pseudo-label guidance.
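The pseudo-label construction step can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes a CLIP-style alignment model has already embedded the LLM-generated description and the sampled frames into a shared space, and it uses random vectors as stand-ins for real embeddings; the softmax normalization is likewise a plausible choice, not confirmed by the source.

```python
import numpy as np

# Sketch of frame-level pseudo-label construction (hypothetical).
# Random vectors stand in for embeddings that a real multimodal
# alignment model (e.g. CLIP-style) would produce.
rng = np.random.default_rng(0)
d = 64              # embedding dimension (placeholder)
n_frames = 16       # number of uniformly sampled frames

text_emb = rng.normal(size=d)                 # global description embedding
frame_embs = rng.normal(size=(n_frames, d))   # one embedding per frame

def cosine_scores(text, frames):
    """Cosine similarity between one text vector and each frame."""
    text = text / np.linalg.norm(text)
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    return frames @ text

scores = cosine_scores(text_emb, frame_embs)

# A softmax over frames turns raw similarities into a pseudo-label
# distribution that can supervise the selector's frame scores.
pseudo_labels = np.exp(scores - scores.max())
pseudo_labels /= pseudo_labels.sum()
print(pseudo_labels.shape)  # (16,)
```

In training, these per-frame pseudo-labels would serve as soft targets for the model's own frame-level relevance scores, penalizing mass placed on semantically irrelevant frames.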
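The balance between differentiability and hard-ranking precision mentioned above is commonly realized with a straight-through-style combination of a soft score distribution and an exact top-k mask. The sketch below shows that general pattern under assumed names and a chosen temperature; it is not necessarily PGKS's exact formulation.

```python
import numpy as np

def soft_topk_mass(scores, k, temperature=0.1):
    """Differentiable surrogate: a temperature softmax scaled by k,
    giving a soft 'selection mass' over frames that sums to k."""
    z = (scores - scores.max()) / temperature
    p = np.exp(z) / np.exp(z).sum()
    return k * p

def hard_topk_mask(scores, k):
    """Exact top-k indicator (precise but non-differentiable)."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

frame_scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
k = 3

soft = soft_topk_mass(frame_scores, k)
hard = hard_topk_mask(frame_scores, k)

# Straight-through combination: the forward value is the hard mask,
# while gradients would flow through the soft term (in PyTorch this
# would be written as soft + (hard - soft).detach()).
combined = soft + (hard - soft)   # numerically equal to `hard`
print(hard)  # [0. 1. 0. 1. 0. 1.]
```

The soft term keeps the selector trainable end-to-end, while the hard mask guarantees that exactly k frames are passed downstream.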
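One simple reading of the sliding time-window mechanism is a moving average over per-frame relevance scores, so that temporally contiguous evidence is favored over isolated spikes. The window size and edge padding below are assumptions for illustration only.

```python
import numpy as np

def sliding_window_scores(frame_scores, window=3):
    """Average each frame's score with its temporal neighbours
    (simple moving average with edge padding), damping isolated
    high scores and rewarding contiguous runs of relevant frames."""
    pad = window // 2
    padded = np.pad(frame_scores, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# An isolated spike (index 1) vs. a contiguous relevant run (4-6).
scores = np.array([0.0, 1.0, 0.0, 0.0, 0.9, 0.8, 0.7, 0.0])
smoothed = sliding_window_scores(scores, window=3)
print(len(smoothed))  # 8
```

After smoothing, the frame inside the contiguous run scores higher than the isolated spike, which is the kind of temporal coherence a windowed mechanism is meant to inject.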

       

