A Multimodal Adaptive Entity Recognition Method Based on Reinforcement Strategy Feedback

焦明海; 樊本航; 王静; 彭玉怀

doi:10.7544/issn1000-1239.202550273

Abstract

Named entity recognition (NER) aims to identify entities of specific semantic categories, along with their types, in unstructured text. With the rapid growth of social media, text is often accompanied by visual content, forming multimodal data. To improve recognition accuracy, multimodal NER (MNER) methods exploit semantic information from different modalities so that cross-modal features complement one another and can be deeply fused. However, representational differences between modalities may introduce visual noise that interferes with entity recognition, and unclear entity references or vague contextual semantics within the text itself make recognition harder. To address these problems, we propose an MNER method based on reinforcement strategy feedback and an adaptive loss mechanism. First, the method adopts a three-stage chain-of-thought (CoT) reasoning process based on GPT-4o to form a progressive reasoning framework. This framework incorporates an adaptive feedback mechanism from reinforcement learning to score the degree of matching between image and text, and an adaptive decision function then filters out visual noise. Second, four task-specific loss functions are designed and jointly optimized through an adaptive weighted fusion strategy to reduce the recognition uncertainty caused by contextual ambiguity. Experiments on two public datasets, Twitter-2015 and Twitter-2017, show that the proposed method achieves overall F1 scores of 86.45% and 93.80%, respectively, significantly outperforming current mainstream baseline models.
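The abstract names two mechanisms without giving their form: a decision function that filters images by their image-text match score, and an adaptive weighted fusion of four task losses. The sketch below is only one plausible reading, not the authors' formulation. The hard threshold, the softmax weighting, and the names `visual_gate` and `fuse_losses` are all assumptions introduced for illustration.

```python
import math
from typing import Sequence


def visual_gate(match_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: keep the image only when its
    image-text match score reaches the threshold (assumed value)."""
    return match_score >= threshold


def fuse_losses(losses: Sequence[float], logits: Sequence[float]) -> float:
    """Hypothetical adaptive fusion: combine task losses with
    softmax-normalised weights derived from per-task logits."""
    if len(losses) != len(logits):
        raise ValueError("losses and logits must have the same length")
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * l for w, l in zip(weights, losses))


# Four placeholder task losses with equal logits give a plain average.
combined = fuse_losses([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```

In a trained model the logits would be adjusted during optimization rather than fixed; how the paper adapts its weights is not stated in the abstract.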
