Noise Detection for Distant Supervised Named Entity Recognition

王嘉诚; 王凯; 王昊奋; 杜渂; 何之栋; 阮彤; 刘井平

doi:10.7544/issn1000-1239.202220999


Abstract

Many approaches to distantly supervised named entity recognition (NER) rely on reinforcement learning, exploiting its strong decision-making ability to filter noise from the automatically labeled data that distant supervision generates. However, the policy networks used in these approaches typically have simple architectures, which limits their ability to recognize noisy instances. Furthermore, they identify correct instances at the sentence level, so part of the correct information in a sentence is discarded. To address these problems, we propose a new reinforcement learning based method for distantly supervised NER, named RLTL-DSNER, which detects correct instances at the token level in the noisy data generated by distant supervision and thereby reduces the negative impact of noisy instances on the distantly supervised NER model. Specifically, we introduce a tag confidence function into the policy network to identify correct instances accurately. In addition, we propose a novel pre-training strategy for the NER model, which provides accurate state representations and effective reward values for the initial training of the reinforcement learning model and guides it to update in the right direction. Experiments on four datasets verify the superiority of RLTL-DSNER; on the NEWS dataset, it gains a 4.28% F1 improvement over the state-of-the-art methods.
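The contrast between sentence-level and token-level selection can be illustrated with a minimal sketch. It is not the RLTL-DSNER implementation: the function names, the example sentence, and the hard-coded `keep` decisions are all hypothetical, and in the actual method such decisions would come from the policy network rather than being fixed by hand.

```python
from typing import List, Tuple

# A token paired with its distantly supervised tag (hypothetical example data).
Token = Tuple[str, str]


def sentence_level_filter(sentence: List[Token], keep: List[bool]) -> List[Token]:
    """Keep the whole sentence only if every token is judged correct."""
    return list(sentence) if all(keep) else []


def token_level_filter(sentence: List[Token], keep: List[bool]) -> List[Token]:
    """Keep each token judged correct and drop only the flagged ones."""
    return [tok for tok, k in zip(sentence, keep) if k]


# Assumed scenario: distant supervision missed the entity "Paris" and tagged it O.
sentence = [("Alice", "B-PER"), ("visited", "O"), ("Paris", "O")]
keep = [True, True, False]  # hypothetical per-token decisions

sentence_kept = sentence_level_filter(sentence, keep)
token_kept = token_level_filter(sentence, keep)
```

Under these assumptions, the sentence-level filter discards the correctly labeled "Alice" along with the noisy token, while the token-level filter retains it. A real training pipeline would more likely mask the loss on rejected tokens than delete them from the sentence.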
