Domain Named Entity Recognition Combining GAN and BiLSTM-Attention-CRF
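Abstract: Domain-specific named entity recognition typically suffers from a shortage of in-domain annotated data and from inconsistent annotation of entities within the same document, caused by the diversity of entity names. To address these problems, we exploit the data-generating ability of the generative adversarial network (GAN) and combine a GAN with the BiLSTM-Attention-CRF model. First, with BiLSTM-Attention as the GAN generator and a CNN as the discriminator, we consolidate from a crowd-annotated dataset the positive annotations whose distribution is consistent with that of the expert annotations, which addresses the shortage of in-domain annotated data. Second, we introduce a document-level global vector into the BiLSTM-Attention-CRF model and compute the relation between each word and this global vector to obtain a new feature representation for the word, which addresses the inconsistent annotation of entities within the same document. Finally, experiments on a crowd-annotated dataset from the information security domain show that the model significantly outperforms comparable models and methods on all metrics.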
Keywords:
- domain named entity recognition
- generative adversarial network
- crowd-annotated data
- entity annotation consistency
- BiLSTM-Attention-CRF model
Abstract: Domain named entity recognition usually suffers from a lack of in-domain annotated data and from inconsistent entity annotation within the same document, caused by the diversity of entity names in the domain. To address these problems, our work combines the generative adversarial network (GAN), which can generate data, with the BiLSTM-Attention-CRF model. First, BiLSTM-Attention is used as the generator of the GAN and a CNN is used as the discriminator. The two models are trained with the crowd annotations and the expert annotations, respectively, and consolidate from the crowd annotations the positive annotation data whose distribution is consistent with that of the expert annotations, which addresses the lack of annotated data in the domain. Second, we introduce a document-level global feature into the BiLSTM-Attention-CRF model and use it to obtain a new feature representation for each word in the document, which addresses the inconsistent annotation of an entity within the same document caused by the diversity of entity names. Finally, taking crowd annotations in the information security field as a sample, we carry out a comprehensive comparative evaluation in which the learned common features are applied to train the BiLSTM-Attention-CRF model for recognizing named entities in the information security field. The results show that the proposed model outperforms existing models and methods on all evaluation metrics.
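The document-level global feature described above can be illustrated with a minimal sketch. The abstract does not say how the global vector is built or how the word-to-vector relation is computed, so every detail here is an assumption: the global vector is taken as the mean of the word vectors, the relation is a scaled dot product normalized by a softmax, and the new representation is the word vector concatenated with the weighted global vector. The function names `global_vector` and `document_aware_features` are hypothetical.

```python
import math


def global_vector(doc):
    """Document-level vector: mean of all word vectors (an assumed choice)."""
    n = len(doc)
    dim = len(doc[0])
    return [sum(vec[i] for vec in doc) / n for i in range(dim)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def document_aware_features(doc):
    """Return (features, weights) for a document given as a list of word vectors.

    Assumption: each word is scored against the global vector with a scaled
    dot product; scores are normalized over the document with a softmax.
    """
    g = global_vector(doc)
    scale = math.sqrt(len(g))
    scores = [dot(w, g) / scale for w in doc]

    # Numerically stable softmax over all words in the document.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # New feature: original word vector concatenated with the global vector
    # scaled by that word's weight (an assumed way to combine the two).
    features = [list(w) + [a * x for x in g] for w, a in zip(doc, weights)]
    return features, weights


# Toy document: three words with 2-dimensional vectors (hypothetical values).
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
features, weights = document_aware_features(doc)
```

In this toy document the first and third words share a vector, so they receive identical weights and identical features, which is the kind of document-level consistency the global vector is meant to encourage. In the full model these features would feed the CRF layer rather than being used directly.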