Wang Jiacheng, Wang Kai, Wang Haofen, Du Wen, He Zhidong, Ruan Tong, Liu Jingping. Noise Detection for Distant Supervised Named Entity Recognition[J]. Journal of Computer Research and Development, 2024, 61(4): 916-928. DOI: 10.7544/issn1000-1239.202220999

Noise Detection for Distant Supervised Named Entity Recognition

Funds: This work was supported by the Shanghai Municipal Special Fund for Promoting High-quality Development of Industries (2021-GZL-RGZN-01018), the National Key Research and Development Program of China (2021YFC2701800, 2021YFC2701801), the Open Project of Zhejiang Lab (2019ND0AB01), and the Shanghai Sailing Program (23YF1409400).
  • Author Bio:

    Wang Jiacheng: born in 2000. Master's candidate. His main research interests include machine learning and information extraction.

    Wang Kai: born in 1997. Master's candidate. His main research interests include machine learning and data mining.

    Wang Haofen: born in 1982. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include knowledge graphs, natural language processing, and data mining.

    Du Wen: born in 1975. Professor-level senior engineer. His main research interests include big data and artificial intelligence.

    He Zhidong: born in 1989. PhD, senior engineer. His main research interests include complex networks and knowledge graphs.

    Ruan Tong: born in 1973. PhD, professor, PhD supervisor. Member of CCF. Her main research interests include knowledge graphs, data mining, and data quality assessment.

    Liu Jingping: born in 1991. PhD, lecturer. His main research interests include knowledge graphs and natural language processing.

  • Received Date: December 06, 2022
  • Revised Date: June 29, 2023
  • Available Online: January 29, 2024
  • Abstract: Many approaches to distantly supervised named entity recognition (NER) are based on reinforcement learning, exploiting its decision-making ability to detect noise in the automatically labeled data that distant supervision generates. However, the policy networks they use typically have simple structures, which limits their ability to recognize noisy instances. Furthermore, they identify correct instances at the sentence level, so part of the useful information in a sentence is discarded. In this paper, we propose a new reinforcement learning based method for distantly supervised NER, named RLTL-DSNER, which detects correct instances at the token level in the noisy data generated by distant supervision, aiming to reduce the negative impact of noisy instances on the distantly supervised NER model. Specifically, we introduce a tag confidence function to identify correct instances accurately. In addition, we propose a novel pre-training strategy for the NER model. This strategy provides accurate state representations and effective reward values for the initial training of the reinforcement learning model, helping to guide its updates in the right direction. We conduct experiments on four datasets to verify the superiority of RLTL-DSNER, which achieves a 4.28% F1 improvement on the NEWS dataset over state-of-the-art methods.
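
The difference between sentence-level and token-level selection described in the abstract can be illustrated with a minimal sketch. It is not the RLTL-DSNER implementation: the abstract does not define the tag confidence function or the policy network, so the per-token confidence values, the 0.5 threshold, and the function names below are all hypothetical.

```python
from math import log


def token_mask(confidences, threshold=0.5):
    """Token-level selection: keep each token whose (hypothetical)
    confidence in its distant label reaches the threshold."""
    return [c >= threshold for c in confidences]


def sentence_keep(confidences, threshold=0.5):
    """Sentence-level selection: keep the sentence only if every token passes."""
    return all(c >= threshold for c in confidences)


def masked_nll(label_probs, mask):
    """Average negative log-likelihood computed over the kept tokens only."""
    kept = [-log(p) for p, keep in zip(label_probs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Toy sentence labeled by dictionary matching. "Paris" is missing from the
# dictionary, so its distant tag "O" is probably wrong (an unlabeled entity).
tokens = ["Aspirin", "treats", "headache", "in", "Paris"]
distant_tags = ["B-CHEM", "O", "B-DIS", "O", "O"]

# Illustrative confidence that each distant tag is correct (made-up numbers).
confidences = [0.95, 0.90, 0.85, 0.92, 0.20]

mask = token_mask(confidences)
kept_tokens = [t for t, keep in zip(tokens, mask) if keep]

print("sentence kept:", sentence_keep(confidences))
print("tokens kept:  ", kept_tokens)
```

In this toy example, sentence-level selection discards the whole sentence because of one doubtful tag, whereas token-level selection keeps the four reliable tokens and excludes only the suspicious one from the loss.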
