ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2018, Vol. 55, Issue 7: 1548-1556. doi: 10.7544/issn1000-1239.2018.20170506

• Information Processing •




An Attention-Based Approach for Chemical Compound and Drug Named Entity Recognition

Yang Pei, Yang Zhihao, Luo Ling, Lin Hongfei, Wang Jian   

  1. (School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024)
  • Online: 2018-07-01


Abstract: Recognizing chemical compound and drug names in unstructured text is of great significance in the field of biomedical text mining. The currently popular approaches are based on the CRF model, which requires large amounts of hand-crafted features, and they inevitably suffer from the tagging non-consistency problem (identical mentions in a document are tagged with different labels). In this paper, we propose an attention-based BiLSTM-CRF architecture to mitigate these drawbacks. First, word embeddings are learned from vast amounts of unlabeled biomedical text. Then the characters of the current word are fed into a BiLSTM layer to learn a character-level representation of that word. Next, the word and character representations are fed into another BiLSTM layer, which produces the word's local (adjacency) context representation. An attention mechanism then computes the word's document-level context representation from its local context and the local contexts of all words in the document. Finally, a CRF layer predicts the label sequence of the whole document from the combination of the local and document-level contexts. Experimental results show that our method improves the consistency of mention labels within the same document, and it achieves better performance (an F-score of 90.77%) than state-of-the-art methods on the BioCreative IV CHEMDNER corpus.
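The document-level attention step described in the abstract can be sketched in a few lines: each word attends over the local context vectors of every word in the document, and the resulting document-level context is concatenated onto the word's own representation before the CRF layer. The pure-Python sketch below uses simple dot-product scoring; the scoring function and helper names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def document_attention(H):
    """H: list of local-context vectors h_1..h_n for the words of one
    document. For each word i, compute attention weights over all words,
    build the document-level context g_i as the weighted sum of the
    local contexts, and return the concatenation [h_i ; g_i]."""
    out = []
    for h_i in H:
        weights = softmax([dot(h_i, h_j) for h_j in H])
        g_i = [sum(w * h_j[k] for w, h_j in zip(weights, H))
               for k in range(len(h_i))]
        out.append(h_i + g_i)  # list concatenation = vector concat
    return out
```

Note how this construction addresses tagging consistency: two mentions whose local BiLSTM contexts are identical receive identical attention weights, hence identical document-level contexts, so the downstream CRF sees the same features for both and is pushed toward assigning them the same label.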

Key words: long short-term memory (LSTM), attention, conditional random fields (CRF), chemical compound and drug name recognition, deep learning