Abstract:
Recognizing chemical compound and drug names in unstructured text is a task of great significance in biomedical text mining. Currently popular approaches are based on CRF models, which require large numbers of hand-crafted features, and they inevitably suffer from the tagging inconsistency problem (identical mentions within a document being assigned different labels). In this paper, we propose an attention-based BiLSTM-CRF architecture to mitigate these drawbacks. First, word embeddings are obtained from vast amounts of unlabeled biomedical text. Then the characters of the current word are fed to a BiLSTM layer to learn a character-level representation of that word. Next, the word and character representations are passed to another BiLSTM layer, which generates the adjacency (local) context representation of the current word. An attention mechanism then derives the current word's document-level context from the adjacency contexts of all words in the document together with the current word. Finally, a CRF layer predicts the label sequence of the document from the integration of the adjacency context and the document-level context. Experimental results show that our method improves the label consistency of identical mentions within the same document and achieves better performance (an F-score of 90.77%) than state-of-the-art methods on the BioCreative IV CHEMDNER corpus.
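The pipeline described above (character BiLSTM, word BiLSTM, document-level attention, CRF) can be illustrated with the following minimal sketch. It is not the authors' implementation: the framework (PyTorch), all dimensions, and the dot-product form of the attention are assumptions, and the CRF layer is only indicated by the emission scores the model returns (a CRF implementation such as the `torchcrf` package could consume them).

```python
# Minimal sketch of the abstract's architecture; NOT the authors' code.
# Assumptions: PyTorch, hypothetical layer sizes, dot-product attention,
# and the final CRF layer omitted (only emission scores are produced).
import torch
import torch.nn as nn

class AttnBiLSTMTagger(nn.Module):
    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=100, char_dim=25, char_hidden=25, word_hidden=100):
        super().__init__()
        # Word embeddings (pretrained on unlabeled biomedical text in the paper).
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Character-level BiLSTM -> character representation of each word.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # Word-level BiLSTM over [word emb; char rep] -> adjacency (local) context.
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        # Emission scores over tags; a CRF layer would sit on top of these.
        self.out = nn.Linear(4 * word_hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (doc_len,), char_ids: (doc_len, max_word_len)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_rep = torch.cat([h_n[0], h_n[1]], dim=-1)     # (doc_len, 2*char_hidden)
        x = torch.cat([self.word_emb(word_ids), char_rep], dim=-1)
        h, _ = self.word_lstm(x.unsqueeze(0))              # (1, doc_len, 2*word_hidden)
        h = h.squeeze(0)                                   # adjacency context
        # Document-level attention: each word attends to all words in the document.
        scores = h @ h.t() / h.size(-1) ** 0.5             # (doc_len, doc_len)
        doc_ctx = torch.softmax(scores, dim=-1) @ h        # (doc_len, 2*word_hidden)
        # Integrate adjacency and document-level context, then score the tags.
        return self.out(torch.cat([h, doc_ctx], dim=-1))   # emissions for a CRF
```

Under these assumptions, the returned emission scores for a whole document would be decoded jointly by a CRF, which is what encourages identical mentions in the same document to receive consistent labels.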