ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2018, Vol. 55, Issue (1): 198-206. doi: 10.7544/issn1000-1239.2018.20160893

• Artificial Intelligence •

Chemical-Induced Disease Relation Extraction Based on Biomedical Literature

Li Zhiheng1, Gui Yingyi2, Yang Zhihao1, Lin Hongfei1, Wang Jian1

  1 (School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024); 2 (School of Optoelectronics, Beijing Institute of Technology, Beijing 100081) (zhihengli@mail.dlut.edu.cn)
  • Published: 2018-01-01
  • Online: 2018-01-01
  • Supported by: 
    National Natural Science Foundation of China (61272373, 61340020, 61572102, 61572098); Program for New Century Excellent Talents in University (NCET-13-0084); Fundamental Research Funds for the Central Universities (DUT14YQ213)

Abstract: Adverse drug reactions between chemicals and diseases have made chemical-disease relations (CDRs) a topic of growing interest. Automatic extraction of chemical-induced disease (CID) relations from the biomedical literature can support biocuration, new drug discovery, and drug safety surveillance. In this paper, we present CDRExtractor, a system that extracts CID relations from biomedical literature at both the sentence and document levels. To extract CID relations that occur within a single sentence, we first manually annotate a sentence-level training set and use it to train a sentence-level classifier. To improve the classifier's performance, we apply the semi-supervised Co-training algorithm, treating the feature kernel and the graph kernel as two independent views, so that the model learns from a small manually annotated training set together with a large unlabeled corpus. CDRExtractor then trains a document-level classifier on document-level features of each chemical-disease pair to extract CID relations that span sentences. Finally, post-processing rules are applied to the union of the two classifiers' outputs to generate the final results. Experimental results show that CDRExtractor achieves an F-score of 67.72% on the test set of the CID subtask of the BioCreative V CDR task.

Key words: information extraction, text mining, semi-supervised learning, Co-training, chemical-disease relation (CDR)
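
The Co-training step named in the abstract can be illustrated with a minimal sketch. It is not the CDRExtractor implementation: the abstract does not specify the classifiers, so a simple nearest-centroid classifier stands in for the feature-kernel and graph-kernel models, and all names (`centroid_fit`, `centroid_predict`, `co_train`) and the toy data are hypothetical. The sketch only shows the general idea: two classifiers, each trained on its own view, take turns pseudo-labeling the unlabeled examples they are most confident about, and both are retrained on the enlarged labeled set.

```python
def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def centroid_predict(model, x):
    """Return (label, confidence) for a two-class model.

    Confidence is the gap between the distances to the two centroids.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
        for label, c in model.items()
    )
    (d_near, label), (d_far, _) = dists[0], dists[1]
    return label, d_far - d_near


def co_train(lab_a, lab_b, labels, unl_a, unl_b, rounds=5, per_round=2):
    """Co-training over two views (a and b) of the same examples.

    Each round, every view labels the `per_round` unlabeled examples it is
    most confident about; those examples join the shared labeled set.
    """
    train_a, train_b, y = list(lab_a), list(lab_b), list(labels)
    pool = list(range(len(unl_a)))      # indices still unlabeled
    pseudo = {}                         # index -> pseudo-label
    for _ in range(rounds):
        if not pool:
            break
        model_a = centroid_fit(train_a, y)
        model_b = centroid_fit(train_b, y)
        picked = {}
        for model, unl in ((model_a, unl_a), (model_b, unl_b)):
            scored = sorted(
                ((centroid_predict(model, unl[i]), i) for i in pool),
                key=lambda t: -t[0][1],
            )
            for (label, _conf), i in scored[:per_round]:
                picked.setdefault(i, label)  # first view to pick wins
        for i, label in picked.items():
            train_a.append(unl_a[i])
            train_b.append(unl_b[i])
            y.append(label)
        pseudo.update(picked)
        pool = [i for i in pool if i not in picked]
    return centroid_fit(train_a, y), centroid_fit(train_b, y), pseudo


# Toy data (hypothetical): two labeled and six unlabeled examples,
# each described by a 2-D view "a" and a 1-D view "b".
lab_a, lab_b, labels = [[0.0, 0.0], [1.0, 1.0]], [[0.0], [5.0]], [0, 1]
unl_a = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [1.1, 0.9], [0.3, 0.4], [0.7, 0.6]]
unl_b = [[0.5], [4.5], [0.2], [5.2], [0.4], [4.8]]

model_a, model_b, pseudo = co_train(lab_a, lab_b, labels, unl_a, unl_b)
```

In a real system each view would be a stronger classifier with a calibrated confidence score, and the pseudo-labeled pool would be a large unlabeled corpus rather than six points.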

CLC Number: