Chemical-Induced Disease Relation Extraction Based on Biomedical Literature

Li Zhiheng (李智恒); Gui Yingyi (桂颖溢); Yang Zhihao (杨志豪); Lin Hongfei (林鸿飞); Wang Jian (王健)

doi:10.7544/issn1000-1239.2018.20160893

Chemical-Induced Disease Relation Extraction Based on Biomedical Literature

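Abstract (translated from the Chinese): Side-effect relations between chemicals and diseases have drawn increasing attention to chemical-disease relations. This paper presents CDRExtractor, a system that extracts chemical-induced disease (CID) relations from biomedical literature. The system first trains a sentence-level classifier to extract CID relations in which the chemical and the disease occur in the same sentence. During training of the sentence-level classifier, the feature kernel and the graph kernel are treated as two independent views, and a semi-supervised Co-training method trains the model on a small manually annotated training set together with a large unlabeled corpus. CDRExtractor then uses document-level chemical and disease features to train a document-level classifier that extracts cross-sentence CID relations at the document level. Finally, rules are used to merge the outputs of the two classifiers into the final result. Experiments show that CDRExtractor achieves an F-score of 67.72% on the test set of the CID subtask of the BioCreative V CDR evaluation.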

Abstract: Adverse drug reactions between chemicals and diseases have made chemical-disease relations (CDRs) a topic of growing concern. Automatic extraction of chemical-induced disease (CID) relations from the biomedical literature can support biocuration, new drug discovery, and drug safety surveillance. In this paper, we present a CID relation extraction system, called CDRExtractor, which extracts CID relations from biomedical literature at both the sentence and document levels. To extract CID relations located within a single sentence, we first manually annotate a sentence-level training set and use it to train a sentence-level classifier. To improve the performance of this classifier, the Co-training algorithm is used to exploit unlabeled data, with the feature kernel and the graph kernel serving as two independent views. CDRExtractor then uses a document-level classifier to extract CID relations that span sentences. This classifier uses document-level information (features) about each chemical-disease pair and returns CID relations at the document level. Finally, post-processing rules are applied to the union of the two classifiers' results to generate the final output. Experimental results show that CDRExtractor achieves an F-score of 67.72% on the test set of the BioCreative V CDR CID subtask.
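The Co-training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract names the feature kernel and the graph kernel as the two views but does not specify the classifiers or the selection schedule, so everything below is an assumption made for illustration: `CentroidClassifier` is a hypothetical nearest-centroid stand-in for the kernel classifiers, the task is assumed to be binary, and `rounds` and `per_round` are invented parameters. In each round, a classifier is trained on each view, each one nominates the unlabeled examples it is most confident about, and those examples join the shared labeled set with their predicted labels.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class CentroidClassifier:
    """Hypothetical stand-in for a kernel classifier: nearest class centroid."""

    def fit(self, X: List[Vector], y: List[int]) -> "CentroidClassifier":
        self.centroids: Dict[int, List[float]] = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    @staticmethod
    def _dist(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    def predict_with_confidence(self, x: Vector) -> Tuple[int, float]:
        # Confidence is the distance margin between the two nearest centroids.
        # Assumes at least two classes are present in the labeled data.
        dists = sorted((self._dist(x, c), label) for label, c in self.centroids.items())
        (d0, label), (d1, _) = dists[0], dists[1]
        return label, d1 - d0


def co_train(view_a, view_b, labels, pool_a, pool_b, rounds=5, per_round=1):
    """Co-training over two views; pool_a[i] and pool_b[i] describe the same example."""
    la, lb, y = list(view_a), list(view_b), list(labels)
    ua, ub = list(pool_a), list(pool_b)  # copies, so the caller's pools are untouched
    for _ in range(rounds):
        if not ua:
            break
        clf_a = CentroidClassifier().fit(la, y)
        clf_b = CentroidClassifier().fit(lb, y)
        picks: Dict[int, int] = {}
        # Each view nominates the pool examples it labels most confidently.
        for clf, pool in ((clf_a, ua), (clf_b, ub)):
            scored = [(clf.predict_with_confidence(x), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda s: s[0][1], reverse=True)
            for (label, _conf), i in scored[:per_round]:
                picks.setdefault(i, label)  # if both views pick i, view A's label wins
        # Move nominated examples into the labeled set (pop from the highest index down).
        for i in sorted(picks, reverse=True):
            la.append(ua.pop(i))
            lb.append(ub.pop(i))
            y.append(picks[i])
    return CentroidClassifier().fit(la, y), CentroidClassifier().fit(lb, y)
```

Calling `co_train` with a few labeled vectors per view and a larger unlabeled pool returns one retrained classifier per view. How the two views' predictions are then combined, and how sentence-level and document-level outputs are merged by rules, is not detailed in the abstract and is left out of the sketch.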
