ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (12): 2669-2680. doi: 10.7544/issn1000-1239.2016.20160611

• Other Applied Technologies •



  1. (College of Computer Science and Technology, Donghua University, Shanghai 201620)
  • Published: 2016-12-01

Structured Processing for Pathological Reports Based on Dependency Parsing

Tian Chiyuan, Chen Dehua, Wang Mei, Le Jiajin

  1. (College of Computer Science and Technology, Donghua University, Shanghai 201620)
  • Online: 2016-12-01

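Abstract (translated from the Chinese): The text of pathological examination reports is usually unstructured, which hinders automatic analysis and processing by computers. Text structuring currently relies mainly on information relation extraction, but the semantic peculiarities of pathological examination reports make relation extraction from Chinese text challenging. To address this, a structuring method for pathological examination reports is designed. First, a synonym table for the reports is obtained with a neural network language model, and different words that share one meaning are merged. On this basis, dependency trees are generated for the report text, and two pruning strategies, short-sentence segmentation and information annotation, are proposed to simplify the initially generated trees, which makes the grammatical relations clearer and improves the accuracy of the structured results. The dependency parsing results are then used to extract indices and their corresponding values from Chinese examination reports, and a structured template is generated automatically. The method was validated on pathological examination reports actually used by physicians; the results show accuracies of 82.91% for index-term extraction and 79.11% for index-value extraction, laying a foundation for related research.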

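Key words (translated from the Chinese): medical data, pathological reports, dependency parsing, structured text processing, neural network language model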

Abstract: Most pathological reports are unstructured text that computers cannot analyze directly. Current research on text structuring focuses mainly on information extraction, but the particular syntactic features of pathological reports make it difficult to extract information relations. To solve this problem, this paper proposes a method for structuring pathological reports based on syntactic and semantic features. First, a synonym lexicon is constructed with a neural network language model so that different words with the same meaning are merged. Next, dependency trees are generated from the preprocessed reports, and short-sentence segmentation and annotation are applied as pruning strategies to simplify the trees, which makes the grammatical relations of the medical text clearer and improves the quality of the structured results. Finally, medical examination indices and their values are extracted from the Chinese reports as key-value pairs, and the structured text is generated automatically. Experiments on real pathological report data sets show that the method achieves an accuracy of 82.91% for index extraction and 79.11% for index-value extraction, which provides a foundation for related studies.

Key words: medical data, pathological reports, dependency parsing, text structured processing, neural network language model
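
The pipeline summarized in the abstract (merge synonyms, then read index-value pairs off a dependency tree) can be illustrated with a minimal sketch. It is not the authors' implementation. The synonym table, the set of index terms, the dependency labels (`nsubj`, `amod`, `nummod`), the two pairing rules, and the English stand-in clauses are all assumptions made for illustration, and the parses are written by hand rather than produced by a parser.

```python
from dataclasses import dataclass

# Hypothetical synonym table mapping variant terms to one canonical index name.
# The paper derives its table from a neural network language model; this one is hand-written.
SYNONYMS = {"dimension": "size", "edge": "margin"}

# Hypothetical set of canonical index names to look for.
INDEX_TERMS = {"size", "margin"}


@dataclass
class Token:
    idx: int   # 1-based position within the clause
    text: str
    head: int  # position of the governing token, 0 for the root
    rel: str   # dependency label (illustrative label set)


def normalize(word):
    """Replace a synonym with its canonical form, if one is listed."""
    return SYNONYMS.get(word, word)


def extract_pairs(clause):
    """Return {index: value} for the index terms found in one parsed clause."""
    by_idx = {tok.idx: tok for tok in clause}
    pairs = {}
    for tok in clause:
        name = normalize(tok.text)
        if name not in INDEX_TERMS:
            continue
        if tok.rel == "nsubj" and tok.head in by_idx:
            # Index term is the subject; its governing predicate holds the value.
            pairs[name] = by_idx[tok.head].text
        else:
            # Otherwise take the first modifier attached to the index term.
            for child in clause:
                if child.head == tok.idx and child.rel in {"amod", "nummod"}:
                    pairs[name] = child.text
                    break
    return pairs


# Two short clauses with hand-written parses, standing in for parser output
# after short-sentence segmentation.
clause_a = [Token(1, "dimension", 2, "nsubj"), Token(2, "3cm", 0, "root")]
clause_b = [Token(1, "negative", 2, "amod"), Token(2, "edge", 0, "root")]

record = {}
for clause in (clause_a, clause_b):
    record.update(extract_pairs(clause))
print(record)
```

Processing one short clause at a time keeps each tree shallow, so a couple of local rules are enough to pair an index with its value. That is the intuition behind the segmentation step the abstract mentions, though the paper's actual rules may differ.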