Structured Processing for Pathological Reports Based on Dependency Parsing

Tian Chiyuan (田驰远); Chen Dehua (陈德华); Wang Mei (王梅); Le Jiajin (乐嘉锦)

doi:10.7544/issn1000-1239.2016.20160611

Abstract

Text in pathological examination reports is usually unstructured, which makes it hard for computers to analyze and process automatically. Current work on structuring text relies mainly on information relation extraction, but the particular syntactic and semantic characteristics of pathological reports make relation extraction from Chinese text challenging. To address this problem, we design a structuring method tailored to pathological examination reports. First, a neural network language model is used to build a synonym lexicon for the reports, so that different words with the same meaning are merged. On this basis, dependency trees are generated for the report text, and a pruning strategy based on short-sentence segmentation and information annotation is proposed to simplify the initially generated trees; this makes the grammatical relations clearer and improves the accuracy of the structured results. Finally, the dependency parsing results are used to extract examination indices and their corresponding values from Chinese reports, and a structured template is generated automatically. Experiments on real pathological examination reports used by doctors show that the method reaches an accuracy of 82.91% on index-term extraction and 79.11% on index-value extraction, laying a foundation for related research.
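The pipeline outlined in the abstract (synonym merging, short-sentence segmentation, then reading index/value pairs off a dependency tree) can be illustrated with a toy sketch. This is not the authors' implementation: the synonym table, the dependency labels (`SBV`, `HED`), the hand-written parses, and every function name below are assumptions made for illustration. The paper derives its synonym lexicon from a neural network language model and uses a real dependency parser plus an annotation step, none of which is reproduced here.

```python
import re
from dataclasses import dataclass

# Hypothetical synonym table mapping variants to one canonical term.
# In the paper this lexicon is learned with a neural language model.
SYNONYMS = {"肿物": "肿块", "包块": "肿块"}


@dataclass
class Token:
    idx: int   # 1-based position inside the clause
    word: str
    head: int  # idx of the governing token, 0 for the root
    rel: str   # dependency label (illustrative label set)


# Hand-written parses standing in for the output of a dependency parser.
HAND_PARSES = {
    "包块质硬": [Token(1, "包块", 2, "SBV"), Token(2, "质硬", 0, "HED")],
    "边界清楚": [Token(1, "边界", 2, "SBV"), Token(2, "清楚", 0, "HED")],
}


def split_clauses(text):
    """Cut a report sentence into short clauses at commas, semicolons, and full stops."""
    return [c for c in re.split(r"[，,；;。]", text) if c]


def normalize(word):
    """Map a word to its canonical synonym, if the table has one."""
    return SYNONYMS.get(word, word)


def extract_pairs(tokens):
    """Pair each subject-like token with the token that governs it."""
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for t in tokens:
        if t.rel == "SBV" and t.head in by_idx:
            pairs.append((normalize(t.word), by_idx[t.head].word))
    return pairs


def structure(text):
    """Build an index -> value record from one report sentence."""
    record = {}
    for clause in split_clauses(text):
        for index, value in extract_pairs(HAND_PARSES.get(clause, [])):
            record[index] = value
    return record


if __name__ == "__main__":
    print(structure("包块质硬，边界清楚。"))
```

Splitting into short clauses before parsing keeps each tree small, so a single subject-to-head rule is enough in this toy case; real reports would need more relation types and the annotation step described in the abstract.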
