Construction of Large-Scale Disease Terminology Graph with Common Terms

Zhang Chentong; Zhang Jiaying; Zhang Zhixing; Ruan Tong; He Ping; Ge Xiaoling

doi:10.7544/issn1000-1239.2020.20190747

¹(East China University of Science and Technology, Shanghai 200237)
²(Shanghai Hospital Development Center, Shanghai 200041)
³(Children’s Hospital of Fudan University, Shanghai 201108) (chentong_zhang@163.com)

Funds: This work was supported by the National Natural Science Foundation of China (61772201) and the National Key Research and Development Program of China (2018YFC0910500).

CLC number: TP391
Publication date: 2020-10-31

Abstract

The National Health and Family Planning Commission requires medical institutions to use ICD (International Classification of Diseases) codes. However, because clinical disease descriptions contain many common terms, the diagnostic names entered in electronic medical records match ICD codes directly at a low rate. Based on real diagnostic data from a regional healthcare platform, this paper constructs a disease terminology graph that incorporates common terms. Specifically, building on a rule-based algorithm that uses disease components, this paper proposes a hypernymy recognition algorithm based on data augmentation and the pre-trained BERT (bidirectional encoder representations from transformers) model. The approach identifies synonymy and hypernymy between more than 50 000 common diagnostic terms and the diseases in ICD10 (International Classification of Diseases 10th Revision, Chinese version), then further fuses the hierarchical structure of ICD11 (International Classification of Diseases 11th Revision, Chinese version). This paper also proposes a task allocation method based on a disease-department association graph to support manual verification. The resulting large-scale disease terminology graph comprises 94 478 disease entities connected by 1 460 synonymy relations and 46 508 hypernymy relations. Evaluation experiments show that the coverage of clinical diagnostic data based on the disease terminology graph is 75.31% higher than that of direct mapping based on ICD10. In addition, compared with manual coding by doctors, automatic coding with the disease terminology graph reduces coding time by about 59.75%, with an accuracy of 85%.

Keywords:
- common terms
- disease terminology graph
- ICD (International Classification of Diseases)
- relationship recognition
- verification
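
To make the coding step concrete, the following is a minimal sketch of how a terminology graph with synonymy and hypernymy edges might map a clinician's common term to a code, and how coverage could be compared against direct mapping. Everything in it is an illustrative assumption rather than the authors' implementation or data: the terms, the placeholder codes `C1` and `C2`, the dictionaries `ICD_CODES`, `SYNONYMS`, and `HYPERNYMS`, and the functions `resolve_code` and `coverage` are all hypothetical.

```python
from collections import deque

# Toy data: every term, code, and edge here is a made-up placeholder.
ICD_CODES = {"diabetes": "C1", "type 2 diabetes": "C2"}
SYNONYMS = {"T2DM": "type 2 diabetes"}  # common term -> coded term
HYPERNYMS = {  # narrower term -> broader terms
    "poorly controlled type 2 diabetes": ["type 2 diabetes"],
    "type 2 diabetes": ["diabetes"],
}


def resolve_code(term):
    """Return the code of the term, its synonym, or its nearest coded ancestor."""
    start = SYNONYMS.get(term, term)
    queue, seen = deque([start]), {start}
    while queue:
        current = queue.popleft()
        if current in ICD_CODES:
            return ICD_CODES[current]
        # Walk upward along hypernymy edges, breadth first.
        for parent in HYPERNYMS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


def coverage(diagnoses, resolver):
    """Fraction of diagnosis strings for which resolver returns a code."""
    if not diagnoses:
        return 0.0
    return sum(resolver(d) is not None for d in diagnoses) / len(diagnoses)


diagnoses = [
    "type 2 diabetes",
    "T2DM",
    "poorly controlled type 2 diabetes",
    "unknown rash",
]
direct = coverage(diagnoses, ICD_CODES.get)   # exact string match only
graph = coverage(diagnoses, resolve_code)     # synonyms plus hypernym walk
print(f"direct mapping: {direct:.2f}, graph-based: {graph:.2f}")
```

On this toy input, direct mapping covers one of four diagnoses (0.25) while the graph-based lookup covers three (0.75). The breadth-first walk is one simple design choice: it returns the closest coded ancestor, so a narrower common term falls back to the most specific code available.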
