A Label Cleaning Method of ECG Data Based on Abnormality-Feature Patterns

Han Jingyu; Chen Wei; Zhao Jing; Lang Hang; Mao Yi

doi:10.7544/issn1000-1239.202220334

Abstract

Automatic detection of electrocardiogram (ECG) abnormalities is a typical multi-label classification problem, and training a classifier requires a large number of samples with high-quality abnormality labels. In practice, however, the abnormality labels in ECG datasets are often missing or incorrect, so cleaning weakly labelled data to obtain a dataset with all the correct abnormality labels is a pressing problem. Assuming that a small example dataset with complete and correct labels is available, we propose an abnormality-feature pattern (AFP) based method that automatically cleans weakly labelled ECG datasets. Cleaning proceeds in two stages: clustering-based rule construction and iteration-based label cleaning. In the first stage, Dirichlet process mixture model (DPMM) clustering identifies the different feature patterns associated with each abnormality label, from which we construct label inclusion rules, label exclusion rules, and a set of binary discriminators. In the second stage, we identify an initial set of relevant labels with the inclusion and exclusion rules, then iteratively apply the binary discriminators to add further relevant labels and remove irrelevant ones. The AFP method exploits the feature patterns shared by the example dataset and the weakly labelled dataset, drawing on both human knowledge and the correctly assigned labels in the weakly labelled data. Because incorrect labels are removed and missing labels are filled in progressively, the cleaning process stays reliable. Experiments on real and synthetic datasets demonstrate the effectiveness of the method.
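The second stage described in the abstract (rule-based initialisation followed by iterative refinement with binary discriminators) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `clean_labels`, the thresholds `hi` and `lo`, the choice to seed the relevant set with the weak labels, the choice to keep rule-excluded labels excluded, and the toy rules and feature names are all assumptions made for illustration. The first-stage DPMM clustering that would produce the rules and discriminators is not shown.

```python
from typing import Callable, Dict, Set

Features = Dict[str, float]
Rule = Callable[[Features], bool]
# A discriminator scores one label given the features and the current label set.
Scorer = Callable[[Features, Set[str]], float]


def clean_labels(
    x: Features,
    weak_labels: Set[str],
    include_rules: Dict[str, Rule],
    exclude_rules: Dict[str, Rule],
    scorers: Dict[str, Scorer],
    hi: float = 0.8,   # assumed threshold for adding a label
    lo: float = 0.2,   # assumed threshold for removing a label
    max_iter: int = 10,
) -> Set[str]:
    """Return a cleaned label set for one record (illustrative only)."""
    # Initial relevant set: weak labels plus labels whose inclusion rule fires,
    # minus labels whose exclusion rule fires (assumption: exclusion is final).
    excluded = {lab for lab, rule in exclude_rules.items() if rule(x)}
    relevant = {lab for lab, rule in include_rules.items() if rule(x)}
    relevant |= set(weak_labels)
    relevant -= excluded

    # Iterate until the label set stops changing or max_iter is reached.
    for _ in range(max_iter):
        updated = set(relevant)
        for lab, score in scorers.items():
            if lab in excluded:
                continue
            p = score(x, relevant)
            if p >= hi:
                updated.add(lab)
            elif p <= lo:
                updated.discard(lab)
        if updated == relevant:
            break
        relevant = updated
    return relevant


# Toy record: one wrong weak label, two missing labels.
# Feature names and cut-offs are placeholders, not taken from the paper.
record = {"heart_rate": 120.0, "pr_ms": 220.0}
cleaned = clean_labels(
    record,
    weak_labels={"bradycardia"},
    include_rules={"tachycardia": lambda f: f["heart_rate"] > 100},
    exclude_rules={"bradycardia": lambda f: f["heart_rate"] >= 60},
    scorers={"av_block": lambda f, rel: 0.9 if f["pr_ms"] > 200 else 0.1},
)
print(sorted(cleaned))
```

In this toy run the exclusion rule drops the incorrect weak label, the inclusion rule supplies one missing label, and the discriminator adds the other during the first iteration; the loop stops once a pass leaves the set unchanged.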
