A Label Cleaning Method of ECG Data Based on Abnormality-Feature Patterns

Han Jingyu; Chen Wei; Zhao Jing; Lang Hang; Mao Yi

doi:10.7544/issn1000-1239.202220334

Citation: Han Jingyu, Chen Wei, Zhao Jing, Lang Hang, Mao Yi. A Label Cleaning Method of ECG Data Based on Abnormality-Feature Patterns[J]. Journal of Computer Research and Development, 2023, 60(11): 2594-2610. DOI: 10.7544/issn1000-1239.202220334

School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023
Jiangsu Key Laboratory of Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing 210023

Funds: This work was supported by the National Natural Science Foundation of China (62002174).

Author Bio:
Han Jingyu: born in 1976. PhD, professor. Member of CCF. His main research interests include biomedical information processing, database systems, and machine learning

Chen Wei: born in 1995. Master candidate. His main research interests include biomedical information processing and machine learning

Zhao Jing: born in 1996. Master. Her main research interests include machine learning and database systems

Lang Hang: born in 1999. Master candidate. His main research interests include machine learning and bioinformatics

Mao Yi: born in 1985. PhD, lecturer. Her main research interests include biomedical information processing and machine learning

Received Date: April 24, 2022
Revised Date: December 08, 2022
Available Online: July 31, 2023

Abstract

Automatic detection of electrocardiogram (ECG) abnormalities is a typical multi-label classification problem, and model training relies heavily on sufficient samples with high-quality abnormality labels. Unfortunately, ECG datasets often carry partial and incorrect labels, so cleaning weakly-labelled datasets to recover all the correct abnormality labels has become a pressing concern. Assuming that a small example dataset with full and correct labels is available, we propose an abnormality-feature pattern (AFP) based method that automatically cleans weakly-labelled datasets and recovers the correct abnormality labels. Cleaning proceeds in two stages: clustering-based rule construction and iteration-based label cleaning. In the first stage, we construct a set of label inclusion and exclusion rules and a set of binary discriminators by exploiting the different abnormality-feature patterns, which are identified through Dirichlet process mixture model (DPMM) clustering. In the second stage, we first identify the relevant abnormalities according to the label inclusion and exclusion rules, and then refine them iteratively. The AFP method exploits the abnormality-feature patterns shared by the example dataset and the weakly-labelled dataset, and thus draws on both human intelligence and the correct label information in the weakly-labelled dataset. Furthermore, the method removes incorrect labels and fills in missing ones step by step over the iterations, which makes the cleaning process reliable. Experiments on real and synthetic datasets demonstrate the effectiveness of our method.
Keywords:
- electrocardiogram (ECG)
- multi-label classification
- abnormality labels
- abnormality-feature pattern (AFP)
- binary discriminator
- label cleaning
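
The idea of label inclusion and exclusion rules described in the abstract can be illustrated with a small toy sketch. It is not the authors' algorithm: the cluster assignments are taken as given (the paper obtains them through DPMM clustering), the binary discriminators and the iterative refinement are omitted, and the rule definitions, the `include_at` threshold, the function names, and the label names (`AF`, `PVC`, `LBBB`) are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import dist


def build_rules(examples, clusters, include_at=0.8):
    """Stage 1 (toy): derive per-cluster label rules from clean examples.

    examples: list of (feature_vector, set_of_labels) with trusted labels.
    clusters: one cluster id per example (assumed to be given).
    include_at: hypothetical frequency threshold for an inclusion rule.
    """
    members = defaultdict(list)
    for (x, labels), c in zip(examples, clusters):
        members[c].append((x, labels))
    all_labels = set().union(*(labels for _, labels in examples))

    rules = {}
    for c, items in members.items():
        n = len(items)
        dim = len(items[0][0])
        centroid = tuple(sum(x[d] for x, _ in items) / n for d in range(dim))
        # Fraction of the cluster's examples that carry each label.
        freq = {l: sum(l in labels for _, labels in items) / n
                for l in all_labels}
        rules[c] = {
            "centroid": centroid,
            # Labels seen in most members are assumed to belong here.
            "include": {l for l, f in freq.items() if f >= include_at},
            # Labels never seen in the cluster are assumed not to belong.
            "exclude": {l for l, f in freq.items() if f == 0.0},
        }
    return rules


def clean_labels(x, weak_labels, rules):
    """Stage 2 (toy): apply the rules of the nearest cluster once."""
    nearest = min(rules.values(), key=lambda r: dist(x, r["centroid"]))
    return (set(weak_labels) | nearest["include"]) - nearest["exclude"]


# Tiny made-up example set: two feature dimensions, two clusters.
examples = [
    ((0.0, 0.1), {"AF"}),
    ((0.1, 0.0), {"AF", "PVC"}),
    ((1.0, 1.0), {"LBBB"}),
    ((0.9, 1.1), {"LBBB"}),
]
clusters = [0, 0, 1, 1]
rules = build_rules(examples, clusters)

# A weakly-labelled sample near cluster 0 that wrongly carries LBBB
# and is missing AF.
print(sorted(clean_labels((0.05, 0.0), {"LBBB"}, rules)))  # ['AF']
```

In this toy version a label that appears in only some members of a cluster (here `PVC` in cluster 0) is neither added nor removed by the rules; in the method described above, such undecided labels would be the natural target of the binary discriminators and the iterative refinement.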

