Survey on Data Debugging

李晨阳; 马超红; 孟小峰

doi:10.7544/issn1000-1239.202550195

Abstract: The rapid development of artificial intelligence (AI) has had a profound impact on fields such as healthcare, bioinformatics, and financial services. The main paradigm of AI applications is to build machine learning models that explore rules and patterns in data, which are then used for reasoning and decision-making. The effectiveness and efficiency of an AI system depend on two key aspects. The first is the model aspect (model-centric), which includes enhancing the network structure, such as the transition from RNN to LSTM, and tuning model hyperparameters. The second is the data aspect (data-centric), which includes standardizing data formats, increasing data volume, and reducing data noise. Debugging AI systems has long focused primarily on optimizing models. However, the arrival of the digital era, represented by social networks and e-commerce, has produced vast and varied data, so that model-centric debugging can no longer meet the demands placed on AI systems. Consequently, both the research community and industry have shifted their attention from models to data to close this gap, and “data debugging” has emerged. Unlike model optimization, data debugging focuses on inspecting data: it seeks to understand how erroneous data affects downstream tasks at each stage of the machine learning pipeline, and then debugs the corresponding errors to improve model performance. Based on a comprehensive survey of work related to data debugging, we first propose a research framework for data debugging and, according to how data debugging methods interact with the machine learning pipeline, categorize existing methods into three classes: closed data debugging, immersive data debugging, and hybrid data debugging. We then give a detailed overview of related work in this field. Next, we empirically evaluate data debugging methods and summarize the datasets and evaluation metrics commonly used in this research area. Finally, we point out the challenges facing data debugging and directions for future work.
