    The Survey on Data Debug[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550195

    The Survey on Data Debug

    Artificial intelligence (AI) is booming and plays a crucial role in many fields (e.g., healthcare, bioinformatics, and financial services). The main paradigm of AI applications is to build machine learning models that extract insights and patterns from data, which are then used for reasoning and decision-making. The effectiveness and efficiency of AI systems hinge on two key facets. The first is the model aspect (model-centric), which involves improving the network structure, such as the transition from RNN to LSTM, and optimizing model hyperparameters. The second is the data aspect (data-centric), such as standardizing data formats, increasing data size, and reducing data noise. For a long time, debugging AI systems focused primarily on optimizing models (i.e., the first aspect). However, the rise of the digital era, characterized by activities such as social networking and e-commerce, has produced vast and varied datasets, and model-centric debugging alone falls short of the requirements of modern applications. Consequently, both the research and industry communities have shifted their focus from models to data, and "data debugging" has emerged to address this gap. Rather than tuning the model, data debugging emphasizes scrutinizing the data: understanding the impact of data errors on downstream tasks at each stage of the machine learning pipeline, and then fixing the corresponding errors to improve model performance. Based on a comprehensive survey of work related to data debugging, we first propose a research framework for data debugging. According to how data debugging methods interact with the machine learning pipeline, we categorize existing methods into closed debugging, immersive debugging, and hybrid debugging. We then give a detailed overview of related work in this field. Next, we empirically evaluate data debugging methods and summarize the datasets and evaluation metrics commonly used in this research area. Finally, we point out the challenges and future directions of data debugging.