Survey of Machine Learning-Based KPI Anomaly Detection on Internet-Based Services

尚书一; 李宏佳; 宋晨; 卢至彤; 王利明; 徐震

doi:10.7544/issn1000-1239.202330577

Abstract

Key performance indicator (KPI) anomaly detection is a foundational technology for artificial intelligence for IT operations (AIOps) of Internet-based services. To improve the efficiency and accuracy of KPI anomaly detection, machine learning-based approaches have become a research hotspot in both academia and industry in recent years. Based on a comprehensive analysis of prior work, we first present a technical framework of KPI anomaly detection for Internet-based services. Then, for three KPI types (univariate KPI, multivariate KPIs, and matrix-variate KPIs), we discuss the motivation behind the choice of machine learning models from the perspective of mining KPI dependency patterns in different domains (time domain, metric domain, and entity domain). Furthermore, guided by detection performance objectives, we describe in detail two families of techniques: accuracy-centric KPI anomaly detection, which focuses on improving the accuracy of detection models, and multi-objective balancing-centric KPI anomaly detection, which focuses on balancing theoretical performance against practical application objectives. Finally, we summarize the challenges facing machine learning-based KPI anomaly detection in five areas: KPI monitoring and pre-processing, model generality, model interpretability, anomaly alarm management, and the inherent limitations of the KPI anomaly detection task itself. We also point out the corresponding potential research directions.
