Loading [MathJax]/jax/output/SVG/jax.js
  • 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
高级检索

基于自适应k近邻的时间序列异常模式识别

王玲, 周南, 申鹏

王玲, 周南, 申鹏. 基于自适应k近邻的时间序列异常模式识别[J]. 计算机研究与发展, 2023, 60(1): 125-139. DOI: 10.7544/issn1000-1239.202111062
引用本文: 王玲, 周南, 申鹏. 基于自适应k近邻的时间序列异常模式识别[J]. 计算机研究与发展, 2023, 60(1): 125-139. DOI: 10.7544/issn1000-1239.202111062
Wang Ling, Zhou Nan, Shen Peng. Time Series Anomaly Pattern Recognition Based on Adaptive k Nearest Neighbor[J]. Journal of Computer Research and Development, 2023, 60(1): 125-139. DOI: 10.7544/issn1000-1239.202111062
Citation: Wang Ling, Zhou Nan, Shen Peng. Time Series Anomaly Pattern Recognition Based on Adaptive k Nearest Neighbor[J]. Journal of Computer Research and Development, 2023, 60(1): 125-139. DOI: 10.7544/issn1000-1239.202111062
王玲, 周南, 申鹏. 基于自适应k近邻的时间序列异常模式识别[J]. 计算机研究与发展, 2023, 60(1): 125-139. CSTR: 32373.14.issn1000-1239.202111062
引用本文: 王玲, 周南, 申鹏. 基于自适应k近邻的时间序列异常模式识别[J]. 计算机研究与发展, 2023, 60(1): 125-139. CSTR: 32373.14.issn1000-1239.202111062
Wang Ling, Zhou Nan, Shen Peng. Time Series Anomaly Pattern Recognition Based on Adaptive k Nearest Neighbor[J]. Journal of Computer Research and Development, 2023, 60(1): 125-139. CSTR: 32373.14.issn1000-1239.202111062
Citation: Wang Ling, Zhou Nan, Shen Peng. Time Series Anomaly Pattern Recognition Based on Adaptive k Nearest Neighbor[J]. Journal of Computer Research and Development, 2023, 60(1): 125-139. CSTR: 32373.14.issn1000-1239.202111062

基于自适应k近邻的时间序列异常模式识别

基金项目: 国家自然科学基金项目(62076025, 61572073)
详细信息
    作者简介:

    王玲: 1974年生.博士,教授.主要研究方向为数据挖掘和机器学习

    周南: 1997年生.硕士.主要研究方向为数据挖掘

    申鹏: 2000年生.硕士研究生.主要研究方向为数据挖掘

  • 中图分类号: TP391

Time Series Anomaly Pattern Recognition Based on Adaptive k Nearest Neighbor

Funds: This work was supported by the National Natural Science Foundation of China (62076025, 61572073).
  • 摘要:

    时间序列作为数据的典型代表,被广泛应用于许多研究领域. 时间序列异常模式代表了一种特殊情况的出现,在许多领域都具有重要意义. 现有的时间序列异常模式识别算法大多只是单纯检测异常子序列,忽略了异常子序列的类别区分问题,且许多参数都需要人为设置.为此提出了一种基于自适应k近邻的异常模式识别算法(anomaly pattern recognition algorithm based on adaptive k nearest neighbor, APAKN). 首先,确定各子序列的自适应k近邻值,引入自适应距离比计算子序列的相对密度,确定异常分数;然后提出一种基于最小方差的自适应阈值方法确定异常阈值,检测出所有异常子序列;最后,对异常子序列进行聚类,所得聚类中心即为具有不同变化趋势的异常模式. 整个算法过程在无需设置任何参数的情况下,不仅解决了密度不平衡问题,还精简了传统基于密度异常子序列检测算法的步骤,实现良好的异常模式识别效果. 在时间序列数据集合UCR的10个数据集上的实验结果表明, 提出算法在无需设置参数的情况下,在异常子序列检测和异常子序列聚类问题中都表现良好.

    Abstract:

    As a typical representative of data, time series is widely used in many research fields. The time series anomaly pattern represents the emergence of a special situation, and is of great significance in many fields. Most of the existing time series anomaly pattern recognition algorithms simply detect anomaly subsequences, ignoring the problem of distinguishing the types of anomaly subsequences, and many parameters need to be set manually. In this paper, an anomaly pattern recognition algorithm based on adaptive k nearest neighbor(APAKN) is proposed. Firstly, the adaptive neighbor value k of each subsequence is determined, and an adaptive distance ratio is introduced to calculate the relative density of the subsequence to determine the anomaly score. Then, an adaptive threshold method based on minimum variance is proposed to determine the anomaly threshold and detect all anomaly subsequences. Finally, the anomaly subsequences are clustered, and the obtained cluster centers are anomaly patterns with different changing trends. The whole algorithm process not only solves the density imbalance problem without setting any parameters, but also simplifies the steps of the traditional density-based anomaly subsequence detection algorithm to achieve a good anomaly pattern recognition effect. Experimental results on the 10 data sets of UCR show that the proposed algorithm performs well in detecting anomaly subsequences and clustering anomaly subsequences without setting parameters.

  • 计算机存储系统承载数据,是信息平台的核心基础设施. 近年来,全球数据规模爆发式增长,计算机存储系统面临着高速数据访问、海量数据存储以及存储服务质量保障的挑战. 同时,由于新型硬件(如NVMe SSD、持久内存、异构加速设备等)的发展与成熟,存储系统技术研究面临着诸多新的机遇.

    基于上述背景,为促进存储领域的技术交流,《计算机研究与发展》推出了本期存储专题. 本期专题收录了6篇论文,分别展示了新硬件环境下存储系统设计和大规模数据存储服务质量保障等存储领域关注热点的研究现状和最新研究成果,希望能为从事相关工作的读者提供借鉴和帮助.

    周小晖等作者的论文“基于融合学习的无监督多维时间序列异常检测”针对多维时间序列异常检测效果差的问题,提出了一种基于融合学习的无监督多维时间序列异常检测方法. 该方法同时对多维时间序列的数据局部特征和数据全局特征进行建模,并基于重构误差检测异常,提升了异常检测效果.

    刘扬等作者的论文“ZB+ -tree:一种 ZNS SSD 感知的新型索引结构”针对传统的 B+ -tree 索引结构不适配 ZNS SSD 的问题,提出了ZNS SSD感知的ZB+ -tree索引结构. 该索引结构通过将索引节点在常规Zone和顺序Zone分散存储,实现了运行时间和空间利用率指标的提升.

    屠要峰等作者的论文“UStore:面向新型硬件的统一存储系统”为适配 NVMe SSD、持久内存、异构加速设备等新型硬件的特性,提出了一种兼容多种存储介质的统一存储系统 UStore. 该存储系统包括与物理存储介质形态解耦的元数据设计、高效的数据管理机制和更新策略,充分发挥了存储硬件的特性和性能.

    杨勇鹏等作者的论文“一种 wandering B+ tree 问题解决方法”针对日志结构存储系统中B+ tree树结点异地更新会导致树结构递归更新的问题,提出 IBT B+ tree 的解决方法. 该方法将树结点逻辑索引和物理地址均存放在树结构中,同时引入 dirty 链表设计和非递归更新的 IBT B+ tree 下刷算法,实现在不引入额外开销的条件下解决wandering B+ tree的问题.

    文宇鸿等作者的论文“多租户固态盘服务质量保障技术综述”深入分析了多租户固态盘服务质量保障面临的性能干扰、性能不公平及总体性能损失问题,分类介绍了以保障性能隔离、性能公平、优化总体性能为目标的研究工作及技术演进方向,总结了多租户固态盘服务质量保障技术的研究现状并对未来研究方向进行了展望.

    胡浩等作者的论文“新型内存硬件环境中的事务管理系统综述”全面总结了新型硬件环境下的事务管理系统,阐述了当前基于新型硬件事务管理系统的技术路线,重点剖析了硬件事务内存和非易失性存储硬件下的事务管理系统的优势和不足,指明了新型硬件环境中事务管理系统潜在的发展方向以及面临的挑战.

    本专题所录用的6篇论文中,1篇论文重点关注云系统中多维时间序列的故障检测,3篇论文重点关注新硬件环境下的存储系统设计及索引结构设计,2篇论文对基于新型硬件的事务管理系统和多租户固态盘服务质量保障技术进行了综述. 由于专题篇幅有限等原因,本专题无法全面覆盖存储领域各方面的最新研究进展,不当之处请同行学者批评指正! 感谢各位作者、审稿专家和编辑部的全力支持和辛勤付出!

    舒继武 (清华大学)

    王意洁 (国防科技大学)

    2023年2月

  • 图  1   子序列分布示意图

    Figure  1.   Schematic diagram of subsequence distribution

    图  2   APAKN的总体工作流程

    Figure  2.   The overall workflow of the APAKN

    图  3   子序列在不同r值时的近邻情况

    Figure  3.   Neighbors of subsequences at different values of r

    图  4   密度不平衡数据

    Figure  4.   Density imbalance data

    图  5   异常子序列分布示意图

    Figure  5.   Schematic diagram of anomaly subsequence distribution

    图  6   UCR部分时间子序列及其异常得分

    Figure  6.   UCR partial time subsequences and anomaly scores

    图  7   50words数据集在不同类别下的异常子序列示例和异常模式

    Figure  7.   Examples of anomaly subsequences and anomaly patterns in different categories of the 50words dataset

    图  8   CBF数据集在不同类别下的异常子序列示例和异常模式

    Figure  8.   Examples of anomaly subsequences and anomaly patterns in different categories of the CBF dataset

    表  1   k = 4时子序列的k近邻域与反向k近邻域

    Table  1   k Near Neighbors Domain and Reverse k Near Neighbors Domain of Subsequences when k = 4

    子序列k近邻域反向k近邻域反向k近邻个数
    s1{s3,s2,s5,s4}0
    s2{s4,s3,s5,s6}{s1,s3,s4,s6}4
    s3{s2,s5,s4,s7}{s1,s2,s4,s5,s7}5
    s4{s2,s5,s6,s3}{s1,s2,s3,s5,s6,s7}6
    s5{s4,s7,s3,s6}{s1,s2,s3,s4,s6,s7}6
    s6{s4,s5,s2,s7}{s2,s4,s5,s7}4
    s7{s5,s6,s4,s3}{s3,s5,s6}3
    下载: 导出CSV

    表  2   s8, s11, s14相对于其第8近邻的自适应距离比

    Table  2   Adaptive Distance Ratio of s8, s11, s14 Relative to Its 8th Nearest Neighbor

    子序列sisi第8近邻s8is8i自适应k近邻值k8is8i的第k8i近邻自适应距离比
    s8s96s10adr(s8,s9)=ddtw(s8,s9)ddtw(s9,s10)=1.833
    s11s123s13adr(s11,s12)=ddtw(s11,s12)ddtw(s12,s13)=4.266
    s14s156s16adr(s14,s15)=ddtw(s14,s15)ddtw(s15,s16)=1.722
    下载: 导出CSV

    表  3   实验数据集

    Table  3   Experimental Data Sets

    数据集子序列
    长度
    类别
    个数
    正常类
    别号
    异常类
    别号
    子序列
    个数
    异常率/%
    50words2705024,10,4390520
    ECG5000140512,5500010
    FaceAll1311412,3,5225015
    Lighting7319753,4,614335
    Trace275412,3,420030
    Two_Patterns128431,2,4111413
    CBF128312,393018
    Synthetic_Control60621,3,4,5,660025
    StarLightCurves1024331,292367
    ElectricDevices96751,2,3,4166372
    下载: 导出CSV

    表  4   APAKN的km值和对比算法的最佳参数值

    Table  4   km Value of APAKN and the Optimal Parameter Values of Comparison Algorithms

    数据集APAKNDBADSFODLDE
    kmtoptkopthopt
    50words1825924
    ECG500039312930
    FaceAll2520915
    Lighting717151615
    Trace15111016
    Two_Patterns54271418
    CBF66331730
    Synthetic_Control24161110
    StarLightCurves74867859
    ElectricDevices1569115388
    下载: 导出CSV

    表  5   各算法在 10 个数据集上的精确率和召回率对比

    Table  5   Comparison of Accuracy and Recall of Each Algorithm on 10 Data Sets

    数据集APAKNDBADSFODLDE
    精确率召回率精确率召回率精确率召回率精确率召回率
    50words1.0001.0000.8450.7420.4621.0001.0000.673
    ECG50000.8960.8750.6630.6190.6501.0000.8520.642
    FaceAll0.9121.0000.7960.8420.8571.0000.7430.685
    Lighting71.0000.9170.8740.9030.5790.9170.8450.683
    Trace0.9231.0000.9111.0000.7270.6671.0000.967
    Two_Patterns0.9540.9910.8520.8000.4440.5330.8450.667
    CBF1.0001.0000.8600.9140.5581.0001.0000.772
    Synthetic_Control1.0001.0000.7570.6370.7351.0001.0000.700
    StarLightCurves0.9120.8960.9120.8450.7740.6850.8750.689
    ElectricDevices0.9230.8510.8450.7980.8550.7620.7960.744
    注:黑体数值表示最佳值.
    下载: 导出CSV

    表  6   各算法在 10 个数据集上的准确率和F值对比

    Table  6   Comparison of Accuracy and F Value of Each Algorithm on 10 Data Sets

    数据集APAKNDBADSFODLDE
    准确率F准确率F准确率F准确率F
    50words1.0001.0000.9140.7900.8510.6320.9010.805
    ECG50000.9960.8850.9230.6400.9730.7880.8650.732
    FaceAll0.9800.9530.8020.8180.9610.9230.9930.713
    Lighting70.9740.9570.8260.8880.7690.7100.8750.755
    Trace0.9710.9600.8710.9530.8000.6960.9630.983
    Two_Patterns1.0000.9720.8360.8250.8760.4850.8960.746
    CBF1.0001.0000.9310.8860.9400.7160.8750.871
    Synthetic_Control1.0001.0000.8600.6910.8800.8470.6980.823
    StarLightCurves0.8950.9040.8890.8770.8520.7270.8570.771
    ElectricDevices0.9240.8860.8520.8210.8240.8060.8690.769
    注:黑体数值表示最佳值.
    下载: 导出CSV

    表  7   各算法在 10 个数据集上的ARI和NMI对比

    Table  7   Comparison of ARI and NMI of Each Algorithm on 10 Data Sets

    数据集APAKNkmlShapeDDCKShape
    EARIENMIEARIENMIEARIENMIEARIENMI
    50words0.8140.8250.5620.5280.5240.6580.7520.635
    ECG50000.8270.7680.6620.6120.5860.6930.8250.698
    FaceAll1.0001.0000.8140.8020.7250.5470.4080.662
    Lighting70.8360.9050.6350.6200.4330.6230.6890.814
    Trace0.5220.7340.5860.5320.4250.3890.3680.647
    Two_Patterns1.0001.0000.6330.3960.3620.4120.4530.504
    CBF0.9250.8140.7550.8520.6150.5780.8470.689
    Synthetic_Control0.4570.6670.6430.6140.3030.3390.4780.713
    StarLightCurves0.9560.7560.7890.8120.5460.6320.8740.684
    ElectricDevices0.8570.8590.4780.5320.7220.5660.6470.684
    注:黑体数值表示最佳值.
    下载: 导出CSV
  • [1]

    O'Leary B, Reiners J J, Xu Xiaohong, et al. Identification and influence of spatio-temporal outliers in urban air quality measurements[J]. Science of The Total Environment, 2016, 573: 55−65

    [2]

    Qin Yu, Lou Yuansheng. Hydrological time series anomaly pattern detection based on isolation forest[C] //Proc of the 3rd Information Technology, Networking, Electronic and Automation Control Conf (ITNEC). Piscataway, NJ: IEEE, 2019: 1706−1710

    [3]

    Wang Jibin, Wang Ping, Wang Suping. Automated detection of atrial fibrillation in ECG signals based on wavelet packet transform and correlation function of random process[J]. Biomedical Signal Processing and Control, 2020, 55: 101662

    [4]

    Liang Haoran, Song Lei, Wang Jianxing, et al. Robust unsupervised anomaly detection via multi-time scale DCGANs with forgetting mechanism for industrial multivariate time series[J]. Neurocomputing, 2021, 423: 444−462

    [5]

    Wang Xing, Lin J, Patel N, et al. Exact variable-length anomaly detection algorithm for univariate and multivariate time series[J]. Data Mining and Knowledge Discovery, 2018, 32(6): 1806−1844 doi: 10.1007/s10618-018-0569-7

    [6] 苏卫星,朱云龙,刘芳,等. 时间序列异常点及突变点的检测算法[J]. 计算机研究与发展,2014,51(4):781−788

    Su Weixing, Zhu Yunlong, Liu Fang, et al. A detection algorithm of outlier and mutation points in time series[J]. Journal of Computer Research and Development, 2014, 51(4): 781−788 (in Chinese)

    [7] 刘芳,齐建鹏,于彦伟,等. 基于密度的Top-n局部异常点快速检测算法[J]. 自动化学报,2019,45(9):1756−1771 doi: 10.16383/j.aas.c180425

    Liu Fang, Qi Jianpeng, Yu Yanwei, et al. A fast algorithm for density-based Top-n local outlier detection[J]. Acta Automatica Sinica, 2019, 45(9): 1756−1771 (in Chinese) doi: 10.16383/j.aas.c180425

    [8]

    Hyndman R J, Wang E, Laptev N. Large-scale unusual time series detection[C] //Proc of the 15th IEEE Int Conf on Data Mining Workshop (ICDMW). Piscataway, NJ: IEEE, 2015: 1616−1619

    [9]

    Senin P, Lin J, Wang Xing, et al. Time series anomaly discovery with grammar-based compression[C] //Proc of the 18th Int Conf on Extending Database Technology(EDBT). New York: ACM, 2015: 481−492

    [10]

    Boniol P, Palpanas T. Series2graph: Graph-based subsequence anomaly detection for time series[J]. Proceedings of the VLDB Endowment, 2020, 13(12): 1821−1834 doi: 10.14778/3407790.3407792

    [11]

    Cheng Kaiwen, Chen Yietarng, Fang Wenhsien. An efficient subsequence search for video anomaly detection and localization[J]. Multimedia Tools and Applications, 2016, 75: 15101−15122

    [12]

    Hawkins D. Identification of outliers[M] //Monographs on Applied Probability and Statistics. Berlin: Springer, 1980: 51−73

    [13]

    Huo Wunjun, Wang Wei, Li Wen. Anomalydetect: An online distance-based anomaly detection algorithm[C] //Proc of the Int Conf on Web Services. Berlin: Springer, 2019: 63−79

    [14]

    Zhang Pengcheng, Xiao Yan, Zhu Yuelong, et al. A new symbolization and distance measure based anomaly mining approach for hydrological time series[J]. International Journal of Web Services Research, 2016, 13(3): 26−45 doi: 10.4018/IJWSR.2016070102

    [15]

    Zheng Dequan, Li Fenghuan, Zhao Tiejun. Self-adaptive statistical process control for anomaly detection in time series[J]. Expert Systems with Applications, 2016, 57: 324−336

    [16]

    Izakian H, Pedrycz W. Anomaly detection in time series data using a fuzzy c-means clustering[C] //Proc of the Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS). Piscataway, NJ: IEEE, 2013: 1513−1518

    [17]

    Medico R, Spina D, VandeGinste D, et al. Autoencoding density-based anomaly detection for signal integrity applications[C] //Proc of the 27th IEEE Conf on Electrical Performance of Electronic Packaging and Systems (EPEPS). Piscataway, NJ: IEEE, 2018: 47−49

    [18]

    Zhang Chunkai, Chen Yingyang, Yin Ao. Anomaly subsequence detection with dynamic local density for time series[G] //LNCS 11707: Proc of the 30th Int Conf on Database and Expert Systems Applications. Berlin: Springer, 2019: 291−305

    [19]

    Dunn J C. A fuzzy relative of the ISODATA process, and its use in detecting compact well-separated clusters[J]. Journal of Cybernetics, 1973, 3(3): 32−57 doi: 10.1080/01969727308546046

    [20]

    Pedrycz W, Oliveira V. A development of fuzzy encoding and decoding through fuzzy clustering[J]. IEEE Transactions on Instrumentation & Measurement, 2008, 57(4): 829−837

    [21]

    Genolini C, Ecochard R, Benghezal M, et al. kmlShape: An efficient method to cluster longitudinal data (time-series) according to their shapes[J]. PLoS ONE, 2016, 11(6): 1−24

    [22]

    Likas A, Vlassis N, Verbeek J J. The global k-means clustering algorithm[J]. Pattern Recognition, 2003, 36(2): 451−461 doi: 10.1016/S0031-3203(02)00060-2

    [23]

    Ma Ruizhe, Filali S B, Angryk R. Time series distance density cluster with statistical preprocessing[C] //Proc of the 20th Int Conf on Big Data Analytics and Knowledge Discovery. Berlin: Springer, 2018: 371−381

    [24]

    Paparrizos J, Gravano L. K-shape: Efficient and accurate clustering of time series[C] //Proc of the 42nd ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2015: 1855−1870

    [25]

    Breunig M M , Kriegel H P , Ng R T , et al. LOF: Identifying density-based local outliers[C] //Proc of the 28th ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2000: 93−104

    [26]

    Berndt D J, Clifford J. Using dynamic time warping to find patterns in time series[C] //Proc of the Workshop on Knowledge Discovery in Databases. New York: ACM, 1994: 229−248

    [27]

    Cover T, Hart P. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1): 21−27 doi: 10.1109/TIT.1967.1053964

    [28]

    Pukelsheim F. The three sigma rule[J]. The American Statistician, 1994, 48(2): 88−91

    [29]

    Pérez-Ortiz M, Durán-Rosal A M, Gutiérrez P A, et al. On the use of evolutionary time series analysis for segmenting paleoclimate data[J]. Neurocomputing, 2019, 326/327: 3−14

    [30]

    Dau H A, Bagnall A, Kamgar K, et al. The UCR time series archive[J]. IEEE/CAA Journal of Automatica Sinica, 2019, 6(6): 1293−1305 doi: 10.1109/JAS.2019.1911747

    [31]

    Qian Pengjiang, Chung Fulai, Wang Shitong, et al. Fast graph-based relaxed clustering for large data sets using minimal enclosing ball[J]. IEEE Transactions on Systems, Man, and Cybernetics, 2012, 42(3): 672−687 doi: 10.1109/TSMCB.2011.2172604

    [32]

    Wang Yangtao, Chen Lihui. K-MEAP: Multiple exemplars affinity propagation with specified K clusters[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(12): 2670−2682 doi: 10.1109/TNNLS.2015.2495268

图(8)  /  表(7)
计量
  • 文章访问数:  225
  • HTML全文浏览量:  39
  • PDF下载量:  149
  • 被引次数: 0
出版历程
  • 收稿日期:  2021-10-26
  • 修回日期:  2022-06-06
  • 网络出版日期:  2023-02-10
  • 刊出日期:  2022-12-31

目录

    /

    返回文章
    返回