ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue 4: 778-790. doi: 10.7544/issn1000-1239.2020.20190875

Special Issue: 2020 Special Topic on Data-Driven Networks


Unified Anomaly Detection for Syntactically Diverse Logs in Cloud Datacenter

Zhang Shenglin1, Li Dongwen1, Sun Yongqian1, Meng Weibin2,3,4, Zhang Yuzhe1, Zhang Yuzhi1, Liu Ying3,4, Pei Dan2,4   

  1. College of Software, Nankai University, Tianjin 300350; 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084; 3. Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084; 4. Beijing National Research Center for Information Science and Technology, Beijing 100084
  • Online:2020-04-01
  • Supported by: 
    This work was supported by the National Key Research and Development Plan of China (2018YFB0204304).

Abstract: Benefiting from the rapid development of natural language processing and machine learning methods, log-based automatic anomaly detection is becoming increasingly popular for the software and hardware systems in cloud datacenters. Current unsupervised learning methods require no labelled anomalies, but they still need a large number of normal logs and generally suffer from low accuracy. Current supervised learning methods are accurate but require substantial labelling effort: the syntax of logs generated by different software/hardware systems varies greatly, so for each type of log a supervised method needs sufficient anomaly labels to train the corresponding anomaly detection model. Meanwhile, different types of logs usually have the same or similar semantics when anomalies occur. In this paper, we propose LogMerge, which learns the semantic similarity among different types of logs and then transfers anomaly patterns across them, significantly reducing labelling effort. LogMerge employs a word embedding method to construct vectors for words and templates, and then uses a clustering technique to group templates by semantics, addressing the challenge that different types of logs differ in syntax. In addition, LogMerge combines CNN and LSTM to build an anomaly detection model, which not only effectively extracts the sequential features of logs but also minimizes the impact of noise in logs. Extensive experiments on publicly available datasets demonstrate that LogMerge achieves higher accuracy than current supervised and unsupervised learning methods. Moreover, LogMerge achieves high accuracy even when the target type of logs has few anomaly labels, which significantly reduces labelling effort.
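
The idea of building template vectors from word vectors and grouping templates by semantics can be illustrated with a minimal sketch. It is not the LogMerge implementation: the abstract does not name the embedding method or the clustering technique, so the toy vectors in `WORD_VECS`, the averaging in `template_vector`, the greedy grouping in `group_templates`, and the `threshold` value are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy word vectors; a real system would learn these from log text.
WORD_VECS = {
    "connection": [1.0, 0.0, 0.0],
    "link":       [0.9, 0.1, 0.0],
    "lost":       [0.0, 1.0, 0.0],
    "down":       [0.1, 0.9, 0.0],
    "login":      [0.0, 0.0, 1.0],
    "succeeded":  [0.0, 0.1, 0.9],
}
DIM = 3  # dimensionality of the toy vectors above


def template_vector(template):
    """Average the vectors of the known words in a template (an assumption)."""
    vecs = [WORD_VECS[w] for w in template.lower().split() if w in WORD_VECS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def group_templates(templates, threshold=0.8):
    """Greedy grouping: a template joins the first cluster whose
    representative vector is at least `threshold` similar to it."""
    clusters = []  # list of (representative vector, member templates)
    for t in templates:
        v = template_vector(t)
        for rep, members in clusters:
            if cosine(rep, v) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((v, [t]))
    return [members for _, members in clusters]


# Templates from two hypothetical log types with different wording.
templates = ["connection lost", "link down", "login succeeded"]
groups = group_templates(templates)
```

In this toy setup, "connection lost" and "link down" share no words yet fall into the same group because their averaged vectors coincide, while "login succeeded" forms its own group. That is the property that would let anomaly labels from one log type be reused for another.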

Key words: syslog, anomaly detection, cloud datacenter, word embedding, CNN, LSTM
