ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2021, Vol. 58, Issue (5): 1021-1034. doi: 10.7544/issn1000-1239.2021.20200912

Special Issue: 2021 Special Topic on Artificial Intelligence Security and Privacy Protection Technologies


A Malicious Code Static Detection Framework Based on Multi-Feature Ensemble Learning

Yang Wang, Gao Mingzhe, Jiang Ting   

  1. (School of Cyber Science and Engineering, Southeast University, Nanjing 211189) (Key Laboratory of Computer Network and Information Integration(Southeast University), Ministry of Education, Nanjing 211189) (Jiangsu Provincial Key Laboratory of Computer Network Technology (Southeast University), Nanjing 211189)
  • Online:2021-05-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (62072100).

Abstract: With the popularity of the Internet and the rapid development of 5G communication technology, threats to cyberspace are increasing; in particular, the volume of malware is growing exponentially and the number of variants within malware families is growing explosively. Traditional signature-based detection is too slow to handle the millions of new malware samples that emerge every day, while the false positive and false negative rates of general machine learning classifiers are too high. At the same time, packing, obfuscation, and other adversarial techniques further complicate detection. To address these problems, we propose a static malware detection framework based on multi-feature ensemble learning. We extract the non-PE (Portable Executable) structure feature, visible string feature, sink assembly code sequence feature, PE structure feature, and function call relationship feature from the malware, construct a model matched to each feature, and use the Bagging and Stacking ensemble algorithms to reduce the risk of overfitting. Finally, we adopt a weighted voting algorithm to further aggregate the outputs of the ensemble models. The experimental results show that the detection accuracy of the multi-feature, multi-model aggregation algorithm can reach 96.99%, which demonstrates that the method identifies malware better than other static detection methods and achieves a higher recognition rate for malware that uses packing or obfuscation techniques.
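
The final aggregation step can be illustrated with a minimal sketch of weighted soft voting. The abstract does not specify how the weights are chosen or whether votes are cast on labels or probabilities, so everything here is an assumption made for illustration: the function names, the feature-model names, the weights, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def weighted_score(probabilities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weight-normalized average of per-model malicious probabilities.

    `probabilities` maps a (hypothetical) feature-model name to that model's
    estimated probability that the sample is malicious. `weights` maps the
    same names to non-negative voting weights.
    """
    total_weight = sum(weights[name] for name in probabilities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * p for name, p in probabilities.items())
    return weighted_sum / total_weight


def weighted_vote(
    probabilities: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> bool:
    """Return True (malicious) when the weighted score reaches the threshold."""
    return weighted_score(probabilities, weights) >= threshold


# Hypothetical outputs of three per-feature ensemble models for one sample.
sample_probs = {"pe_structure": 0.9, "visible_strings": 0.4, "call_graph": 0.7}
# Hypothetical weights, e.g. derived from each model's validation accuracy.
model_weights = {"pe_structure": 2.0, "visible_strings": 1.0, "call_graph": 1.0}

print(weighted_vote(sample_probs, model_weights))
```

In this sketch a model with a larger weight pulls the aggregate score toward its own prediction, so one weak feature model (here the hypothetical string model) does not overturn agreement among the stronger ones.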

Key words: malicious code, multiple features, ensemble learning, policy voting, static detection
