ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (11): 2339-2348. doi: 10.7544/issn1000-1239.2019.20190393

Special topic: 2019 Special Issue on Cryptography and Intelligent Security

• Information Security •

A Method of Extracting Malware Features Based on Probabilistic Topic Model

Liu Yashu1,2, Wang Zhihai1, Hou Yueran3, Yan Hanbing4

  1(School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044); 2(School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044); 3(Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876); 4(National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029) (ly_s8020@163.com)
  • Published: 2019-11-12
  • Online: 2019-11-12
  • Supported by:
    the National Key Research and Development Program of China (2018YFB0803604, 2018YFB0804704) and the National Natural Science Foundation of China (U1736218, 61672086)

Abstract: In today's complex network environment, malicious code spreads rapidly through many channels, intrudes into user terminals and network devices, and illegally steals private user data, posing a serious security threat to networks and Internet users. Traditional detection methods struggle to detect unknown malicious code, and the diversity and sheer number of malware variants make the task harder still. We propose an unsupervised malware identification approach. By analyzing disassembled PE files, we define normalization rules for assembly instructions, then apply latent Dirichlet allocation (LDA) to obtain the latent “document-topic” and “topic-word” distributions of the instructions. The topic distribution of each sample serves as its feature vector, which yields a new malware detection framework. To remove the need to fix the number of topics in advance, we combine perplexity with a variable step size to evaluate candidate topic numbers quickly and determine the best one automatically. We also analyze the semantic interpretability of the “document-topic” and “topic-word” clusters, showing that the extracted features carry latent semantics. Experimental results show that the method discriminates malware as well as or better than other methods and accurately identifies new malware variants.

Key words: malware detection, latent Dirichlet allocation (LDA), probabilistic topic model, perplexity, Gibbs
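
The abstract does not spell out how perplexity and a variable step size are combined, so the following is only an illustrative sketch of one plausible reading: a coarse-to-fine search that scans candidate topic numbers with a large step, then repeatedly halves the step around the current best candidate. The function name `find_topic_count`, its parameters, and the toy perplexity curve are all hypothetical and are not taken from the paper. In practice, `perplexity(k)` would train an LDA model with `k` topics on the normalized instruction corpus and return its held-out perplexity.

```python
from typing import Callable, Dict, Tuple


def find_topic_count(
    perplexity: Callable[[int], float],
    k_min: int = 2,
    k_max: int = 200,
    initial_step: int = 32,
) -> Tuple[int, Dict[int, float]]:
    """Coarse-to-fine search for the topic count with the lowest perplexity.

    Assumption (not from the paper): perplexity is roughly unimodal in k,
    so narrowing the window around the current best candidate is safe.
    Returns the chosen k and every (k, perplexity) pair that was evaluated.
    """
    evaluated: Dict[int, float] = {}

    def score(k: int) -> float:
        # Each evaluation stands in for training one LDA model, so cache it.
        if k not in evaluated:
            evaluated[k] = perplexity(k)
        return evaluated[k]

    lo, hi, step = k_min, k_max, initial_step
    while True:
        candidates = list(range(lo, hi + 1, step))
        if candidates[-1] != hi:
            candidates.append(hi)  # always include the upper bound
        best = min(candidates, key=score)
        if step == 1:
            return best, evaluated
        # Shrink the window to one step on either side of the best candidate,
        # then halve the step for the next, finer pass.
        lo = max(k_min, best - step)
        hi = min(k_max, best + step)
        step = max(1, step // 2)


if __name__ == "__main__":
    # Toy stand-in for a real perplexity curve, with its minimum at k = 37.
    def toy_perplexity(k: int) -> float:
        return (k - 37) ** 2 + 100.0

    best_k, evaluated = find_topic_count(toy_perplexity)
    print(best_k, len(evaluated))
```

The point of the variable step is cost: each evaluation is a full model fit, and this search touches only a small fraction of the candidates between `k_min` and `k_max` instead of all of them. If the real perplexity curve is noisy or has several local minima, the narrowing step can miss the global minimum, so a wider window or repeated runs would be needed.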
