A Method of Extracting Malware Features Based on Probabilistic Topic Model

刘亚姝; 王志海; 侯跃然; 严寒冰

doi:10.7544/issn1000-1239.2019.20190393

Abstract

In today's complex network environment, malicious code spreads rapidly through many channels, intrudes into user terminals and network devices, and illegally steals users' private data, posing a serious security threat to networks and Internet users. Traditional detection methods struggle to detect unknown malicious code, and the diversity and sheer number of malware variants make the detection of unknown malware harder still. We propose an unsupervised malware identification method. By analyzing disassembled PE files, we define standardization rules for assembly instructions, then apply latent Dirichlet allocation (LDA) to obtain the latent “document-topic” and “topic-word” distributions in the assembly instructions. The topic distributions are used as sample features, which yields a new malware detection framework. Combining perplexity with a varying step size, we give a fast method for evaluating and automatically determining the optimal number of topics, which removes the need to specify the topic number in advance in the LDA model. We also analyze the semantic interpretability of the “document-topic” and “topic-word” clustering results, showing that the sample features obtained by the method carry latent semantics. Experimental results show that the method achieves comparable or better malware discrimination than other methods and accurately identifies new malware variants.
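The abstract names two steps that a short sketch may make more concrete: standardizing assembly instructions into tokens, and choosing the topic number from perplexity with a changing step size. The Python below is only an illustration under stated assumptions, not the authors' implementation. The normalization rules (`reg`, `mem`, `imm`, `addr` placeholders), the function names, and the coarse-to-fine halving schedule are all hypothetical, since the abstract does not specify them. The search also assumes that perplexity is roughly unimodal in the topic number.

```python
import re
from typing import Callable, Dict

# Hypothetical normalization rules; the paper's actual rules are not given
# in the abstract. The register list is deliberately incomplete.
_MEMORY = re.compile(r"\[[^\]]*\]")
_LABEL = re.compile(r"\b(?:sub|loc)_[0-9a-f]+\b")
_REGISTERS = re.compile(r"\b(?:e?[abcd]x|[abcd][lh]|e?[sd]i|e?[sb]p)\b")
_IMMEDIATE = re.compile(r"\b(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)\b")


def normalize_instruction(line: str) -> str:
    """Map one disassembled instruction to a single token, e.g. 'mov_reg_mem'."""
    text = line.lower().split(";", 1)[0].strip()  # drop trailing comment
    if not text:
        return ""
    pieces = text.split(None, 1)
    mnemonic = pieces[0]
    operands = pieces[1] if len(pieces) > 1 else ""
    # Replace concrete operands with operand-class placeholders.
    operands = _MEMORY.sub("mem", operands)
    operands = _LABEL.sub("addr", operands)
    operands = _REGISTERS.sub("reg", operands)
    operands = _IMMEDIATE.sub("imm", operands)
    parts = [p.strip() for p in operands.split(",") if p.strip()]
    return "_".join([mnemonic] + parts)


def select_topic_count(perplexity: Callable[[int], float],
                       k_min: int, k_max: int, step: int) -> int:
    """Coarse-to-fine search for the topic count with the lowest perplexity.

    `perplexity(k)` is expected to fit a topic model with k topics and
    return its perplexity on held-out documents (lower is better).
    """
    cache: Dict[int, float] = {}

    def score(k: int) -> float:
        if k not in cache:
            cache[k] = perplexity(k)  # one model fit per distinct k
        return cache[k]

    lo, hi = k_min, k_max
    while True:
        candidates = list(range(lo, hi + 1, step))
        if candidates[-1] != hi:
            candidates.append(hi)  # always evaluate the upper bound
        best = min(candidates, key=score)
        if step == 1:
            return best
        # Narrow the window around the best candidate and halve the step.
        lo, hi = max(k_min, best - step), min(k_max, best + step)
        step = max(1, step // 2)
```

In practice the `perplexity` callable could wrap any LDA implementation, for example one that fits scikit-learn's `LatentDirichletAllocation` with `n_components=k` and returns its `perplexity` on a held-out document-term matrix. The cache keeps the number of model fits well below an exhaustive scan of every value between `k_min` and `k_max`.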
