ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2019, Vol. 56, Issue 11: 2339-2348. doi: 10.7544/issn1000-1239.2019.20190393

Special Issue: 2019 Special Topic on Cryptography and Intelligent Security Research

A Method of Extracting Malware Features Based on Probabilistic Topic Model

Liu Yashu1,2, Wang Zhihai1, Hou Yueran3, Yan Hanbing4   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  2. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044
  3. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876
  4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029
  • Online: 2019-11-12

Abstract: In today's complex network environment, malicious code spreads quickly through many channels, illegally occupying user terminals and network equipment and stealing private data. Malware therefore poses a serious security threat to networks and Internet users. Traditional methods cannot detect unknown malicious code and are challenged by the diversity and sheer number of malware variants. We propose an unsupervised malware identification approach that derives a standardization rule for assembly instructions by analyzing the content of decompiled PE files. By introducing latent Dirichlet allocation (LDA), our method extracts the latent “document-topic” and “topic-word” probability distributions from the samples. The topic probability distributions serve as the features of each sample, which is a new way of representing malware features. We then propose a new malware detection framework for training the model and testing samples. In addition, our method removes the need to specify the number of topics in the LDA model beforehand: it uses perplexity evaluated at different step sizes to estimate the best number of “topics” quickly and automatically. Finally, we analyze the semantics of the “document-topic” and “topic-word” clustering results in terms of assembly instructions, which explains the latent semantics of the features obtained by our method. Experimental results show that these features are more discriminative and give better classification results than other methods, while accurately identifying new malware variants.
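
The pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. The abstract does not give the standardization rule, the sampler settings, or the step-size search, so everything below is an assumption made for illustration: `normalize` is a hypothetical rule that keeps the opcode and abstracts each operand to a coarse type, `lda_gibbs` is a textbook collapsed Gibbs sampler for LDA, and the topic number is chosen by a plain scan over a few candidates using perplexity on the same toy documents (a held-out set would be the more careful choice).

```python
import math
import random
import re

REG = r"e?(ax|bx|cx|dx|si|di|sp|bp)|[abcd][lh]"   # small x86 register subset
IMM = r"0x[0-9a-f]+|[0-9a-f]+h|\d+"               # hex or decimal literal

def normalize(instruction):
    """Hypothetical rule: keep the opcode, replace operands by a coarse type."""
    parts = instruction.lower().replace(",", " ").split()
    kinds = []
    for op in parts[1:]:
        if "[" in op:
            kinds.append("mem")
        elif re.fullmatch(REG, op):      # checked before IMM so "ah" stays a register
            kinds.append("reg")
        elif re.fullmatch(IMM, op):
            kinds.append("imm")
        else:
            kinds.append("sym")          # labels, function names, anything else
    return "_".join([parts[0]] + kinds)

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling; returns (theta, phi, index)."""
    rng = random.Random(seed)
    index = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_words = len(index)
    ndk = [[0] * n_topics for _ in docs]              # document-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                               # tokens per topic

    def bump(d, k, v, delta):
        ndk[d][k] += delta
        nkw[k][v] += delta
        nk[k] += delta

    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            bump(d, z[d][i], index[w], +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = index[w]
                bump(d, z[d][i], v, -1)               # remove the current token
                weights = [(ndk[d][t] + alpha) * (nkw[t][v] + beta)
                           / (nk[t] + n_words * beta) for t in range(n_topics)]
                z[d][i] = rng.choices(range(n_topics), weights)[0]
                bump(d, z[d][i], v, +1)               # add it back under the new topic

    theta = [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
              for t in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(nkw[t][v] + beta) / (nk[t] + n_words * beta)
            for v in range(n_words)] for t in range(n_topics)]
    return theta, phi, index

def perplexity(docs, theta, phi, index):
    """exp(-average log-likelihood per token) of docs under the model."""
    log_lik, count = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            v = index[w]
            log_lik += math.log(sum(theta[d][t] * phi[t][v] for t in range(len(phi))))
            count += 1
    return math.exp(-log_lik / count)

# Toy "samples": each is a list of disassembled instructions (made-up data).
samples = [
    ["mov eax, [ebp+8]", "push eax", "call sub_401000", "add esp, 4"],
    ["xor eax, eax", "xor ecx, ecx", "inc ecx", "cmp ecx, 0x10", "jnz loc_401020"],
    ["push ebp", "mov ebp, esp", "push 0x10", "call sub_401000", "pop ebp"],
    ["xor edx, edx", "inc edx", "cmp edx, 20h", "jnz loc_401040", "inc ecx"],
]
docs = [[normalize(ins) for ins in sample] for sample in samples]
scores = {k: perplexity(docs, *lda_gibbs(docs, k)) for k in (2, 3, 4)}
best_k = min(scores, key=scores.get)
theta, phi, index = lda_gibbs(docs, best_k)   # theta[d] is the feature vector of sample d
```

Each row of `theta` is a document-topic distribution and can be passed to any downstream classifier as the feature vector of one sample, while each row of `phi` shows which normalized instructions dominate a topic, which is the kind of inspection the semantic analysis in the abstract refers to.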

Key words: malware detection, latent Dirichlet allocation (LDA), probabilistic topic model, perplexity, Gibbs
