A Method of Extracting Malware Features Based on Probabilistic Topic Model
Abstract: In today's complex network environment, malicious code spreads rapidly through many channels, intrudes into user terminals and network devices, and illegally steals private user data, posing a serious security threat to networks and Internet users. Traditional detection methods struggle to detect unknown malicious code, and the diversity and sheer number of malware variants make that task harder still. We propose an unsupervised malware identification method. By analyzing disassembled PE files, it derives standardization rules for assembly instructions, then applies latent Dirichlet allocation (LDA) to obtain the latent "document-topic" and "topic-word" distributions in the assembly instructions. The topic distributions serve as the features of each malware sample, which yields a new malware detection framework. Using perplexity together with a varying step size, the method evaluates candidate values quickly and determines the optimal number of topics automatically, removing the need to specify the topic number of the LDA model beforehand. We also analyze the semantic interpretability of the "document-topic" and "topic-word" clustering results, showing that the extracted sample features carry latent semantics. Experimental results show that, compared with other methods, the proposed method has a comparable or better ability to discriminate malicious code and can accurately identify new malware variants.
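Two steps named in the abstract can be sketched in a few lines of Python: standardizing assembly instructions into tokens, and choosing the topic number by perplexity with a varying step. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not list the actual standardization rules, so the `REG`/`MEM`/`IMM` placeholders, the function names `normalize_instruction` and `choose_topic_count`, and the step-halving schedule are all hypothetical. The search further assumes that perplexity is roughly unimodal in the topic number.

```python
import re
from typing import Callable, Dict

# Hypothetical standardization rules (assumptions, not the paper's rules):
# operands are collapsed to REG / MEM / IMM so that variants of the same
# instruction pattern map to the same "word".
_SIZE_PTR = re.compile(r"\b(?:byte|word|dword|qword)\s+ptr\s*")
_MEMORY = re.compile(r"\[[^\]]*\]")
_REGISTER = re.compile(r"\b(?:e?[abcd]x|[abcd][lh]|e?[sd]i|e?[sb]p)\b")
_IMMEDIATE = re.compile(r"\b(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)\b")


def normalize_instruction(line: str) -> str:
    """Map one disassembled x86 instruction to a coarse token."""
    line = line.split(";", 1)[0].strip().lower()  # drop trailing comment
    if not line:
        return ""
    mnemonic, _, operands = line.partition(" ")
    operands = _SIZE_PTR.sub("", operands)     # "dword ptr [x]" -> "[x]"
    operands = _MEMORY.sub("MEM", operands)    # before registers: [ebp+8] is memory
    operands = _REGISTER.sub("REG", operands)
    operands = _IMMEDIATE.sub("IMM", operands)
    parts = [p.strip() for p in operands.split(",") if p.strip()]
    return "_".join([mnemonic] + parts)


def choose_topic_count(perplexity: Callable[[int], float],
                       low: int, high: int, coarse_step: int) -> int:
    """Coarse-to-fine search for the topic number with the lowest perplexity.

    `perplexity(k)` is expected to fit a k-topic model and return its
    held-out perplexity; results are cached because each call is costly.
    """
    cache: Dict[int, float] = {}

    def score(k: int) -> float:
        if k not in cache:
            cache[k] = perplexity(k)
        return cache[k]

    step = coarse_step
    best = min(range(low, high + 1, step), key=score)
    while step > 1:
        # Narrow the window to one old step on each side of the current
        # best, halve the step, and rescan (always keeping the current best).
        lo, hi = max(low, best - step), min(high, best + step)
        step = max(1, step // 2)
        candidates = sorted(set(range(lo, hi + 1, step)) | {best})
        best = min(candidates, key=score)
    return best
```

Each sample would then become a bag of normalized tokens, and `perplexity` would wrap whatever LDA implementation is in use, for example the `perplexity` method of scikit-learn's `LatentDirichletAllocation` evaluated on held-out token counts. Compared with scanning every topic number, the coarse-to-fine schedule fits far fewer models, at the cost of relying on the unimodality assumption above.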