Binary Code Modularization Method Based on Graph Embedding

Sun Huaqi; Kang Fei; Shu Hui; Huang Yuyao; Bu Wenjuan

doi:10.7544/issn1000-1239.202330337

Abstract

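Abstract (translated from the Chinese original): Software reverse analysis is a core supporting technology for research in cyberspace security and is widely used in software vulnerability analysis, malicious code behavior analysis, and related tasks. Module partitioning of binary code is a key problem in this field: reasonably dividing complex or large-scale software into a number of modules helps analysts understand the software's structure and functionality quickly and accurately, and improves analysis efficiency. A common approach treats the functions in the code and their call relationships as a complex network and clusters the functions with community discovery algorithms to obtain the modules. Such methods usually consider only the connections between nodes, ignore node attribute information and the similarity between nodes, and are sensitive to noise and outliers. To address these problems, we propose a graph embedding based binary code modularization (GEBCM) method. The method first abstracts the software system as an attributed graph, and then uses a graph embedding clustering method with attention and ranking mechanisms to embed and cluster the function nodes. The clustering groups the binary into independent parts with more complete functionality, revealing the separated module semantics within complex program structures. An experimental evaluation on 2 datasets verifies the effectiveness of the proposed GEBCM method. The results show that GEBCM improves the F1 score by 10.2% on average compared with other binary modularization work. In addition, in evaluation experiments on malicious samples, GEBCM effectively partitions malicious code into modules, demonstrating good extensibility.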

Abstract: Reverse analysis is a key technology in cyber security. It helps analysts understand software behavior and detect vulnerabilities so that attacks can be prevented effectively. As software grows in scale and complexity, some studies break software down into modules for rapid analysis, applying community discovery algorithms to structural and functional information. However, these studies treat software as a social network of simple nodes and edges and thereby discard valuable attribute information. We observe that different features contribute differently to a program's modular structure, and that these contributions vary across samples. Inspired by the application of graph embedding technologies in program analysis, we propose a binary code modularization method called GEBCM. The method transforms an executable program into an attributed graph and employs a graph embedding clustering method with attention and ranking mechanisms to embed and cluster function nodes. The resulting clusters group the binary into independent parts with more complete functions, revealing the semantic information of complex program structures. Experimental results show that GEBCM recovers the original modular layout better than other modularization tools, with an F1 score that is 10.2% higher on average. Additionally, in the new task of malware decomposition, GEBCM also exhibits better accuracy.
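The abstract reports results as an F1 score but does not state here how that score is computed for a module partition. As a minimal sketch, assuming a pairwise definition (two functions count as a positive pair when they share a module), the comparison between a predicted partition and the original layout could look like the following. The function names, the toy module contents, and the pairwise definition itself are illustrative assumptions, not the paper's evaluation code.

```python
from itertools import combinations


def same_module_pairs(modules):
    """Return the set of unordered function pairs that share a module."""
    pairs = set()
    for members in modules:
        # Sort so each unordered pair has one canonical tuple form.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, truth):
    """Pairwise F1 between two partitions (assumed metric, see lead-in)."""
    pred_pairs = same_module_pairs(predicted)
    true_pairs = same_module_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth layout: a parsing module and a networking module.
truth = [
    {"parse_header", "parse_body", "read_file"},
    {"open_socket", "send_packet", "encrypt"},
]
# Hypothetical prediction that misplaces read_file into the second module.
predicted = [
    {"parse_header", "parse_body"},
    {"read_file", "open_socket", "send_packet", "encrypt"},
]

score = pairwise_f1(predicted, truth)
print(f"pairwise F1 = {score:.3f}")
```

In this toy case one misplaced function lowers both precision and recall, which is why a partition-level F1 is a reasonable way to summarize how closely a clustering recovers the original modules.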
