Citation: | Sun Huaqi, Kang Fei, Shu Hui, Huang Yuyao, Bu Wenjuan. Binary Code Modularization Method Based on Graph Embedding[J]. Journal of Computer Research and Development, 2024, 61(9): 2275-2289. DOI: 10.7544/issn1000-1239.202330337 |
Reverse analysis as a key technology plays a vital role in cyber security. It helps analysts gain insight into the behavior of software and vulnerabilities detection, in order to effectively prevent attacks. The growing software scale and complexity urge some research to break down software into modules for rapid analysis via structural and functional information using community discovery algorithms. However, these studies just regard software as a social network consisting of simple nodes and edges missing valuable attribute information. We notice that the contribution of different features to the modular structure of the program is different and varies from samples. Inspired by the innovative application of graph embedding technologies in program analysis, we propose a binary code modularization method called GEBCM. The method transforms an executable program into an attributed graph, and employs graph embedding clustering methods with attention mechanism and ranking mechanism to embed representations and cluster function nodes. The result clusters group binaries into independent parts with more complete functions, revealing the semantic information of complex program structures. Experimental results show that GEBCM outperforms other modularization tools by revealing the original modular layout with an average of 10.2% higher F1 score. Additionally, in the new task of malware decomposition, GEBCM also exhibits better accuracy.
[1] |
Statista. Number of available applications in the Google play store from December 2009 to March 2023 [EB/OL]. [2023-04-17]. https: //www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store
|
[2] |
Mitre. Common vulnerabilities and exposures [EB/OL]. [2023-03-04].https://cve.mitre.org
|
[3] |
AV-TEST. Malware [EB/OL]. [2023-04-17]. https://www.av-test.org/en/statistics/malware
|
[4] |
Schwartz E J, Avgerinos T, Brumley D. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask) [C] //Proc of the 2010 IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2010: 317−331
|
[5] |
D’Elia D C, Coppa E, Nicchi S, et al. SoK: Using dynamic binary instrumentation for security (and how you may get caught red handed) [C]//Proc of the 2019 ACM Asia Conf on Computer and Communications Security. New York: ACM, 2019: 15−27
|
[6] |
Chen Jinyin, Hu Keke, Yu Yue, et al. Software visualization and deep transfer learning for effective software defect prediction [C] //Proc of the 42nd ACM/IEEE Int Conf on Software Engineering. Los Alamitos, CA: IEEE Computer Society, 2020: 578−589
|
[7] |
Yasunaga M, Liang P. Graph-based, self-supervised program repair from diagnostic feedback [C] //Proc of the 37th Int Conf on Machine Learning. New York: ACM, 2020: 10799−10808
|
[8] |
Gibert D, Mateu C, Planes J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges[J]. Journal of Network and Computer Applications, 2020, 153: 102526 doi: 10.1016/j.jnca.2019.102526
|
[9] |
Bavota G, Oliveto R, Gethers M, et al. Methodbook: Recommending move method refactorings via relational topic models[J]. IEEE Transactions on Software Engineering, 2014, 40(7): 671−694 doi: 10.1109/TSE.2013.60
|
[10] |
Sarhan Q I, Ahmed B S, Bures M, et al. Software module clustering: An in-depth literature analysis[J]. IEEE Transactions on Software Engineering, 2022, 48(6): 1905−1928 doi: 10.1109/TSE.2020.3042553
|
[11] |
Zhou Yang, Cheng Hong, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718−729 doi: 10.14778/1687627.1687709
|
[12] |
Karande V, Chandra S, Lin Z, et al. BCD: Decomposing binary code into components using graph-based clustering [C] //Proc of the 2018 on Asia Conf on Computer and Communications Security. New York: ACM, 2018: 393−398
|
[13] |
Yang Can, Xu Zhengzi, Chen Hongxu, et al. ModX: Binary level partially imported third-party library detection via program modularization and semantic matching [C] //Proc of the 44th Int Conf on Software Engineering. New York: ACM, 2022: 1393−1405
|
[14] |
Xia Hong, Zhang Yongkang, Chen Yanping, et al. Software module clustering using the hierarchical clustering combination method [C] //Proc of the 7th Int Conf on Cloud Computing and Big Data Analytics. Piscataway, NJ: IEEE, 2022: 155−160
|
[15] |
Papachristou M. Software clusterings with vector semantics and the call graph [C] //Proc of the 27th ACM Joint Meeting on European Software Engineering Conf and Symp on the Foundations of Software Engineering (ESEC/FSE). New York: ACM, 2019: 1184−1186
|
[16] |
Pan Weifeng, Song Beibei, Li Kangshun, et al. Identifying key classes in object-oriented software using generalized k-core decomposition[J]. Future Generations Computer Systems, 2018, 81: 188−202 doi: 10.1016/j.future.2017.10.006
|
[17] |
Newman M E J. Fast algorithm for detecting community structure in network [J]. arXiv preprint, arXiv: cond-mat/0309508, 2003
|
[18] |
Newman M E J, Girvan M. Finding and evaluating community structure in networks [J]. arXiv preprint, arXiv: cond-mat/ 0308217, 2003
|
[19] |
Candela I, Bavota G, Russo B, et al. Using cohesion and coupling for software remodularization: Is it enough[J]. ACM Transactions on Software Engineering & Methodology, 2016, 25(24): 1−28
|
[20] |
Huang Yuyao, Shu Hui, Kang Fei. DeMal: Module decomposition of malware based on community discovery[J]. Computers & Security, 2022, 117: 102680
|
[21] |
Lian Wen, Kirk D, Dromey R G. Software systems as complex networks [C] //Proc of the 6th IEEE Int Conf on Cognitive Informatics. Piscataway, NJ: IEEE, 2007: 106−115
|
[22] |
Li Hao, Wang Tian, Xu Xinxin, et al. Modeling software systems as complex networks: Analysis and their applications [J]. Mathematical Problems in Engineering, 2020, 2020: 1−7
|
[23] |
Nielson H R. Principles of Program Analysis [M]. Berlin: Springer, 1999
|
[24] |
Hey-rays. IDA Pro-A powerful disassembler and a versatile debugger [EB/OL]. [2023-04-17].https://www.hex-rays.com/ida-pro
|
[25] |
Radare2. Libre reversing framework for Unix geeks [EB/OL]. [2023-04-17].https://github.com/radareorg/radare2
|
[26] |
NSA. A software reverse engineering suite of tools developed by, NSA’s research directorate in support of the cybersecurity mission [EB/OL]. [2023-04-17].https://github.com/NationalSecurityAgency/ghidra/releases
|
[27] |
Henderson A, Prakash A, Yan L, et al. Make it work, make it right, make it fast: Building a platform-neutral whole-system dynamic binary analysis platform [C] //Proc of the 2014 Int Symp on Software Testing and Analysis. Piscataway, NJ: IEEE, 2014: 248−258
|
[28] |
Song D, Brumley D, Yin Heng, et al. BitBlaze: A new approach to computer security via binary analysis [C] //Proc of the 4th Int Conf on Information Systems Security. Berlin: Springer, 2008: 1−25
|
[29] |
Yan S, Wang Ruoyu, Salls C, et al. SOK: (State of) The art of war: Offensive techniques in binary analysis [C] //Proc of the 37th IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2016: 138−157
|
[30] |
Cai Hongyun, Zheng V W, Chang C C. A comprehensive survey of graph embedding: Problems, techniques and applications[J]. IEEE Transactions on Knowledge & Data Engineering, 2018, 30(9): 1616−1637
|
[31] |
Wang Daixin, Cui Peng, Zhu Wenwu. Structural deep network embedding [C] //Proc of the 22nd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2016: 1225−1234
|
[32] |
Cao Shaosheng, Lu Wei, Xu Qiongkai. Deep neural networks for learning graph representations [C] //Proc of the 30th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2016: 1145−1152
|
[33] |
Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks [J]. arXiv preprint, arXiv: 1609.02907, 2016
|
[34] |
Velikovi P, Cucurull G, Casanova A, et al. Graph attention networks [J]. arXiv preprint, arXiv: 1710.10903, 2017
|
[35] |
Wang Chun, Pan Shirui, Hu Ruiqi, et al. Attributed graph clustering: A deep attentional embedding approach [C] //Proc of the 28th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2019: 3670−3676
|
[36] |
Intel. Pin-A dynamic binary instrumentation tool [EB/OL]. [2023-04-07].https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html
|
[37] |
Levine J R. Linkers and Loaders [M]. San Francisco, CA: Morgan Kaufmann, 1999
|
[38] |
Hu Jie, Li Shen, Samuel A, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011−2023 doi: 10.1109/TPAMI.2019.2913372
|
[39] |
Garey M R, Johnson D S. Computers and Intractability: A Guide to the Theory of NP-Completeness [M]. New York: W. H. Freeman and Company, 1979
|
[40] |
Github. Dexter-POSGrabber [EB/OL]. [2023-04-17].https://github.com/whobin/Dexter-POSGrabber
|
[41] |
Xie Junyuan, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis [C] //Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 478−487
|
[42] |
Xiang Shunhua, Nie Feiping, Zhang Changshui. Learning a Mahalanobis distance metric for data clustering and classification[J]. Pattern Recognition, 2008, 41(12): 3600−3612 doi: 10.1016/j.patcog.2008.05.018
|
[43] |
Rokon M, Islam R, Darki A, et al. SourceFinder: Finding malware source-code from publicly available repositories [C] //Proc of the 23rd Int Symp on Research in Attacks, Intrusions and Defenses (RAID 2020). Berkeley, CA: USENIX Association, 2020: 149−163
|
[44] |
Fahad A, Alshatri N, Tari Z, et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis[J]. IEEE Transactions on Emerging Topics in Computing, 2014, 2(3): 267−279 doi: 10.1109/TETC.2014.2330519
|
[45] |
Kuhn H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics, 2010, 52(1/2): 7−21
|
[46] |
Snoek J, Larochelle H, Adams R P. Practical Bayesian optimization of machine learning algorithms [C] //Proc of the 25th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2012: 2951−2959
|
[47] |
Xiao Xueli, Yan Ming, Basodi S, et al. Efficient hyperparameter optimization in deep learning using a variable length genetic algorithm [J]. arXiv preprint, arXiv: 2006.12703, 2020
|
[48] |
Maclaurin D, Duvenaud D, Adams R P. Gradient-based hyperparameter optimization through reversible learning [J]. arXiv preprint, arXiv: 1502.03492, 2015
|
[1] | Cao Yiran, Zhu Youwen, He Xingyu, Zhang Yue. Utility-Optimized Local Differential Privacy Set-Valued Data Frequency Estimation Mechanism[J]. Journal of Computer Research and Development, 2022, 59(10): 2261-2274. DOI: 10.7544/issn1000-1239.20220504 |
[2] | Hong Jinxin, Wu Yingjie, Cai Jianping, Sun Lan. Differentially Private High-Dimensional Binary Data Publication via Attribute Segmentation[J]. Journal of Computer Research and Development, 2022, 59(1): 182-196. DOI: 10.7544/issn1000-1239.20200701 |
[3] | Wu Wanqing, Zhao Yongxin, Wang Qiao, Di Chaofan. A Safe Storage and Release Method of Trajectory Data Satisfying Differential Privacy[J]. Journal of Computer Research and Development, 2021, 58(11): 2430-2443. DOI: 10.7544/issn1000-1239.2021.20210589 |
[4] | Zhang Yuxuan, Wei Jianghong, Li Ji, Liu Wenfen, Hu Xuexian. Graph Degree Histogram Publication Method with Node-Differential Privacy[J]. Journal of Computer Research and Development, 2019, 56(3): 508-520. DOI: 10.7544/issn1000-1239.2019.20170886 |
[5] | Zhu Weijun, You Qingguang, Yang Weidong, Zhou Qinglei. Trajectory Privacy Preserving Based on Statistical Differential Privacy[J]. Journal of Computer Research and Development, 2017, 54(12): 2825-2832. DOI: 10.7544/issn1000-1239.2017.20160647 |
[6] | Wu Yingjie, Zhang Liqun, Kang Jian, Wang Yilei. An Algorithm for Differential Privacy Streaming Data Adaptive Publication[J]. Journal of Computer Research and Development, 2017, 54(12): 2805-2817. DOI: 10.7544/issn1000-1239.2017.20160555 |
[7] | Wang Liang, Wang Weiping, Meng Dan. Privacy Preserving Data Publishing via Weighted Bayesian Networks[J]. Journal of Computer Research and Development, 2016, 53(10): 2343-2353. DOI: 10.7544/issn1000-1239.2016.20160465 |
[8] | Lu Guoqing, Zhang Xiaojian, Ding Liping, Li Yanfeng, Liao Xin. Frequent Sequential Pattern Mining under Differential Privacy[J]. Journal of Computer Research and Development, 2015, 52(12): 2789-2801. DOI: 10.7544/issn1000-1239.2015.20140516 |
[9] | Ouyang Jia, Yin Jian, Liu Shaopeng, Liu Yubao. An Effective Differential Privacy Transaction Data Publication Strategy[J]. Journal of Computer Research and Development, 2014, 51(10): 2195-2205. DOI: 10.7544/issn1000-1239.2014.20130824 |
[10] | Ni Weiwei, Chen Geng, Chong Zhihong, Wu Yingjie. Privacy-Preserving Data Publication for Clustering[J]. Journal of Computer Research and Development, 2012, 49(5): 1095-1104. |