Binary Code Modularization Method Based on Graph Embedding

Sun Huaqi; Kang Fei; Shu Hui; Huang Yuyao; Bu Wenjuan

doi:10.7544/issn1000-1239.202330337

Journal of Computer Research and Development > 2024 > 61(9): 2275-2289. > DOI: 10.7544/issn1000-1239.202330337 CSTR: 32373.14.issn1000-1239.202330337

Sun Huaqi, Kang Fei, Shu Hui, Huang Yuyao, Bu Wenjuan. Binary Code Modularization Method Based on Graph Embedding[J]. Journal of Computer Research and Development, 2024, 61(9): 2275-2289. DOI: 10.7544/issn1000-1239.202330337

Citation:

PDF (2590 KB)

Binary Code Modularization Method Based on Graph Embedding

PLA Strategic Support Force Information Engineering University, Zhengzhou 450001

More Information

Author Bio:
Sun Huaqi: born in 1994. Master. His main research interests include reverse engineering and program analysis

Kang Fei: born in 1972. PhD, professor. Her main research interests include cyberspace security, reverse engineering, and software protection

Shu Hui: born in 1974. PhD, professor, PhD supervisor. His main research interests include cyberspace security, program analysis, and reverse engineering

Huang Yuyao: born in 1996. PhD candidate. His main research interests include network protocol reverse-engineering and program analysis

Bu Wenjuan: born in 1984. PhD candidate. Her main research interests include cyberspace security and reverse engineering
Received Date: April 23, 2023
Revised Date: October 15, 2023
Available Online: April 27, 2024

Graphical Abstract

Abstract

Abstract

Reverse analysis as a key technology plays a vital role in cyber security. It helps analysts gain insight into the behavior of software and vulnerabilities detection, in order to effectively prevent attacks. The growing software scale and complexity urge some research to break down software into modules for rapid analysis via structural and functional information using community discovery algorithms. However, these studies just regard software as a social network consisting of simple nodes and edges missing valuable attribute information. We notice that the contribution of different features to the modular structure of the program is different and varies from samples. Inspired by the innovative application of graph embedding technologies in program analysis, we propose a binary code modularization method called GEBCM. The method transforms an executable program into an attributed graph, and employs graph embedding clustering methods with attention mechanism and ranking mechanism to embed representations and cluster function nodes. The result clusters group binaries into independent parts with more complete functions, revealing the semantic information of complex program structures. Experimental results show that GEBCM outperforms other modularization tools by revealing the original modular layout with an average of 10.2% higher F1 score. Additionally, in the new task of malware decomposition, GEBCM also exhibits better accuracy.
- clustering,
- graph embedding,
- program analysis,
- modularization,
- reverse engineering

FullText(HTML)

References (48)

References

[1]	Statista. Number of available applications in the Google play store from December 2009 to March 2023 [EB/OL]. [2023-04-17]. https: //www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store
[2]	Mitre. Common vulnerabilities and exposures [EB/OL]. [2023-03-04].https://cve.mitre.org
[3]	AV-TEST. Malware [EB/OL]. [2023-04-17]. https://www.av-test.org/en/statistics/malware
[4]	Schwartz E J, Avgerinos T, Brumley D. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask) [C] //Proc of the 2010 IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2010: 317−331
[5]	D’Elia D C, Coppa E, Nicchi S, et al. SoK: Using dynamic binary instrumentation for security (and how you may get caught red handed) [C]//Proc of the 2019 ACM Asia Conf on Computer and Communications Security. New York: ACM, 2019: 15−27
[6]	Chen Jinyin, Hu Keke, Yu Yue, et al. Software visualization and deep transfer learning for effective software defect prediction [C] //Proc of the 42nd ACM/IEEE Int Conf on Software Engineering. Los Alamitos, CA: IEEE Computer Society, 2020: 578−589
[7]	Yasunaga M, Liang P. Graph-based, self-supervised program repair from diagnostic feedback [C] //Proc of the 37th Int Conf on Machine Learning. New York: ACM, 2020: 10799−10808
[8]	Gibert D, Mateu C, Planes J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges[J]. Journal of Network and Computer Applications, 2020, 153: 102526 doi: 10.1016/j.jnca.2019.102526
[9]	Bavota G, Oliveto R, Gethers M, et al. Methodbook: Recommending move method refactorings via relational topic models[J]. IEEE Transactions on Software Engineering, 2014, 40(7): 671−694 doi: 10.1109/TSE.2013.60
[10]	Sarhan Q I, Ahmed B S, Bures M, et al. Software module clustering: An in-depth literature analysis[J]. IEEE Transactions on Software Engineering, 2022, 48(6): 1905−1928 doi: 10.1109/TSE.2020.3042553
[11]	Zhou Yang, Cheng Hong, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718−729 doi: 10.14778/1687627.1687709
[12]	Karande V, Chandra S, Lin Z, et al. BCD: Decomposing binary code into components using graph-based clustering [C] //Proc of the 2018 on Asia Conf on Computer and Communications Security. New York: ACM, 2018: 393−398
[13]	Yang Can, Xu Zhengzi, Chen Hongxu, et al. ModX: Binary level partially imported third-party library detection via program modularization and semantic matching [C] //Proc of the 44th Int Conf on Software Engineering. New York: ACM, 2022: 1393−1405
[14]	Xia Hong, Zhang Yongkang, Chen Yanping, et al. Software module clustering using the hierarchical clustering combination method [C] //Proc of the 7th Int Conf on Cloud Computing and Big Data Analytics. Piscataway, NJ: IEEE, 2022: 155−160
[15]	Papachristou M. Software clusterings with vector semantics and the call graph [C] //Proc of the 27th ACM Joint Meeting on European Software Engineering Conf and Symp on the Foundations of Software Engineering (ESEC/FSE). New York: ACM, 2019: 1184−1186
[16]	Pan Weifeng, Song Beibei, Li Kangshun, et al. Identifying key classes in object-oriented software using generalized k-core decomposition[J]. Future Generations Computer Systems, 2018, 81: 188−202 doi: 10.1016/j.future.2017.10.006
[17]	Newman M E J. Fast algorithm for detecting community structure in network [J]. arXiv preprint, arXiv: cond-mat/0309508, 2003
[18]	Newman M E J, Girvan M. Finding and evaluating community structure in networks [J]. arXiv preprint, arXiv: cond-mat/ 0308217, 2003
[19]	Candela I, Bavota G, Russo B, et al. Using cohesion and coupling for software remodularization: Is it enough[J]. ACM Transactions on Software Engineering & Methodology, 2016, 25(24): 1−28
[20]	Huang Yuyao, Shu Hui, Kang Fei. DeMal: Module decomposition of malware based on community discovery[J]. Computers & Security, 2022, 117: 102680
[21]	Lian Wen, Kirk D, Dromey R G. Software systems as complex networks [C] //Proc of the 6th IEEE Int Conf on Cognitive Informatics. Piscataway, NJ: IEEE, 2007: 106−115
[22]	Li Hao, Wang Tian, Xu Xinxin, et al. Modeling software systems as complex networks: Analysis and their applications [J]. Mathematical Problems in Engineering, 2020, 2020: 1−7
[23]	Nielson H R. Principles of Program Analysis [M]. Berlin: Springer, 1999
[24]	Hey-rays. IDA Pro-A powerful disassembler and a versatile debugger [EB/OL]. [2023-04-17].https://www.hex-rays.com/ida-pro
[25]	Radare2. Libre reversing framework for Unix geeks [EB/OL]. [2023-04-17].https://github.com/radareorg/radare2
[26]	NSA. A software reverse engineering suite of tools developed by, NSA’s research directorate in support of the cybersecurity mission [EB/OL]. [2023-04-17].https://github.com/NationalSecurityAgency/ghidra/releases
[27]	Henderson A, Prakash A, Yan L, et al. Make it work, make it right, make it fast: Building a platform-neutral whole-system dynamic binary analysis platform [C] //Proc of the 2014 Int Symp on Software Testing and Analysis. Piscataway, NJ: IEEE, 2014: 248−258
[28]	Song D, Brumley D, Yin Heng, et al. BitBlaze: A new approach to computer security via binary analysis [C] //Proc of the 4th Int Conf on Information Systems Security. Berlin: Springer, 2008: 1−25
[29]	Yan S, Wang Ruoyu, Salls C, et al. SOK: (State of) The art of war: Offensive techniques in binary analysis [C] //Proc of the 37th IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2016: 138−157
[30]	Cai Hongyun, Zheng V W, Chang C C. A comprehensive survey of graph embedding: Problems, techniques and applications[J]. IEEE Transactions on Knowledge & Data Engineering, 2018, 30(9): 1616−1637
[31]	Wang Daixin, Cui Peng, Zhu Wenwu. Structural deep network embedding [C] //Proc of the 22nd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2016: 1225−1234
[32]	Cao Shaosheng, Lu Wei, Xu Qiongkai. Deep neural networks for learning graph representations [C] //Proc of the 30th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2016: 1145−1152
[33]	Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks [J]. arXiv preprint, arXiv: 1609.02907, 2016
[34]	Velikovi P, Cucurull G, Casanova A, et al. Graph attention networks [J]. arXiv preprint, arXiv: 1710.10903, 2017
[35]	Wang Chun, Pan Shirui, Hu Ruiqi, et al. Attributed graph clustering: A deep attentional embedding approach [C] //Proc of the 28th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2019: 3670−3676
[36]	Intel. Pin-A dynamic binary instrumentation tool [EB/OL]. [2023-04-07].https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html
[37]	Levine J R. Linkers and Loaders [M]. San Francisco, CA: Morgan Kaufmann, 1999
[38]	Hu Jie, Li Shen, Samuel A, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011−2023 doi: 10.1109/TPAMI.2019.2913372
[39]	Garey M R, Johnson D S. Computers and Intractability: A Guide to the Theory of NP-Completeness [M]. New York: W. H. Freeman and Company, 1979
[40]	Github. Dexter-POSGrabber [EB/OL]. [2023-04-17].https://github.com/whobin/Dexter-POSGrabber
[41]	Xie Junyuan, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis [C] //Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 478−487
[42]	Xiang Shunhua, Nie Feiping, Zhang Changshui. Learning a Mahalanobis distance metric for data clustering and classification[J]. Pattern Recognition, 2008, 41(12): 3600−3612 doi: 10.1016/j.patcog.2008.05.018
[43]	Rokon M, Islam R, Darki A, et al. SourceFinder: Finding malware source-code from publicly available repositories [C] //Proc of the 23rd Int Symp on Research in Attacks, Intrusions and Defenses (RAID 2020). Berkeley, CA: USENIX Association, 2020: 149−163
[44]	Fahad A, Alshatri N, Tari Z, et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis[J]. IEEE Transactions on Emerging Topics in Computing, 2014, 2(3): 267−279 doi: 10.1109/TETC.2014.2330519
[45]	Kuhn H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics, 2010, 52(1/2): 7−21
[46]	Snoek J, Larochelle H, Adams R P. Practical Bayesian optimization of machine learning algorithms [C] //Proc of the 25th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2012: 2951−2959
[47]	Xiao Xueli, Yan Ming, Basodi S, et al. Efficient hyperparameter optimization in deep learning using a variable length genetic algorithm [J]. arXiv preprint, arXiv: 2006.12703, 2020
[48]	Maclaurin D, Duvenaud D, Adams R P. Gradient-based hyperparameter optimization through reversible learning [J]. arXiv preprint, arXiv: 1502.03492, 2015

[1]	Cao Yiran, Zhu Youwen, He Xingyu, Zhang Yue. Utility-Optimized Local Differential Privacy Set-Valued Data Frequency Estimation Mechanism[J]. Journal of Computer Research and Development, 2022, 59(10): 2261-2274. DOI: 10.7544/issn1000-1239.20220504
[2]	Hong Jinxin, Wu Yingjie, Cai Jianping, Sun Lan. Differentially Private High-Dimensional Binary Data Publication via Attribute Segmentation[J]. Journal of Computer Research and Development, 2022, 59(1): 182-196. DOI: 10.7544/issn1000-1239.20200701
[3]	Wu Wanqing, Zhao Yongxin, Wang Qiao, Di Chaofan. A Safe Storage and Release Method of Trajectory Data Satisfying Differential Privacy[J]. Journal of Computer Research and Development, 2021, 58(11): 2430-2443. DOI: 10.7544/issn1000-1239.2021.20210589
[4]	Zhang Yuxuan, Wei Jianghong, Li Ji, Liu Wenfen, Hu Xuexian. Graph Degree Histogram Publication Method with Node-Differential Privacy[J]. Journal of Computer Research and Development, 2019, 56(3): 508-520. DOI: 10.7544/issn1000-1239.2019.20170886
[5]	Zhu Weijun, You Qingguang, Yang Weidong, Zhou Qinglei. Trajectory Privacy Preserving Based on Statistical Differential Privacy[J]. Journal of Computer Research and Development, 2017, 54(12): 2825-2832. DOI: 10.7544/issn1000-1239.2017.20160647
[6]	Wu Yingjie, Zhang Liqun, Kang Jian, Wang Yilei. An Algorithm for Differential Privacy Streaming Data Adaptive Publication[J]. Journal of Computer Research and Development, 2017, 54(12): 2805-2817. DOI: 10.7544/issn1000-1239.2017.20160555
[7]	Wang Liang, Wang Weiping, Meng Dan. Privacy Preserving Data Publishing via Weighted Bayesian Networks[J]. Journal of Computer Research and Development, 2016, 53(10): 2343-2353. DOI: 10.7544/issn1000-1239.2016.20160465
[8]	Lu Guoqing, Zhang Xiaojian, Ding Liping, Li Yanfeng, Liao Xin. Frequent Sequential Pattern Mining under Differential Privacy[J]. Journal of Computer Research and Development, 2015, 52(12): 2789-2801. DOI: 10.7544/issn1000-1239.2015.20140516
[9]	Ouyang Jia, Yin Jian, Liu Shaopeng, Liu Yubao. An Effective Differential Privacy Transaction Data Publication Strategy[J]. Journal of Computer Research and Development, 2014, 51(10): 2195-2205. DOI: 10.7544/issn1000-1239.2014.20130824
[10]	Ni Weiwei, Chen Geng, Chong Zhihong, Wu Yingjie. Privacy-Preserving Data Publication for Clustering[J]. Journal of Computer Research and Development, 2012, 49(5): 1095-1104.

Cited By

Cited by

Periodical cited type(5)

1.	张涵，于航，周继威，白云开，赵路坦. 面向隐私计算的可信执行环境综述. 计算机应用. 2025(02): 467-481 .
2.	付裕，林璟锵，冯登国. 虚拟化与密码技术应用:现状与未来. 密码学报(中英文). 2024(01): 3-21 .
3.	徐传康，李忠月，刘天宇，种统洪，杨发雪. 基于可信执行环境的汽车域控系统安全研究. 汽车实用技术. 2024(15): 18-25+73 .
4.	徐文嘉，岑孟杰，陈亮. 隐私保护下单细胞RNA测序数据细胞分类研究. 医学信息学杂志. 2024(10): 86-89 .
5.	孙钰，熊高剑，刘潇，李燕. 基于可信执行环境的安全推理研究进展. 信息网络安全. 2024(12): 1799-1818 .