ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (7): 1498-1507. doi: 10.7544/issn1000-1239.2018.20180078

Special Issue: 2018 Special Topic on IoT Security


A Large-Scale Cross-Platform Homologous Binary Retrieval Method

Chen Yu1,2,3, Liu Zhongjin4, Zhao Weiwei5, Ma Yuan1,2,3, Shi Zhiqiang1,2,3, Sun Limin1,2,3   

  1(Beijing Key Laboratory of IoT Information Security (Institute of Information Engineering, Chinese Academy of Sciences), Beijing 100093); 2(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093); 3(School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093); 4(National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029); 5(School of Information Science & Engineering, Lanzhou University, Lanzhou 730000)
  • Online: 2018-07-01

Abstract: Because code is reused extensively, homologous binaries are widespread in IoT firmware. Once a vulnerability is found in one firmware image, other images that share similar code are at high risk, so homologous binary search is of great significance to IoT firmware security analysis. However, no scalable and efficient homologous binary search method for IoT firmware exists yet: the traditional method has time complexity O(N), which does not scale to large firmware collections. In this paper, we design, implement, and evaluate a scalable and efficient homologous binary search scheme for IoT firmware with time complexity O(lg N). The main idea is to encode the readable strings of a binary file with a deep learning network and then generate a locality-sensitive hash of the encoding vector for fast retrieval. We compiled 893 open-source components with 16 different compile-time parameters, yielding 71 129 pairs of labeled binary files for training and testing the network model. The results show that our method has better ROC characteristics than the traditional method. In addition, a case study shows that our method completes a homologous binary file retrieval task over 22 594 firmware images in less than 1 second.
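The retrieval step described in the abstract (hash an encoding vector, then look up candidates by hash) can be illustrated with a minimal sketch. It assumes the encoding vectors already exist; the abstract does not say which locality-sensitive hash family is used, so random-hyperplane (sign) hashing stands in here. All names (`LshIndex`, `lsh_signature`, `libfoo_arm`) are hypothetical, and the dictionary lookup is a simplification, not the O(lg N) structure the paper evaluates.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Pack one sign bit per hyperplane into an integer signature."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


class LshIndex:
    """Hypothetical index: groups binaries whose vectors share a signature."""

    def __init__(self, num_bits, dim, seed=0):
        self.hyperplanes = make_hyperplanes(num_bits, dim, seed)
        self.buckets = defaultdict(list)

    def add(self, name, vector):
        self.buckets[lsh_signature(vector, self.hyperplanes)].append(name)

    def query(self, vector):
        # Only the matching bucket is inspected, not the whole collection.
        signature = lsh_signature(vector, self.hyperplanes)
        return list(self.buckets.get(signature, []))


# Stand-in for an encoding vector; a real one would come from the trained network.
index = LshIndex(num_bits=16, dim=8, seed=1)
encoding = [0.5, -1.0, 2.0, 0.1, 0.0, 3.0, -0.7, 1.2]
index.add("libfoo_arm", encoding)
candidates = index.query([2 * x for x in encoding])  # same direction, same bucket
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes and so share a signature, which is why a query only needs to inspect one bucket instead of comparing against every stored binary.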

Key words: binary search, cross-platform, deep learning, recurrent neural networks (RNN), locality-sensitive hash
