Abstract:
                                      Due to the extensive code reuse, homologous binaries are widely found in IoT firmwares. Once a vulnerability is found in one firmware, other firmwares sharing the similar piece of codes are at high risk. Thus, homologous binary search is of great significance to IoT firmware security analysis. However, there are still no scalable and efficient homologous binary search methods for IoT firmwares. The time complexity of the traditional method is O(N), so it is not scalable for large-scale IoT firmwares. In this paper, we design, implement, and evaluate a scalable and efficient homologous binary search scheme for IoT firmwares with time complexity O(lgN). The main idea of our methodology is encoding binary file’s readable strings by deep learning network and then generating a local sensitive Hash of the encoding vector for the fast retrieval. We compiled 893 open source components based on 16 different compile-time parameters, resulting in 71 129 pairs of labeled binary files for training and testing the network model. The results show that our method has better ROC characteristics than the traditional method. In addition, the study case shows that our method can complete one homologous binary file retrieval task for 22 594 firmware in less than 1 second.