    Chen Yu, Liu Zhongjin, Zhao Weiwei, Ma Yuan, Shi Zhiqiang, Sun Limin. A Large-Scale Cross-Platform Homologous Binary Retrieval Method[J]. Journal of Computer Research and Development, 2018, 55(7): 1498-1507. DOI: 10.7544/issn1000-1239.2018.20180078

    A Large-Scale Cross-Platform Homologous Binary Retrieval Method

    • Abstract: Because code is widely reused, homologous binaries are common in IoT device firmware. When a vulnerable binary is disclosed in one firmware image, every other firmware image that contains a homologous binary is also at high risk. Homologous binary retrieval is therefore important for IoT firmware security analysis and emergency response. However, no scalable and effective retrieval method currently exists for binaries in embedded devices. Traditional methods rely on one-to-one matching, so a single query takes O(N) time, which does not scale to large firmware collections. In this paper, we design, implement, and evaluate a homologous binary retrieval method for IoT firmware with O(lg N) time complexity. The main idea is to encode the readable strings of a binary file with a deep learning network and then generate a locality-sensitive hash of the encoding vector for fast retrieval. We compiled 893 open-source components under 16 different compilation parameter settings, producing 71,129 pairs of labeled binary files for training and testing the network model. The results show that our method has better ROC characteristics than the traditional method. In addition, a real-world case study shows that our method completes one homologous binary retrieval task over 22,594 firmware images in less than 1 second.
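
The pipeline in the abstract (extract readable strings, encode them as a vector, hash the vector, look up a bucket) can be illustrated with a minimal Python sketch. Several parts are assumptions made only for illustration and are not taken from the paper: the hashed bag-of-strings `encode` function stands in for the deep learning encoder, random-hyperplane hashing is one common locality-sensitive hash family and may differ from the scheme the authors use, and the names `DIM`, `BITS`, and `LshIndex` are hypothetical.

```python
import hashlib
import random
import re
from collections import defaultdict

DIM = 64   # length of the stand-in encoding vector (assumed value)
BITS = 16  # bits per LSH signature (assumed value)


def readable_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Extract runs of printable ASCII, similar to the Unix `strings` tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]


def encode(strings: list[str]) -> list[float]:
    """Stand-in encoder: hashed bag of strings. NOT the paper's deep network."""
    vec = [0.0] * DIM
    for s in strings:
        h = int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")
        vec[h % DIM] += 1.0
    return vec


def make_planes(seed: int = 0) -> list[list[float]]:
    """Draw BITS random hyperplanes; a fixed seed keeps signatures repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]


def signature(vec: list[float], planes: list[list[float]]) -> int:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    sig = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        sig = (sig << 1) | (dot >= 0)
    return sig


class LshIndex:
    """Single-table index: a query inspects one bucket, not every stored file."""

    def __init__(self, seed: int = 0) -> None:
        self.planes = make_planes(seed)
        self.buckets: dict[int, list[str]] = defaultdict(list)

    def _sig(self, blob: bytes) -> int:
        return signature(encode(readable_strings(blob)), self.planes)

    def add(self, name: str, blob: bytes) -> None:
        self.buckets[self._sig(blob)].append(name)

    def query(self, blob: bytes) -> list[str]:
        return list(self.buckets.get(self._sig(blob), []))


# Two blobs with the same readable strings but different surrounding bytes,
# loosely mimicking one component built for two platforms.
a = b"\x00\x01libcurl/7.50\x00\x02http://example\x00"
b = b"\xff\xfelibcurl/7.50\x7f\x80http://example\x01\x02"

index = LshIndex()
index.add("fw_a/bin", a)
matches = index.query(b)  # both blobs hash to the same bucket
```

Vectors that point in similar directions tend to agree on most signature bits, so near-duplicates are likely, though not guaranteed, to share a bucket. Practical systems usually combine several tables or probe neighbouring buckets to raise recall; this sketch uses a single table for brevity.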

       
