LASM: Automatically Lift x86 Inline Assembly for Whole Program Analysis

赖远明; 王喆; 武成岗; 于丁; 侯承轩

doi:10.7544/issn1000-1239.202440881

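Abstract (translated from Chinese): Mixed programming in C/C++ and assembly language is widely used in software development. The complexity of assembly language makes mixed programming error-prone. Static program analysis tools can reduce software vulnerabilities and improve robustness, but tools built on LLVM IR cannot analyze the inline assembly code in a program. Inline assembly affects both the data flow and the control flow of a program. When a static analysis tool ignores assembly code, part of the program's data flow is left out of the analysis, which raises the false positive rate and lowers precision. We design and implement LASM, a system that automatically lifts the x86 inline assembly in C/C++ code to LLVM IR and merges it with the intermediate representation (IR) obtained by compiling the C/C++ code, so that downstream static analysis tools can perform whole-program analysis over all of the program's code, with fewer false positives and higher precision. LASM lifts assembly code to LLVM IR by emulating the hardware execution environment and using LLVM IR instructions to emulate the semantics of assembly instructions. For static analysis, it also provides two optimizations: simplifying the semantics of atomic instructions and lifting the semantics of pointer arithmetic instructions. LASM emulates the semantics of assembly instructions correctly and completely: the LLVM IR produced by lifting assembly code runs correctly after being compiled and linked into an executable. LASM also connects seamlessly to downstream LLVM IR-based static analysis tools and improves their results without any adaptation, which meets the goal of reusing and strengthening a large body of existing analysis tools. Experiments show that LASM can support static program analysis of the Linux kernel. In an experiment that uses SVF to analyze the Linux kernel's accesses to its global variables and classifies each variable by whether the accesses occur during or after the initialization phase, LASM's support raised analysis precision by 8.06 percentage points, reduced the number of misclassified variables by 62.03%, and found 14.80% more protectable variables without introducing new classification errors.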

Abstract: Integrating C/C++ with assembly language is a common practice in software development, particularly in operating system-level software. However, programming with inline assembly is more error-prone because of its complex architecture-specific details and the lack of compiler checks. Modern software engineering employs static analysis to detect bugs and security vulnerabilities, but analyzing inline assembly is beyond the capabilities of LLVM IR-based static analysis tools. If the effects of inline assembly on data flow and control flow are not accounted for, false positives rise and precision declines. We present LASM, a tool that automatically lifts x86 inline assembly code into LLVM IR and integrates it with the IR generated from compiling the C/C++ code. LASM enables downstream static analysis tools to perform whole program analysis on applications containing inline assembly code fragments, thereby reducing false positives and improving precision. It emulates the hardware and the semantics of assembly instructions using LLVM IR. LASM also provides optional optimizations for static analysis that simplify the semantics of atomic instructions and lift the semantics of pointer arithmetic instructions. Because LASM emulates assembly instructions completely and accurately, the LLVM IR it produces can be compiled into an executable program that runs correctly. The strength of LASM lies in its ease of integration: LLVM IR-based static analysis tools can analyze inline assembly code within their existing workflows, without modifications or adaptations to their implementations. LASM thus facilitates the reuse of many existing tools in the ecosystem. Our experimental results indicate that LASM effectively supports SVF-based static analysis of the Linux kernel. This analysis categorizes global variables based on whether they are accessed during the initialization phase or the post-initialization phase. The results show a 62.03% reduction in misclassified variables and an 8.06 percentage point increase in precision. LASM also helps identify 14.80% more protectable variables without introducing new misclassifications. These results highlight LASM’s potential to enhance the effectiveness of static analysis in the presence of inline assembly code.

LASM: Automatically Lift x86 Inline Assembly for Whole Program Analysis