
    SIMD-RVV Dynamic Binary Translation Optimization: Redundant Configuration Elimination and Hybrid Translation-Driven Cross-Architecture Programming Model Adaptation

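    • Abstract (Chinese version): Thanks to its open-source and modular design, RISC-V has achieved notable success in the embedded domain and is gradually expanding into high-performance computing (HPC). HPC-oriented RISC-V hardware, such as the Sophon SG2042 multi-core processor, has shown performance comparable to its X86/ARM competitors, but an immature software ecosystem remains one of the biggest obstacles to its development. We developed RVBT, a process-level dynamic binary translator for RISC-V, to port the mature X86 software ecosystem to RISC-V platforms and accelerate the adoption of RISC-V in HPC. Because HPC programs rely heavily on SIMD instructions, we focus on the translation performance bottleneck caused by the significant programming-model differences between SIMD and RVV, and propose three optimizations. X86 SIMD hard-codes the data type in the opcode, whereas RVV requires vtype and the mask register to be configured dynamically; direct translation therefore generates a large number of redundant operations, which severely reduces the efficiency of translated execution. By exploiting the locality of data types in programs, the optimizations remove the redundant settings introduced by adapting the programming model across architectures, translate SIMD instructions with a mix of the floating-point and vector extensions, and synchronize data on demand, which greatly improves the efficiency of translated SIMD code. The three optimizations are general and also apply to translating ARM SIMD to RVV. Experiments on SPEC CPU 2006 show that the optimizations achieve average dynamic elimination rates of 100%, 100%, and 56.31% for the csrr, vsetvl, and vsetvli instructions, respectively; on the floating-point benchmarks, the average dynamic elimination rate of mask-setting operations reaches 74.66%, and the average dynamic data synchronization rate is 67.35%. The optimized RVBT reaches an average of 47.39% and 40.06% of native execution efficiency on the integer and floating-point benchmarks, speedups of 1.21× and 8.31× over the unoptimized version, and far exceeds QEMU's 18.84% and 4.81%, showing potential for use in some HPC scenarios.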

       

      Abstract: RISC-V, renowned for its open-source nature and modular design, has achieved remarkable success in embedded systems and is progressively expanding into the high-performance computing (HPC) domain. While RISC-V hardware tailored for HPC, such as the Sophon SG2042 multi-core processor, has demonstrated performance comparable to its X86/ARM counterparts, its underdeveloped software ecosystem remains a critical barrier to broader adoption. To address this challenge, we developed RVBT, a process-level dynamic binary translator for RISC-V, designed to bridge the software gap by porting the mature X86 ecosystem to RISC-V platforms, thereby accelerating RISC-V’s adoption in HPC applications. Because HPC programs use SIMD instructions pervasively, this study tackles the inefficiencies that arise from fundamental differences between the programming models of X86 SIMD and the RISC-V Vector (RVV) extension. Specifically, X86 SIMD hardcodes data types within opcodes, whereas RVV configures the vtype and mask registers dynamically, so direct translation produces many redundant configuration operations. To overcome this, we propose three optimizations, which achieve the following: 1) Redundancy elimination via data type locality. By leveraging the locality of data types in adjacent SIMD operations, we statically analyze and remove redundant configurations of vtype (achieving 100% dynamic elimination rates for csrr and vsetvl, and 56.31% for vsetvli) and redundant mask settings (a 74.66% elimination rate on the floating-point benchmarks). 2) Hybrid translation with on-demand synchronization. We decouple scalar and vectorized floating-point operations, translating X86 SIMD scalar double-precision instructions to RISC-V’s floating-point extensions and reserving RVV for vectorized operations. Data synchronization between scalar and vector registers is optimized through def-use analysis, achieving a 67.35% dynamic synchronization reduction on the floating-point benchmarks.
Experimental results on SPEC CPU 2006 demonstrate significant improvements: the optimized RVBT achieves 47.39% and 40.06% of native execution efficiency on the integer and floating-point benchmarks, respectively, representing speedups of 1.21× and 8.31× over the unoptimized version. RVBT vastly outperforms QEMU (18.84% and 4.81% on the integer and floating-point benchmarks), with a floating-point efficiency 8.33 times that of QEMU, highlighting its potential for deployment in certain HPC scenarios. Crucially, these optimizations are architecture-agnostic: the methodology of exploiting data type locality, hybrid instruction translation, and adaptive synchronization applies equally to translating ARM SIMD (e.g., NEON) to RVV, offering a general framework for cross-ISA binary compatibility. This work provides a technical foundation for breaking the software ecosystem deadlock and advancing RISC-V’s role in HPC.
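The idea behind redundancy elimination via data type locality can be illustrated with a minimal sketch. This is not RVBT's implementation: the names `GuestOp` and `translate_block` are hypothetical, host instructions are represented as plain tuples rather than real RVV assembly, and the analysis is assumed to cover a single straight-line block in which nothing else modifies vtype. The sketch tracks the element width and vector length last configured and emits a new `vsetvli` only when the next guest SIMD instruction needs a different configuration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GuestOp:
    """A guest SIMD instruction (hypothetical model, not RVBT's IR)."""
    name: str  # x86 mnemonic, e.g. "addps"
    sew: int   # element width in bits, implied by the x86 opcode
    vl: int    # element count in a 128-bit register


HostInsn = Tuple  # ("vsetvli", sew, vl) or ("vop", guest_name)


def translate_block(ops: List[GuestOp], eliminate: bool = True) -> List[HostInsn]:
    """Translate one straight-line block of guest SIMD ops.

    With eliminate=False, every op is preceded by its own vsetvli
    (the direct translation). With eliminate=True, a vsetvli is emitted
    only when the required (sew, vl) differs from the one already set.
    """
    out: List[HostInsn] = []
    current: Optional[Tuple[int, int]] = None  # unknown at block entry
    for op in ops:
        wanted = (op.sew, op.vl)
        if not eliminate or wanted != current:
            out.append(("vsetvli", op.sew, op.vl))
            current = wanted
        out.append(("vop", op.name))
    return out


# Adjacent ops tend to share a data type, so most configurations repeat.
EXAMPLE_BLOCK = [
    GuestOp("addps", 32, 4),
    GuestOp("mulps", 32, 4),
    GuestOp("addpd", 64, 2),
    GuestOp("subpd", 64, 2),
    GuestOp("paddd", 32, 4),
]
```

Calling `translate_block(EXAMPLE_BLOCK)` keeps a configuration only at each change of data type, whereas `translate_block(EXAMPLE_BLOCK, eliminate=False)` configures before every operation. A real translator would also have to handle control flow between blocks and any instruction that clobbers the vector configuration, which this sketch ignores.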

       
