Automatic Insertion Method of High-Performance Synchronization Primitives for Ascend Processors

李帅江; 张馨元; 赵家程; 田行辉; 石曦予; 徐晓忻; 崔慧敏

doi:10.7544/issn1000-1239.202440093

Abstract

Instruction-level parallelism (ILP) is a classic problem in processor architecture research. Domain-specific architectures such as the Ascend processor expose more pipeline details to upper-layer software, so the compiler or programmer explicitly controls synchronization between pipelines to optimize ILP. However, the physical synchronization resources between pipelines are limited, which constrains the achievable ILP. To address this problem, a high-performance method for automatically inserting synchronization primitives on the Ascend processor is proposed. By introducing the abstraction of "virtual synchronization resources", the method decouples the insertion of synchronization primitives from the selection of physical synchronization resources. First, a heuristic algorithm inserts virtual synchronization primitives on complex control flow graphs. Then, through virtual synchronization primitive merging and other techniques, the large number of virtual synchronization resources is mapped onto the very limited number of physical synchronization resources. At the same time, redundant synchronization primitives are removed according to the partial order between instructions, while preserving program correctness and satisfying the stringent hardware resource constraints. Experiments on the Ascend 910A platform with instruction-level and operator-level benchmarks show that programs with automatically inserted synchronization primitives remain correct and achieve overall performance close to or on par with programs whose synchronization primitives were inserted manually by expert programmers.
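The idea of removing redundant synchronization based on the partial order between instructions can be illustrated with a small sketch. This is not the paper's algorithm: it assumes straight-line code (no control flow), assumes each pipeline executes its own instructions in order, and models a synchronization as a hypothetical `(producer, consumer)` pair of instruction indices. The pipeline labels and the function names `happens_before` and `remove_redundant` are illustrative only. A synchronization is treated as redundant when the same ordering is already implied by in-pipeline order plus the remaining synchronizations.

```python
from typing import List, Set, Tuple

Sync = Tuple[int, int]  # (producer index, consumer index), producer < consumer


def happens_before(pipes: List[str], syncs: Set[Sync], src: int, dst: int) -> bool:
    """Return True if src is ordered before dst by in-pipeline order plus syncs.

    Assumption: instructions on the same pipeline run in program order, and
    every sync points forward in program order, so one forward sweep suffices.
    """
    reached = [False] * len(pipes)
    reached[src] = True
    for i in range(src + 1, dst + 1):
        reached[i] = any(
            reached[j] and (pipes[j] == pipes[i] or (j, i) in syncs)
            for j in range(src, i)
        )
    return reached[dst]


def remove_redundant(pipes: List[str], syncs: List[Sync]) -> Set[Sync]:
    """Greedily drop syncs whose ordering is implied by the ones that remain."""
    kept = set(syncs)
    # Try the longest-spanning syncs first; ties are broken by the pair itself
    # so the result is deterministic.
    for edge in sorted(syncs, key=lambda e: (e[0] - e[1], e)):
        others = kept - {edge}
        if happens_before(pipes, others, edge[0], edge[1]):
            kept = others
    return kept


# Hypothetical four-instruction program: load, two vector ops, store.
# The pipeline labels are illustrative names, not taken from the paper.
pipes = ["MTE2", "V", "V", "MTE3"]
syncs = [(0, 1), (0, 2), (1, 3), (2, 3)]
minimal = remove_redundant(pipes, syncs)
```

In this example, the sync `(0, 2)` is implied by `(0, 1)` followed by in-order execution on the vector pipeline, and `(1, 3)` is implied by in-order execution followed by `(2, 3)`, so only two of the four synchronizations need to be kept. Handling loops and branches, as the method described above does, requires reasoning over the control flow graph rather than a single forward sweep.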
