Abstract:
The pursuit of instruction-level parallelism (ILP) runs through both software and hardware research on processor architecture. Traditional general-purpose processors take on the core responsibility for ILP through aggressive microarchitectural optimizations such as out-of-order execution and branch prediction. In rapidly evolving domain-specific architectures, however, these software-transparent ILP optimizations are usually abandoned in pursuit of higher energy and area efficiency, and the parallel control of the underlying execution units is exposed directly to upper-layer software. The Ascend processor is a typical multi-core domain-specific architecture: each core integrates multiple processing pipelines with distinct functions, the hardware gives no guarantee on execution order between pipelines, and both correctness and high performance depend on synchronization statements at the software level. To address this problem, this paper proposes a high-performance method for automatically inserting synchronization primitives for Ascend processors. Its core is the abstraction of "virtual synchronization resources", which decouples the insertion of synchronization primitives from the selection of physical synchronization resources. First, virtual synchronization primitives are inserted using a heuristic method, which resolves a series of problems caused by complex control flow graphs and execution paths. Then, techniques such as merging virtual synchronization primitives solve the problem of mapping a large number of virtual synchronization resources onto a very limited number of physical synchronization resources. In addition, while satisfying program correctness and stringent hardware resource constraints, redundant synchronization primitives are removed according to the partial order between instructions, which further improves hardware utilization. Experiments with instruction-level and operator-level benchmarks on the Ascend 910A platform show that programs with automatically inserted synchronization primitives perform close to, or even slightly better than, programs hand-tuned by expert programmers.
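The idea of removing redundant synchronization from a partial order can be illustrated with a toy model. The sketch below is not the method of this paper; it is a plain transitive reduction under simplifying assumptions: instructions are indexed in issue order, instructions on the same pipeline are assumed to complete in issue order, and each synchronization is a hypothetical pair `(src, dst)` meaning that `dst` must wait for `src`. The function name `remove_redundant_syncs` and the pipeline labels are illustrative only, and physical resource limits are ignored.

```python
from collections import defaultdict


def remove_redundant_syncs(pipes, syncs):
    """Drop sync pairs whose ordering is already implied by other edges.

    pipes: pipeline label of each instruction, indexed in issue order.
    syncs: set of (src, dst) pairs with src < dst.
    Assumption: instructions on the same pipeline complete in issue order.
    """
    # Implicit ordering edges between consecutive instructions of a pipeline.
    implicit = defaultdict(set)
    last_on_pipe = {}
    for i, pipe in enumerate(pipes):
        if pipe in last_on_pipe:
            implicit[last_on_pipe[pipe]].add(i)
        last_on_pipe[pipe] = i

    kept = set(syncs)

    def ordered_without(edge):
        # Depth-first search from src to dst that ignores `edge` itself.
        src, dst = edge
        stack, seen = [src], {src}
        while stack:
            u = stack.pop()
            nexts = set(implicit[u])
            nexts.update(v for (a, v) in kept if a == u and (a, v) != edge)
            for v in nexts:
                if v == dst:
                    return True
                # All edges point forward, so nodes past dst cannot reach it.
                if v < dst and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False

    for edge in sorted(syncs):
        if ordered_without(edge):
            kept.discard(edge)
    return sorted(kept)


# Load (MTE2), two vector operations (V), store (MTE3).
pipes = ["MTE2", "V", "V", "MTE3"]
syncs = {(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)}
print(remove_redundant_syncs(pipes, syncs))  # [(0, 1), (2, 3)]
```

In this example the pairs `(0, 2)`, `(0, 3)` and `(1, 3)` are dropped because the remaining two synchronizations, together with the assumed in-order execution of the vector pipeline, already enforce them.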