Abstract:
Instruction-Level Parallelism (ILP) is a classic challenge in processor architecture research. Domain-specific architectures, such as the Ascend processor, expose more pipeline details to upper-layer software, and compilers/programmers explicitly control the synchronization between pipelines to optimize ILP. However, the physical synchronization resources between pipelines are limited, which limits the improvement of ILP. To address this issue, a high-performance automatic synchronization primitive insertion method for the Ascend processor is proposed. By introducing the abstraction of "virtual synchronization resources," this method decouples the insertion of synchronization primitives from the selection of physical synchronization resources. Firstly, a heuristic algorithm is proposed to insert virtual synchronization primitives in complex control flow graphs. Then, a significant number of virtual synchronization resources are mapped to an extremely limited number of physical synchronization resources through virtual synchronization primitive merging and other techniques. At the same time, redundant synchronization primitives in the program are removed based on the partial order relationship between instructions, while ensuring program correctness and stringent hardware resource constraints. Experiments on the Ascend 910A platform using instruction-level and operator-level benchmark programs show that the programs with automatically inserted synchronization primitives achieve performance comparable to or on par with those manually inserted by expert programmers, while ensuring correctness.