Citation: Li Shuaijiang, Zhang Xinyuan, Zhao Jiacheng, Tian Xinghui, Shi Xiyu, Xu Xiaoxin, Cui Huimin. Automatic Insertion of High-Performance Synchronization Primitives for Ascend Processors[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440093
Instruction-level parallelism (ILP) is a classic challenge in processor architecture research. Domain-specific architectures such as the Ascend processor expose more pipeline details to upper-layer software, and compilers or programmers explicitly control inter-pipeline synchronization to exploit ILP. However, the physical synchronization resources shared between pipelines are scarce, which constrains how much ILP can be extracted. To address this issue, a high-performance automatic synchronization-primitive insertion method for the Ascend processor is proposed. By introducing the abstraction of "virtual synchronization resources," the method decouples the insertion of synchronization primitives from the selection of physical synchronization resources. First, a heuristic algorithm inserts virtual synchronization primitives into complex control-flow graphs. Then, the large number of virtual synchronization resources is mapped onto the extremely limited physical synchronization resources through techniques such as virtual-synchronization-primitive merging. At the same time, redundant synchronization primitives are removed based on the partial-order relation between instructions, while preserving program correctness and respecting stringent hardware resource constraints. Experiments on the Ascend 910A platform with instruction-level and operator-level benchmarks show that programs with automatically inserted synchronization primitives achieve performance on par with those whose primitives were manually inserted by expert programmers, while guaranteeing correctness.
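To make the decoupling concrete, the following minimal Python sketch models the mapping stage: virtual synchronization primitives, inserted without regard to resource limits, are greedily assigned to a tiny pool of physical event IDs, reusing an ID once its set-to-wait live range has ended. All names here (VirtualSync, NUM_PHYSICAL_EVENTS, map_virtual_to_physical) are hypothetical, and the greedy interval scheme, reminiscent of linear-scan register allocation, merely stands in for the paper's merging techniques rather than reproducing them.

```python
# Illustrative sketch only: virtual sync primitives are assigned to a small,
# assumed pool of physical event IDs. Real Ascend primitives and resource
# counts differ; all names below are hypothetical.
from dataclasses import dataclass

NUM_PHYSICAL_EVENTS = 4  # assumed size of the physical sync-resource pool

@dataclass
class VirtualSync:
    vid: int        # virtual event ID, effectively unbounded
    set_pos: int    # instruction index where the producer sets the flag
    wait_pos: int   # instruction index where the consumer waits on it

def map_virtual_to_physical(syncs):
    """Greedily share physical IDs: two virtual events may use the same
    physical event if their set->wait live ranges do not overlap."""
    free_until = [0] * NUM_PHYSICAL_EVENTS  # position at which each ID frees up
    mapping = {}
    for s in sorted(syncs, key=lambda s: s.set_pos):
        for pid, free in enumerate(free_until):
            if free <= s.set_pos:             # this physical ID is free again
                mapping[s.vid] = pid
                free_until[pid] = s.wait_pos  # busy until the matching wait
                break
        else:
            raise RuntimeError("pool exhausted: merge virtual syncs first")
    return mapping

# Usage: three dependencies; the first and last can share physical ID 0.
deps = [VirtualSync(0, 1, 4), VirtualSync(1, 5, 9), VirtualSync(2, 2, 7)]
print(map_virtual_to_physical(deps))  # {0: 0, 2: 1, 1: 0}
```

When the pool is exhausted, the method described above would merge virtual primitives (coarsening several dependencies into one flag) rather than fail; the sketch simply raises an error to mark where that fallback would apply.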