Citation: Sun Qingxiao, Yang Hailong. Generalized Stencil Auto-Tuning Framework on GPU Platform[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440612
Stencil computations are widely used in scientific applications, and many HPC platforms exploit the high compute capability of GPUs to accelerate them. In recent years, stencils have grown more complex in stencil order, memory access, and computation pattern. To adapt stencil computations to GPU architectures, the research community has proposed a variety of optimization techniques based on streaming and tiling. Because stencil patterns and GPU architectures are both diverse, no single optimization technique fits every stencil instance, so researchers have developed stencil auto-tuning mechanisms that search the parameter space of a given combination of optimization techniques. However, existing mechanisms incur large offline profiling costs and online prediction overhead, and they cannot flexibly handle arbitrary stencil patterns. To address these problems, this paper proposes GeST, a generalized stencil auto-tuning framework that thoroughly optimizes the performance of stencil computations on GPU platforms. Specifically, GeST constructs a global search space in a zero-padding format and quantifies parameter correlations via the coefficient of variation to generate parameter groups. It then iteratively selects parameter values from these groups, adjusting the sampling ratio according to a reward policy and avoiding redundant executions through hash coding. Experimental results show that GeST identifies better-performing parameter settings in a short time compared with other state-of-the-art auto-tuning works.
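The tuning loop sketched in the abstract can be made concrete. The Python fragment below is a minimal illustration of the described mechanisms only, not GeST's actual implementation: the benchmark hook `measure_kernel`, the multiplicative reward update, the probe count, and the median-CV grouping threshold are all hypothetical placeholders.

```python
import hashlib
import random
import statistics

def coefficient_of_variation(samples):
    # CV = stddev / mean; a high CV indicates a parameter whose value
    # strongly affects the measured kernel runtime.
    if len(samples) < 2:
        return 0.0
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean if mean else 0.0

def config_hash(config):
    # Hash-code a parameter setting so the same configuration is never
    # benchmarked twice (the "redundant execution" filter).
    key = ",".join(f"{k}={config[k]}" for k in sorted(config))
    return hashlib.sha256(key.encode()).hexdigest()

def group_parameters(param_space, measure_kernel, probes=5):
    # Probe each parameter in isolation and split the parameters into
    # high-impact and low-impact groups at the median CV (assumed heuristic).
    base = {p: vals[0] for p, vals in param_space.items()}
    cvs = {}
    for p, vals in param_space.items():
        times = [measure_kernel({**base, p: v})
                 for v in random.sample(vals, min(probes, len(vals)))]
        cvs[p] = coefficient_of_variation(times)
    median = statistics.median(cvs.values())
    return ([p for p, c in cvs.items() if c >= median],
            [p for p, c in cvs.items() if c < median])

def auto_tune(param_space, measure_kernel, budget=100):
    # param_space: parameter name -> candidate values, zero-padded so every
    # stencil pattern shares one global search space.
    # measure_kernel: callable(config) -> runtime in ms (hypothetical hook).
    weights = {p: [1.0] * len(vals) for p, vals in param_space.items()}
    seen, best_cfg, best_time = set(), None, float("inf")
    for _ in range(budget):
        config = {p: random.choices(vals, weights[p])[0]
                  for p, vals in param_space.items()}
        h = config_hash(config)
        if h in seen:
            continue  # skip configurations that were already executed
        seen.add(h)
        runtime = measure_kernel(config)
        if runtime < best_time:
            best_cfg, best_time = config, runtime
            # Reward policy (assumed multiplicative): raise the sampling
            # ratio of every value in the winning configuration.
            for p, vals in param_space.items():
                weights[p][vals.index(config[p])] *= 1.5
    return best_cfg, best_time
```

In GeST proper, the parameter groups would presumably steer which parameters the search refines first; the two routines are shown independently here for brevity.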