Citation: Ye Guixin, Zhang Yuxiang, Zhang Cheng, Zhao Jiaqi, Wang Huanting. Automatic Optimization Heuristics Method for OpenCL Program Based on Graph Neural Network[J]. Journal of Computer Research and Development, 2023, 60(5): 1121-1135. DOI: 10.7544/issn1000-1239.202110943
The last decade has witnessed the rapid development of heterogeneous computer architectures, driven by the popularization of the Internet of things. As the first cross-platform heterogeneous parallel computing framework, OpenCL (open computing language) has the advantages of standardization and portability. However, OpenCL suffers from limited performance portability because of the complexity and diversity of software and hardware platforms. To address this problem, prior methods leverage deep learning to build optimization models, but their code optimization effect is limited because they capture only the order dependencies of a program while ignoring its syntactic and semantic information. To this end, we propose ACCEPT, an automated optimization-heuristics method for OpenCL programs built on a multi-relational graph neural network. Unlike existing methods, ACCEPT first extracts the deep structural and semantic features of an OpenCL program by constructing a multi-relational code graph, then applies an improved graph neural network to extract a high-dimensional feature representation of the constructed code graph. Finally, a decision neural network yields the optimization parameters. We evaluate ACCEPT on heterogeneous device mapping and thread-coarsening factor prediction tasks. The experimental results show that ACCEPT outperforms state-of-the-art (SOTA) methods: on the heterogeneous device mapping task, it reaches an accuracy of 88.7% and improves the speedup by 7.6% over the SOTA methods; on the thread coarsening task, its speedup is 5.2% higher than that of the SOTA methods.
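The core idea described above, propagating node features over a code graph along several edge relations (e.g., control flow and data flow) and aggregating them per relation, can be illustrated with a minimal sketch. This is not the authors' implementation: the relation names, the ReLU activation, and the use of fixed weight matrices are illustrative assumptions; ACCEPT itself learns these weights with an improved graph neural network.

```python
import numpy as np

def relational_gnn_layer(node_feats, edges_by_relation, rel_weights, self_weight):
    """One message-passing step over a multi-relational code graph (sketch).

    node_feats        : (N, d) array, one feature vector per graph node
    edges_by_relation : dict mapping a relation name to a list of (src, dst) edges
    rel_weights       : dict mapping each relation name to its (d, d) weight matrix
    self_weight       : (d, d) weight matrix for the node's own features
    """
    # Self-loop term: each node transforms its own features.
    agg = node_feats @ self_weight
    # Relation-wise messages: each edge type uses its own transformation,
    # so "control flow" and "data flow" neighbors contribute differently.
    for rel, edges in edges_by_relation.items():
        weight = rel_weights[rel]
        for src, dst in edges:
            agg[dst] += node_feats[src] @ weight
    # Nonlinearity (ReLU) applied to the aggregated messages.
    return np.maximum(agg, 0.0)

# Toy code graph: 3 nodes, two edge relations (names are hypothetical).
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = {"data_flow": [(0, 1)], "control_flow": [(1, 2)]}
identity = np.eye(2)
out = relational_gnn_layer(feats, edges,
                           {"data_flow": identity, "control_flow": identity},
                           identity)
```

Stacking several such layers and pooling the node states into a single graph vector would give the high-dimensional representation that a downstream decision network consumes.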
[1] Dietze R, Rünger G. The search-based scheduling algorithm HP* for parallel tasks on heterogeneous platforms[J]. Concurrency and Computation: Practice and Experience, 2020, 32(21): e5898
[2] Nvidia. OpenCL[EB/OL]. [2021-09-10]. https://developer.nvidia.com/opencl
[3] Pennycook S J, Hammond S D, Wright S A, et al. An investigation of the performance portability of OpenCL[J]. Journal of Parallel and Distributed Computing, 2013, 73(11): 1439−1450. doi: 10.1016/j.jpdc.2012.07.005
[4] Grewe D, Wang Zheng, O'Boyle M F P. Portable mapping of data parallel programs to OpenCL for heterogeneous systems[C/OL]//Proc of the 11th IEEE/ACM Int Symp on Code Generation and Optimization (CGO). Piscataway, NJ: IEEE, 2013 [2021-09-10]. https://ieeexplore.ieee.org/document/6494993
[5] Balasalle J, Lopez M A, Rutherford M J. Optimizing memory access patterns for cellular automata on GPUs[M]//GPU Computing Gems Jade Edition. San Francisco, CA: Morgan Kaufmann, 2012: 67−75
[6] Shen Yuan, Yan Hanbing, Xia Chunhe, et al. A novel method for malware clone detection based on deep learning[J/OL]. Journal of Beijing University of Aeronautics and Astronautics, 2021 [2021-09-10]. https://doi.org/10.13700/j.bh.1001-5965.2020.0400 (in Chinese)
[7] Cummins C, Petoumenos P, Murray A, et al. Compiler fuzzing through deep learning[C]//Proc of the 27th ACM SIGSOFT Int Symp on Software Testing and Analysis. New York: ACM, 2018: 95−105
[8] Lin Ruoqin, Luo Qiong. Software vulnerability detection algorithm based on deformable convolutional neural network[J]. Computer Simulation, 2021, 38(3): 405−409 (in Chinese). doi: 10.3969/j.issn.1006-9348.2021.03.083
[9] Cummins C, Petoumenos P, Wang Zheng, et al. End-to-end deep learning of optimization heuristics[C]//Proc of the 26th Int Conf on Parallel Architectures and Compilation Techniques (PACT). Piscataway, NJ: IEEE, 2017: 219−232
[10] Chen Tianqi, Moreau T, Jiang Ziheng, et al. TVM: An automated end-to-end optimizing compiler for deep learning[C]//Proc of the 13th USENIX Symp on Operating Systems Design and Implementation (OSDI'18). Berkeley, CA: USENIX Association, 2018: 578−594
[11] Malhotra P, Vig L, Shroff G, et al. Long short term memory networks for anomaly detection in time series[C]//Proc of the 23rd European Symp on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium: Louvain-la-Neuve Ciaco, 2015: 89−94
[12] Zhou Jie, Cui Ganqu, Hu Shengding, et al. Graph neural networks: A review of methods and applications[J]. AI Open, 2020, 1: 57−81. doi: 10.1016/j.aiopen.2021.01.001
[13] LLVM. The LLVM compiler infrastructure[EB/OL]. [2021-09-10]. https://llvm.org/
[14] Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks[J]. The Journal of Machine Learning Research, 2016, 17(1): 1−35
[15] Magni A, Dubach C, O'Boyle M. Automatic optimization of thread-coarsening for graphics processors[C]//Proc of the 23rd Int Conf on Parallel Architectures and Compilation. New York: ACM, 2014: 455−466
[16] Cooper K D, Schielke P J, Subramanian D. Optimizing for reduced code space using genetic algorithms[C]//Proc of the 2nd ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems. New York: ACM, 1999: 1−9
[17] Cummins C, Fisches Z V, Ben-Nun T, et al. ProGraML: Graph-based deep learning for program optimization and analysis[J]. arXiv preprint, arXiv: 2003.10536, 2020
[18] Nugteren C, Codreanu V. CLTune: A generic auto-tuner for OpenCL kernels[C]//Proc of the 9th IEEE Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Piscataway, NJ: IEEE, 2015: 195−202
[19] Hamilton W L, Ying R, Leskovec J. Inductive representation learning on large graphs[C]//Proc of the 31st Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2017: 1025−1035
[20] Battaglia P W, Hamrick J B, Bapst V, et al. Relational inductive biases, deep learning, and graph networks[J]. arXiv preprint, arXiv: 1806.01261, 2018
[21] Tzeng E, Hoffman J, Saenko K, et al. Adversarial discriminative domain adaptation[C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 7167−7176
[22] Cummins C, Petoumenos P, Wang Zheng, et al. Synthesizing benchmarks for predictive modeling[C]//Proc of the 15th IEEE/ACM Int Symp on Code Generation and Optimization (CGO). Piscataway, NJ: IEEE, 2017: 86−99
[23] Munshi A, Gaster B, Mattson T G, et al. OpenCL Programming Guide[M]. Cambridge, MA: Addison-Wesley, 2011
[24] Allamanis M, Sutton C. Mining source code repositories at massive scale using language modeling[C]//Proc of the 10th Working Conf on Mining Software Repositories (MSR). Piscataway, NJ: IEEE, 2013: 207−216
[25] Mikolov T, Sutskever I, Chen Kai, et al. Distributed representations of words and phrases and their compositionality[C]//Proc of the 26th Advances in Neural Information Processing Systems. New York: Curran Associates, 2013: 3111−3119
[26] Balles L, Hennig P. Dissecting Adam: The sign, magnitude and variance of stochastic gradients[C]//Proc of the 35th Int Conf on Machine Learning. Cambridge, MA: JMLR, 2018: 404−413
[27] Stratton J A, Rodrigues C, Sung I J, et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing[J/OL]. Center for Reliable and High-Performance Computing, 2012 [2021-09-10]. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.8896&rep=rep1&type=pdf
[28] NVIDIA. NVIDIA CUDA SDK[EB/OL]. 2012 [2021-09-10]. http://developer.nvidia.com/object/cuda.html
[29] AMD. AMD/ATI stream SDK[EB/OL]. 2010 [2021-09-10]. http://www.amd.com/stream/
[30] Danalis A, Marin G, McCurdy C, et al. The scalable heterogeneous computing (SHOC) benchmark suite[C]//Proc of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. New York: ACM, 2010: 63−74
[31] Seo S, Jo G, Lee J. Performance characterization of the NAS parallel benchmarks in OpenCL[C]//Proc of the 12th IEEE Int Symp on Workload Characterization (IISWC). Piscataway, NJ: IEEE, 2011: 137−148
[32] Che Shuai, Boyer M, Meng Jiayuan, et al. Rodinia: A benchmark suite for heterogeneous computing[C]//Proc of the 10th IEEE Int Symp on Workload Characterization (IISWC). Piscataway, NJ: IEEE, 2009: 44−54
[33] Grauer-Gray S, Xu Lifan, Searles R, et al. Auto-tuning a high-level language targeted to GPU codes[C/OL]//Proc of the 1st Innovative Parallel Computing (InPar). Piscataway, NJ: IEEE, 2012 [2021-09-10]. https://ieeexplore.ieee.org/document/6339595
[34] Ben-Nun T, Jakobovits A S, Hoefler T. Neural code comprehension: A learnable representation of code semantics[J]. arXiv preprint, arXiv: 1806.07336, 2018
[35] Brauckmann A, Goens A, Ertel S, et al. Compiler-based graph representations for deep learning models of code[C]//Proc of the 29th Int Conf on Compiler Construction. New York: ACM, 2020: 201−211
[36] Li Yujia, Tarlow D, Brockschmidt M, et al. Gated graph sequence neural networks[J]. arXiv preprint, arXiv: 1511.05493, 2015
[37] Brockschmidt M. GNN-FiLM: Graph neural networks with feature-wise linear modulation[C]//Proc of the 37th Int Conf on Machine Learning. Cambridge, MA: JMLR, 2020: 1144−1152
[38] Schlichtkrull M, Kipf T N, Bloem P, et al. Modeling relational data with graph convolutional networks[C]//Proc of the 15th European Semantic Web Conf. Berlin: Springer, 2018: 593−607
[39] Microsoft. Graph neural network with edge MLPs[EB/OL]. 2017 [2021-09-10]. https://github.com/microsoft/tf2-gnn#schlichtkrull-et-al-2017