ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (4): 875-884. doi: 10.7544/issn1000-1239.2018.20160871

• Software Technology •



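  1. 1(Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240); 2(Tokyo Institute of Technology, Tokyo, Japan 1528550); 3(Princeton Plasma Physics Laboratory, Princeton University, Princeton, NJ, USA 08540); 4(NVIDIA, Singapore 138522)
  • Published: 2018-04-01
  • Supported by: 
    National Key Research and Development Program of China (2016YFB0201400, 2016YFB0201800); US National Science Foundation interdisciplinary collaboration project (ACI-1440733); NVIDIA GPU Center of Excellence; Japan Society for the Promotion of Science RONPAKU Program (113209)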

Porting and Optimizing GTC-P on TaihuLight Supercomputer with OpenACC

Wang Yichao1, Lin Xinhua1,2, Cai Linjin1, Tang William3, Ethier Stephane3, Wang Bei3, See Simon1,4, Satoshi Matsuoka2   

  1. 1(Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240); 2(Tokyo Institute of Technology, Tokyo, Japan 1528550); 3(Princeton Plasma Physics Laboratory, Princeton University, Princeton, NJ, USA 08540); 4(NVIDIA, Singapore 138522)
  • Online: 2018-04-01

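Abstract: Sunway TaihuLight ranks first on the latest Top500 list, with a measured performance of about 93 PFLOPS. The system provides OpenACC, a directive-based parallel programming tool that is compatible with the OpenACC 2.0 standard and adds some customized features. GTC-P is a scientific application of significant physical importance whose algorithm is based on the particle-in-cell (PIC) method, which is widely used in high-performance computing. We ported GTC-P to TaihuLight using the Sunway OpenACC parallel programming model. Because the OpenACC compiler cannot yet resolve certain performance bottlenecks encountered during porting, we propose three optimizations applied to the compiler-generated intermediate code: 1) eliminating atomic operations; 2) avoiding inefficient global memory accesses; 3) manually adding SIMD intrinsics. Experimental results show that, on 64 CPE cores compared with 1 MPE core, the optimized charge and push functions achieve speed-ups of 1.6X and 8.6X respectively, and the GTC-P code as a whole achieves a 2.5X speed-up. These results show that manual optimization of the intermediate code is very important for improving the performance on TaihuLight of PIC algorithms ported with Sunway OpenACC.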

Key words: TaihuLight, GTC-P, PIC algorithm, Sunway, OpenACC

Abstract: Sunway TaihuLight, with a sustained performance of 93 PFLOPS, is the No. 1 supercomputer in the latest Top500 list. It provides a high-level directive-based programming tool, Sunway OpenACC, that is compatible with the OpenACC 2.0 standard and adds some customized extensions. GTC-P is a discovery-science-capable real-world application code based on the particle-in-cell (PIC) algorithm, which is well established in the HPC area. Our goal is to port the GTC-P code to the TaihuLight supercomputer with OpenACC. Because the Sunway OpenACC compiler cannot currently resolve the performance bottlenecks that appear when GTC-P is ported directly to TaihuLight, we apply three optimizations to an “intermediate” version of the code generated by the compiler: 1) elimination of atomic operations; 2) avoidance of expensive global memory access instructions; 3) manual addition of SIMD intrinsics. Our numerical experiments show that, on 64 CPE cores compared with 1 MPE core, these optimizations yield speed-ups of 1.6X and 8.6X for the key PIC kernels charge and push, respectively. Overall, the entire GTC-P code runs 2.5X faster. Our findings demonstrate that manual optimizations of the “intermediate” code are important for achieving significantly improved performance of PIC applications on TaihuLight with OpenACC.

Key words: TaihuLight, GTC-P, particle-in-cell (PIC), Sunway, OpenACC
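
The abstract does not describe how atomic operations were eliminated from the charge kernel. One common approach in PIC codes, shown here only as an illustrative sketch and not as the authors' implementation, is to give each worker a private copy of the grid and sum the copies afterwards. The function names, the 1D linear-weight deposition, and the serial emulation of workers are all assumptions made for this example.

```python
# Sketch only: workers are emulated serially and all names are hypothetical.
# It illustrates replacing atomic updates to a shared grid with per-worker
# private grids followed by a reduction.

def deposit_serial(positions, charges, n_cells):
    """Reference 1D linear-weight charge deposition onto n_cells + 1 grid points."""
    grid = [0.0] * (n_cells + 1)
    for x, q in zip(positions, charges):
        cell = int(x)            # left grid point, assuming 0 <= x < n_cells
        frac = x - cell          # fractional distance to the left grid point
        grid[cell] += q * (1.0 - frac)
        grid[cell + 1] += q * frac
    return grid


def deposit_private(positions, charges, n_cells, n_workers):
    """Same deposition, but each worker writes only to its own grid copy."""
    # One private grid per worker, so no two workers update the same array
    # and no atomic operation is needed during deposition.
    private = [[0.0] * (n_cells + 1) for _ in range(n_workers)]
    for w in range(n_workers):
        # Static round-robin split of particles across workers.
        for p in range(w, len(positions), n_workers):
            x, q = positions[p], charges[p]
            cell = int(x)
            frac = x - cell
            private[w][cell] += q * (1.0 - frac)
            private[w][cell + 1] += q * frac
    # Reduction: sum the private copies into the final grid.
    grid = [0.0] * (n_cells + 1)
    for w in range(n_workers):
        for i in range(n_cells + 1):
            grid[i] += private[w][i]
    return grid
```

The trade-off in this pattern is extra memory for the private copies and an additional reduction pass, in exchange for conflict-free writes during deposition.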