    王一超, 林新华, 蔡林金, Tang William, Ethier Stephane, 王蓓, 施忠伟, 松岗聪. 太湖之光上利用OpenACC移植和优化GTC-P[J]. 计算机研究与发展, 2018, 55(4): 875-884. DOI: 10.7544/issn1000-1239.2018.20160871
    Wang Yichao, Lin Xinhua, Cai Linjin, Tang William, Ethier Stephane, Wang Bei, See Simon, Satoshi Matsuoka. Porting and Optimizing GTC-P on TaihuLight Supercomputer with OpenACC[J]. Journal of Computer Research and Development, 2018, 55(4): 875-884. DOI: 10.7544/issn1000-1239.2018.20160871

    太湖之光上利用OpenACC移植和优化GTC-P

    Porting and Optimizing GTC-P on TaihuLight Supercomputer with OpenACC

    • 摘要: 神威“太湖之光”是最新一期Top500榜单上排名第一的超级计算机,实测峰值性能约93PFLOPS.该系统提供了基于指导语句的并行编程工具OpenACC,兼容OpenACC 2.0编程标准,并添加了部分定制化功能.GTC-P是一个具有重要物理意义的科学应用,算法基于高性能计算领域中被广泛使用的PIC(particle-in-cell)方法.利用神威OpenACC并行编程模型在“太湖之光”上成功移植了GTC-P应用.在移植过程中,鉴于OpenACC编译器尚无法解决的性能瓶颈,提出了3种基于中间代码二次开发的优化方法:1)消除原子操作;2)避免低效的全局访存操作;3)手动添加SIMD intrinsics指令.实验结果表明,在64个从核上相比1个主核,优化后的函数charge和push分别实现了1.6倍和8.6倍的加速比,同时GTC-P代码整体取得了2.5倍的加速比.优化结果证明了基于中间代码的手动优化对利用神威OpenACC移植的PIC算法在“太湖之光”上的性能提升非常重要.

       

      Abstract: Sunway TaihuLight, with a sustained performance of 93 PFLOPS, is the No. 1 supercomputer in the latest Top500 list. It provides a high-level directive-based programming tool, Sunway OpenACC, which is compatible with the OpenACC 2.0 standard and adds some customized extensions. GTC-P is a discovery-science-capable, real-world application code based on the particle-in-cell (PIC) algorithm, which is well established in the HPC area. We ported the GTC-P code to the TaihuLight supercomputer with Sunway OpenACC. Because the Sunway OpenACC compiler cannot yet resolve the performance bottlenecks that appear when GTC-P is ported directly to TaihuLight, we applied three optimizations to an “intermediate” version of the code generated by the compiler: 1) eliminating atomic operations; 2) avoiding expensive global memory access instructions; 3) manually adding SIMD intrinsics. Our numerical experiments show that, on 64 CPE cores compared with a single MPE core, these optimizations yield speed-ups of 1.6X and 8.6X for the key PIC kernels charge and push, respectively, and a speed-up of 2.5X for the entire GTC-P code. Our findings demonstrate that manual optimizations on the “intermediate” code are important for achieving significantly improved performance of PIC applications ported to TaihuLight with OpenACC.
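      The abstract does not say how the atomic operations in the charge kernel were eliminated. As a rough illustration of why a PIC charge deposition needs atomics in the first place, and of one common way to avoid them (a private grid copy per worker followed by a reduction), here is a minimal sketch in plain Python. It is an assumption for illustration, not the authors' method: the function names, the 1D grid, and the linear weighting are all hypothetical and are not taken from GTC-P, and the workers are simulated sequentially.

```python
# Illustrative sketch only: 1D particle-in-cell charge deposition with linear
# weighting. All names are hypothetical and are not taken from GTC-P.
# Positions are assumed to lie in the half-open range [0, n_cells).

def deposit_shared(positions, n_cells):
    """Scatter every particle into one shared grid.

    Run in parallel, the two `+=` updates below would race whenever two
    workers touch the same grid point, which is why a direct port of this
    loop needs atomic updates.
    """
    grid = [0.0] * (n_cells + 1)
    for x in positions:
        cell = int(x)        # index of the grid point to the left
        frac = x - cell      # fractional distance from that grid point
        grid[cell] += 1.0 - frac
        grid[cell + 1] += frac
    return grid


def deposit_private(positions, n_cells, n_workers):
    """Give each worker a private grid copy, then sum the copies.

    Workers are simulated one after another here; on real hardware each
    chunk would run on its own core and write only to its own copy.
    """
    chunk = (len(positions) + n_workers - 1) // n_workers
    private = []
    for w in range(n_workers):
        mine = positions[w * chunk:(w + 1) * chunk]
        private.append(deposit_shared(mine, n_cells))
    # Reduction: each grid point is summed independently, so no atomics.
    return [sum(column) for column in zip(*private)]
```

      Both functions produce the same grid for the same particles. The trade-off is memory: the private-copy variant stores one grid per worker, which matters on cores with small local memory.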

       
