    Citation: Chen Jingkun, Du Yunfei. An Overlap Store Optimization for Large-Scale Parallel Earth Science Application[J]. Journal of Computer Research and Development, 2019, 56(4): 790-797. DOI: 10.7544/issn1000-1239.2019.20170906

    An Overlap Store Optimization for Large-Scale Parallel Earth Science Application

    Abstract: Storage is an important component of earth science software. Weather forecasting, atmosphere, and ocean simulations periodically write intermediate states and checkpoints during iterative computation, which generates a large volume of output operations, and a poorly designed output scheme severely limits performance in large-scale parallel runs. To address this problem, we propose a software-level overlap store optimization that hides the output cost by dedicating additional I/O processes. The optimization has three main advantages: (1) output is overlapped with computation, so the I/O operation is hidden; (2) the cost of the gather operation is limited, which breaks through the communication-bandwidth bottleneck and the memory-size limit of gathering; (3) the I/O processes can easily use a variety of high-level parallel I/O libraries. We apply the method to WRF, ROMS_AGRIF, and GRAPES on the Tianhe-2 supercomputer and measure their performance after optimization. The tests show an improvement of about 30% to 900% in peak performance. We also discuss the best ratio of compute processes to I/O processes when the total number of processes is fixed. The optimized version is easy to use: compared with the original programs, scientists only need to set two additional variables in the namelist configuration file.
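The overlap idea in the abstract can be illustrated with a minimal single-node sketch. It is an analogy only and not the authors' implementation: the paper uses separate I/O processes in a parallel job, whereas this sketch assumes a Python thread and a queue stand in for an I/O process and its incoming messages. The names `io_worker`, `run_overlapped`, `n_steps`, and `output_every` are hypothetical, and the `time.sleep` call merely stands in for a slow write.

```python
import queue
import threading
import time


def io_worker(outbox, written):
    """Dedicated I/O worker (hypothetical): drains snapshots and 'writes' them."""
    while True:
        item = outbox.get()
        if item is None:  # sentinel: the compute loop has finished
            break
        step, data = item
        time.sleep(0.05)  # stand-in for a slow write to storage
        written.append((step, sum(data)))


def run_overlapped(n_steps=6, output_every=2):
    """Compute loop that hands snapshots to the I/O worker and keeps computing."""
    outbox = queue.Queue()
    written = []
    writer = threading.Thread(target=io_worker, args=(outbox, written))
    writer.start()

    state = [0.0] * 4
    for step in range(1, n_steps + 1):
        state = [x + 1.0 for x in state]  # stand-in for one model time step
        if step % output_every == 0:
            # Hand off a copy and continue; the write happens concurrently.
            outbox.put((step, list(state)))

    outbox.put(None)  # tell the worker there is no more output
    writer.join()
    return written


if __name__ == "__main__":
    print(run_overlapped())
```

The compute loop never waits for a write to finish; it only pays for copying the snapshot into the queue. In a real parallel run the hand-off would be a message to an I/O process, and the ratio of compute to I/O workers becomes a tuning parameter, as the abstract notes.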

       
