    Citation: Yi Huizhan, Wang Feng, Zuo Ke, Yang Canqun, Du Yunfei, Ma Yaqing. Asynchronous Checkpoint/Restart Based on Memory Buffer[J]. Journal of Computer Research and Development, 2014, 51(6): 1229-1239.

    Asynchronous Checkpoint/Restart Based on Memory Buffer

    • Abstract: As high-performance computer systems grow in scale, their reliability problems become more severe. Checkpoint/restart is one of the most representative fault-tolerance techniques. However, the write bandwidth of parallel file systems has improved much more slowly than memory capacity and checkpoint file size, so traditional checkpoint/restart imposes a serious performance penalty on applications. Exploiting the abundant computing and memory resources of current systems, we present an asynchronous checkpoint/restart technique based on a memory buffer. The traditional checkpoint operation is divided into two steps: first, checkpoint files are saved in parallel to the local memory of each computing node; second, a separate helper task on each computing node copies the checkpoint files from local memory to the parallel file system. The helper task and the computing tasks on each node run asynchronously and in parallel. Owing to the high bandwidth of local memory and the concurrent execution of the helper task and the computing tasks, the technique significantly reduces checkpoint/restart overhead. Simulation and tests with real applications confirm its effectiveness.
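
    The two-step scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `AsyncCheckpointer`, the functions `restore` and `run_demo`, and the file naming are hypothetical, a background thread stands in for the per-node helper task, and a temporary directory stands in for the parallel file system. The point it shows is that the computing task blocks only for the in-memory copy, while the slow write proceeds concurrently.

```python
import os
import pickle
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical two-step checkpointer: buffer in memory, flush in a helper."""

    def __init__(self, directory):
        self.directory = directory            # stand-in for the parallel file system
        self._buffer = queue.Queue()          # in-memory checkpoint buffer
        self._helper = threading.Thread(target=self._flush_loop, daemon=True)
        self._helper.start()

    def checkpoint(self, step, state):
        # Step 1: snapshot the state into local memory. The caller blocks
        # only for this serialization, not for the write to storage.
        self._buffer.put((step, pickle.dumps(state)))

    def _flush_loop(self):
        # Step 2: the helper drains buffered snapshots to slower storage
        # while the computing task keeps running.
        while True:
            item = self._buffer.get()
            if item is None:                  # sentinel: no more checkpoints
                break
            step, blob = item
            path = os.path.join(self.directory, f"ckpt_{step}.pkl")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(blob)
            # Rename after the write completes, so a restart never
            # reads a partially written checkpoint file.
            os.replace(tmp, path)

    def close(self):
        # Ask the helper to finish and wait until every snapshot is on disk.
        self._buffer.put(None)
        self._helper.join()


def restore(directory, step):
    """Load the checkpoint written for the given step."""
    with open(os.path.join(directory, f"ckpt_{step}.pkl"), "rb") as f:
        return pickle.load(f)


def run_demo():
    """Run three toy iterations with a checkpoint after each, then restore."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(d)
        state = {"iteration": 0, "values": [0.0] * 4}
        for i in range(1, 4):
            state["iteration"] = i
            state["values"] = [v + 1.0 for v in state["values"]]
            ckpt.checkpoint(i, state)         # returns after the memory copy
        ckpt.close()
        return restore(d, 3)
```

    Calling `run_demo()` restores the state saved at the third iteration. In this sketch a checkpoint that is still only in the memory buffer is lost if the node fails before the helper flushes it, so the latest recoverable checkpoint is the last one fully written to storage.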

       
