    Yi Huizhan, Wang Feng, Zuo Ke, Yang Canqun, Du Yunfei, Ma Yaqing. Asynchronous Checkpoint/Restart Based on Memory Buffer[J]. Journal of Computer Research and Development, 2014, 51(6): 1229-1239.

    Asynchronous Checkpoint/Restart Based on Memory Buffer

    • As high-performance computer systems grow in computing and memory resources, system reliability deteriorates. Checkpoint/restart is one of the most representative fault-tolerance techniques. Because the performance of parallel file systems has improved much more slowly than the size of memories and checkpoint files, traditional checkpoint/restart seriously degrades application performance. Exploiting the abundant computing and memory resources, we present an asynchronous checkpoint/restart technique based on a memory buffer. We divide traditional checkpointing into two steps: first, the checkpoint files are saved to local memory on each computing node in parallel; second, a helper task on each computing node copies the checkpoint files from local memory to the parallel file system. The helper task and the computing tasks on each node run asynchronously and in parallel. Owing to the high bandwidth of local memory and the concurrent execution of the helper task and the computing tasks, the asynchronous technique significantly reduces the overhead of checkpoint/restart, a conclusion verified by experimental results from simulation and real applications.
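
    The two-step scheme described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the class `AsyncCheckpointer`, its methods, and `run_demo` are hypothetical names introduced here, a Python thread stands in for the per-node helper task, an in-memory queue stands in for the memory buffer, and a temporary directory stands in for the parallel file system.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Illustrative two-step checkpoint: buffer in memory, flush in background.

    Assumption: all names here are hypothetical, not taken from the paper.
    """

    def __init__(self, directory):
        self.directory = directory      # stand-in for the parallel file system
        self._pending = queue.Queue()   # stand-in for the local memory buffer
        self._helper = threading.Thread(target=self._drain, daemon=True)
        self._helper.start()

    def checkpoint(self, name, state):
        # Step 1: copy the state into local memory and return immediately,
        # so the computing task is blocked only for the duration of the copy.
        self._pending.put((name, bytes(state)))

    def _drain(self):
        # Step 2: the helper task copies buffered checkpoints to slow storage.
        while True:
            item = self._pending.get()
            if item is None:            # shutdown sentinel
                self._pending.task_done()
                break
            name, data = item
            tmp_path = os.path.join(self.directory, name + ".tmp")
            with open(tmp_path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            # Rename only after the write completes, so a restart never
            # sees a partially written checkpoint file.
            os.replace(tmp_path, os.path.join(self.directory, name))
            self._pending.task_done()

    def wait(self):
        # Block until every buffered checkpoint has reached storage.
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._helper.join()

    def restart(self, name):
        # Read a completed checkpoint back from storage.
        with open(os.path.join(self.directory, name), "rb") as f:
            return f.read()


def run_demo():
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(d)
        state = bytearray(b"iteration-0")
        ckpt.checkpoint("step0.ckpt", state)
        state[-1:] = b"1"               # computation continues, mutating state
        ckpt.checkpoint("step1.ckpt", state)
        ckpt.wait()
        restored = (ckpt.restart("step0.ckpt"), ckpt.restart("step1.ckpt"))
        ckpt.close()
        return restored
```

    In this sketch the computing task pays only for the in-memory copy in `checkpoint`, while the slower file write overlaps with further computation. A real deployment would also need to bound the buffer size and handle a node failure that occurs before the helper task has finished flushing, neither of which is modelled here.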