Asynchronous Checkpoint/Restart Based on Memory Buffer
Abstract
As high-performance computer systems accumulate ever larger amounts of computing and memory resources, system reliability is deteriorating. Checkpoint/restart is one of the most representative fault-tolerance techniques. Because the performance of parallel file systems has grown much more slowly than the size of memories and checkpoint files, traditional checkpoint/restart seriously degrades application performance. Exploiting the large amount of available computing and memory resources, we present an asynchronous checkpoint/restart technique based on a memory buffer. We divide the traditional checkpoint/restart procedure into two steps: in the first step, checkpoint files are saved to the local memory of each computing node in parallel; in the second step, a helper task on each computing node copies the checkpoint files from local memory to the parallel file system. The helper task and the computing tasks on each node run asynchronously and in parallel. Owing to the high bandwidth of local memory and the concurrent execution of the helper task and the computing tasks, the asynchronous technique significantly reduces the overhead of checkpoint/restart. This conclusion is verified by experimental results from both simulation and real applications.
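The two-step scheme described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the class name `AsyncCheckpointer`, the function `run_demo`, and the checkpoint file names are hypothetical, a Python thread stands in for the per-node helper task, an in-process queue stands in for the memory buffer, and a temporary directory stands in for the parallel file system.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical two-step checkpointer: buffer in memory, flush in background."""

    def __init__(self, target_dir):
        # target_dir is an assumed stand-in for the parallel file system.
        self.target_dir = target_dir
        self._pending = queue.Queue()  # in-memory buffer of (name, bytes) pairs
        self._helper = threading.Thread(target=self._drain, daemon=True)
        self._helper.start()

    def checkpoint(self, name, state):
        # Step 1: snapshot the state into local memory and return at once,
        # so the computing task is blocked only for the memory copy.
        self._pending.put((name, bytes(state)))

    def _drain(self):
        # Step 2: the helper task copies buffered checkpoints to storage
        # while the computing task keeps running.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            name, data = item
            tmp_path = os.path.join(self.target_dir, name + ".tmp")
            with open(tmp_path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            # Rename only after the data is written, so a crash during the
            # copy does not leave a partial file under the final name.
            os.replace(tmp_path, os.path.join(self.target_dir, name))

    def close(self):
        # Ask the helper to finish the remaining copies, then wait for it.
        self._pending.put(None)
        self._helper.join()


def run_demo():
    with tempfile.TemporaryDirectory() as target_dir:
        ckpt = AsyncCheckpointer(target_dir)
        state = bytearray(b"step-0")
        ckpt.checkpoint("ckpt_0", state)
        state[:] = b"step-1"  # computation continues and mutates the state
        ckpt.checkpoint("ckpt_1", state)
        ckpt.close()
        saved = {}
        for name in sorted(os.listdir(target_dir)):
            with open(os.path.join(target_dir, name), "rb") as f:
                saved[name] = f.read()
        return saved
```

Calling `run_demo()` returns the contents of the two checkpoint files. The first file holds the state as it was when `checkpoint` was called, even though the computation modified the state before the helper wrote it out, because step 1 takes a private copy in memory.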