Abstract:
As one of the most important fault-tolerance techniques, coordinated checkpoint-based rollback recovery has been widely adopted in large-scale parallel computer systems. The coordination protocol and checkpoint image storage are the two major factors that determine the overhead of a parallel checkpointing system. This paper proposes a novel application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, the system has the following advantages: 1) it exploits the near-neighbor communication pattern of applications, together with a virtual connection method, to reduce the number of internal messages exchanged in the coordination stage and hence the latency of protocol processing; 2) it stores checkpoint images in the Lustre file system to simplify checkpoint file management; and 3) it uses parallel I/O in the image storage stage to improve system performance. Experiments suggest that the proposed approach incurs low runtime overhead and enhances system scalability.
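The message-count argument behind advantage 1) can be illustrated with a rough sketch. It is not the system's protocol. It assumes, purely for illustration, that one coordination message is sent per directed established connection, and that a near-neighbor application only ever establishes connections to its ring neighbors. The names `ring_neighbors`, `full_mesh`, and `coordination_messages` are hypothetical.

```python
def ring_neighbors(n, radius=1):
    """Hypothetical near-neighbor pattern: each rank is connected only to
    the ranks within `radius` positions of it on a ring of n ranks."""
    return {
        r: {(r + d) % n for d in range(-radius, radius + 1) if d != 0}
        for r in range(n)
    }


def full_mesh(n):
    """Baseline: every rank holds a connection to every other rank."""
    return {r: set(range(n)) - {r} for r in range(n)}


def coordination_messages(connections):
    """Assumption: one coordination message per directed connection."""
    return sum(len(peers) for peers in connections.values())


if __name__ == "__main__":
    n = 64
    print("full mesh:    ", coordination_messages(full_mesh(n)))
    print("near-neighbor:", coordination_messages(ring_neighbors(n)))
```

Under these assumptions the full-mesh count grows as n(n-1), while the near-neighbor count grows linearly in n, which is the kind of reduction the coordination stage aims for.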