Abstract:
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on parallel computing. Two bottlenecks, checkpointing protocol overhead and storage cost of checkpoint image, limit the scalability of checkpoint system, which is critical to large-scale clusters. To address these issues, the design of C system is presented which provides coordinated checkpointing based on dynamic virtual connection and distributed checkpoint image storage for MPI-based parallel applications. Full use is made of some characteristics of parallel applications and capability of local disks of cluster system to reduce checkpointing cost of large scale parallel job. C system is suitable to large scale cluster and initial experimental results show negligible performance impact due to the incorporation of the mechanism into the C system implemented on the cluster testbed.