A Checkpoint System Suited to Parallel Computing on Large-Scale Clusters

Implementation of a Checkpoint System for Large-Scale Parallel Computing

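Summary: Distributed checkpoint systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two main bottlenecks limiting the scalability of parallel checkpoint systems. Drawing on the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images to reduce the overhead of coordinated checkpointing and improve the scalability of the checkpoint system. Preliminary test results indicate that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.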

Abstract: As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors for parallel computing. Two bottlenecks, checkpointing protocol overhead and the storage cost of checkpoint images, limit the scalability of checkpoint systems, which is critical for large-scale clusters. To address these issues, this paper presents the design of the C system, which provides coordinated checkpointing for MPI-based parallel applications based on dynamic virtual connections and distributed checkpoint image storage. The design exploits the characteristics of parallel applications and the capacity of the cluster's local disks to reduce the checkpointing cost of large-scale parallel jobs. The C system is suitable for large-scale clusters, and initial experimental results on a cluster testbed show negligible performance impact from incorporating the mechanism.