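Implementation and Evaluation of MPI Checkpointing System over Lustre File System

Xie Min, Lu Yutong, Zhou Enqiang, Cao Hongjia, and Yang Xuejun

(School of Computer Science, National University of Defense Technology, Changsha 410073) (xmxmxie@gmail.com)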

Publication history
- Published: 2007-10-14


Abstract

Coordinated-checkpoint-based rollback-recovery is an important fault-tolerance technique that has been adopted in large-scale parallel computer systems. Its overhead is determined mainly by the coordination protocol and by the storage of checkpoint images. This paper describes an application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, the system has the following features: 1) the coordination protocol exploits the near-neighbor communication pattern of parallel applications and uses a virtual connection method to reduce the number of internal messages exchanged during the coordination stage, and hence the protocol processing latency; 2) checkpoint images are stored on the Lustre file system, which simplifies checkpoint file management; and 3) parallel I/O is used in the image storage stage to improve performance. Tests with real applications show that the checkpointing system has low runtime overhead and good scalability.

Keywords:
- fault tolerance
- MPICH2
- rollback-recovery
- coordinated checkpointing
- Lustre file system
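
The abstract's first point, that coordinating only over connections the application actually uses reduces protocol messages, can be illustrated with a rough count. The sketch below is not the paper's protocol. It assumes that one marker message is sent per direction on each channel that must be flushed, and that the application communicates on a periodic 2D grid with four nearest neighbors. The function names and the grid layout are hypothetical and chosen only for illustration.

```python
def marker_messages_all_pairs(n):
    """Markers sent if every process flushes a channel to every other process."""
    return n * (n - 1)


def grid_neighbors(rank, rows, cols):
    """Ranks of the nearest neighbors of `rank` on a periodic rows x cols grid.

    Assumption: row-major rank placement with wrap-around at the edges.
    """
    r, c = divmod(rank, cols)
    return {
        ((r - 1) % rows) * cols + c,  # up
        ((r + 1) % rows) * cols + c,  # down
        r * cols + (c - 1) % cols,    # left
        r * cols + (c + 1) % cols,    # right
    } - {rank}


def marker_messages_connected_only(rows, cols):
    """Markers sent if each process flushes only the channels to its neighbors."""
    n = rows * cols
    return sum(len(grid_neighbors(rank, rows, cols)) for rank in range(n))


if __name__ == "__main__":
    rows, cols = 8, 8
    print("all pairs:     ", marker_messages_all_pairs(rows * cols))
    print("connected only:", marker_messages_connected_only(rows, cols))
```

Under these assumptions the all-pairs count grows quadratically with the number of processes, while the connected-only count grows linearly, which is the kind of saving the near-neighbor observation points to.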