    Xie Min, Lu Yutong, Zhou Enqiang, Cao Hongjia, and Yang Xuejun. Implementation and Evaluation of MPI Checkpointing System over Lustre File System[J]. Journal of Computer Research and Development, 2007, 44(10): 1709-1716.

    Implementation and Evaluation of MPI Checkpointing System over Lustre File System

    • As one of the most important fault-tolerance techniques, coordinated checkpoint-based rollback recovery has been adopted in large-scale parallel computer systems. The coordination protocol and checkpoint image storage are the two major factors that affect the overhead of a parallel checkpointing system. This paper proposes a novel application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, it has three advantages: 1) it exploits the near-neighbor communication pattern of applications, together with a virtual connection method, to reduce the number of internal messages exchanged in the coordination stage and hence the latency of protocol processing; 2) it stores checkpoint images on the Lustre file system to simplify checkpoint file management; and 3) it performs parallel I/O in the image storage stage to improve system performance. Experiments suggest that the proposed approach incurs low runtime overhead and improves system scalability.
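The message-count argument in point 1) can be illustrated with a rough sketch. It is not the protocol from the paper: it assumes, purely for illustration, that coordination sends one marker per directed channel, and that the application's ranks are laid out on a hypothetical periodic 2D grid where each rank has only ever connected to its grid neighbors. The function names and the grid layout are assumptions introduced here.

```python
def markers_all_pairs(n):
    """Markers sent if every rank flushes a channel to every other rank."""
    return n * (n - 1)


def torus_neighbors(rank, px, py):
    """Distinct neighbors of `rank` on a px-by-py periodic grid.

    The grid layout is a hypothetical stand-in for a near-neighbor
    communication pattern; it is not taken from the paper.
    """
    x, y = rank % px, rank // px
    candidates = {
        ((x - 1) % px) + y * px,  # left
        ((x + 1) % px) + y * px,  # right
        x + ((y - 1) % py) * px,  # up
        x + ((y + 1) % py) * px,  # down
    }
    candidates.discard(rank)  # on very small grids a rank can wrap onto itself
    return candidates


def markers_established_only(px, py):
    """Markers sent if only channels that were actually opened are flushed."""
    return sum(len(torus_neighbors(r, px, py)) for r in range(px * py))


if __name__ == "__main__":
    px, py = 8, 8
    print("all pairs:       ", markers_all_pairs(px * py))
    print("established only:", markers_established_only(px, py))
```

Under these assumptions the all-pairs count grows quadratically with the number of ranks, while the established-only count grows linearly, which is the kind of reduction the abstract attributes to combining near-neighbor communication with virtual connections.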