    Xie Min, Lu Yutong, Zhou Enqiang, Cao Hongjia, and Yang Xuejun. Implementation and Evaluation of MPI Checkpointing System over Lustre File System[J]. Journal of Computer Research and Development, 2007, 44(10): 1709-1716.

    Implementation and Evaluation of MPI Checkpointing System over Lustre File System

    • As one of the most important fault-tolerance techniques, coordinated checkpoint-based rollback recovery has been adopted in large-scale parallel computer systems. The coordination protocol and checkpoint image storage are the two major factors that affect the overhead of parallel checkpointing systems. This paper proposes a novel application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, the system has the following advantages: 1) it exploits the near-neighbor communication pattern of applications, together with a virtual connection method, to reduce the number of internal messages exchanged in the coordination stage and hence the latency of protocol processing; 2) it stores checkpoint images in the Lustre file system to simplify checkpoint file management; and 3) it performs parallel I/O in the image storage stage to improve system performance. Experiments suggest that the proposed approach incurs low runtime overhead and enhances system scalability.
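
    The message-count argument behind advantage 1) can be illustrated with a small counting sketch. The abstract does not describe the actual protocol, so everything here is an assumption made for illustration: one coordination message is sent per open connection, the "all-to-all" case stands for every process holding a connection to every other process, and the near-neighbor case is a hypothetical 2D periodic grid in which each process talks only to its four neighbors. The function names are invented for this example and are not part of MPICH2.

```python
def full_mesh_markers(n_procs: int) -> int:
    """Coordination messages if every process has a connection to every other one."""
    return n_procs * (n_procs - 1)


def grid_neighbors(rank: int, rows: int, cols: int) -> set[int]:
    """Ranks adjacent to `rank` on a hypothetical rows x cols periodic grid."""
    r, c = divmod(rank, cols)
    neighbors = {
        ((r - 1) % rows) * cols + c,   # up
        ((r + 1) % rows) * cols + c,   # down
        r * cols + (c - 1) % cols,     # left
        r * cols + (c + 1) % cols,     # right
    }
    # On very small grids a neighbor can coincide with the rank itself.
    return neighbors - {rank}


def near_neighbor_markers(rows: int, cols: int) -> int:
    """Coordination messages if only connections to grid neighbors are open."""
    return sum(len(grid_neighbors(rank, rows, cols)) for rank in range(rows * cols))


if __name__ == "__main__":
    rows, cols = 8, 8
    n = rows * cols
    print("all-to-all connections:", full_mesh_markers(n))             # 4032
    print("near-neighbor connections:", near_neighbor_markers(rows, cols))  # 256
```

    Under these assumptions the all-to-all count grows quadratically with the number of processes, while the near-neighbor count grows linearly, which is the kind of reduction the abstract attributes to combining near-neighbor communication with virtual connections.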