Liang Yi, Wang Lei, Fan Jianping, Fang Juan. Research on the Shared Memory-Based Checkpointing for Cluster Services[J]. Journal of Computer Research and Development, 2010, 47(4): 571-580.
1(College of Computer Science, Beijing University of Technology, Beijing 100124) 2(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) 3(Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518054)
Checkpointing for cluster services that relies on secondary storage has a relatively low performance-to-cost ratio. To overcome this defect, this paper presents a shared memory-based checkpointing mechanism for cluster services. The idea is to keep checkpoints in shared memory, which reduces checkpointing and recovery latency compared with secondary storage-based checkpointing. Because shared memory is non-persistent, the mechanism lowers the risk of data loss by organizing all checkpoint servers in the cluster as a single-directed circle. For each cluster service, the checkpoint data is stored both on the local checkpoint server and on that server's predecessor in the single-directed checkpoint circle. A checkpoint management protocol is designed for the dual-stored checkpoint data to keep checkpoint updates consistent. A group membership protocol guarantees that all members of the single-directed checkpoint circle hold a consistent group view, so that the checkpoint data is backed up correctly. The experimental results show that the shared memory-based checkpointing mechanism achieves lower checkpointing and recovery latency. The group membership protocol needs only one round of communication to reach a consistent group view among all checkpoint servers, and therefore incurs low communication overhead.
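The dual-storage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `CheckpointRing`, its method names, the version numbers, and the backup-first write order are all assumptions made for illustration, and the group membership protocol is reduced to removing a failed server from a shared list.

```python
class CheckpointRing:
    """Toy model of checkpoint servers arranged in a single-directed circle.

    All names here are hypothetical; plain dicts stand in for shared memory.
    """

    def __init__(self, servers):
        self.ring = list(servers)               # group view, in ring order
        self.store = {s: {} for s in servers}   # per-server volatile checkpoints

    def predecessor(self, server):
        i = self.ring.index(server)
        return self.ring[(i - 1) % len(self.ring)]

    def save(self, local, service, version, data):
        # Assumed ordering: write the backup copy on the predecessor first,
        # then the local copy, so a local copy never exists without a backup.
        for target in (self.predecessor(local), local):
            self.store[target][service] = (version, data)

    def fail(self, server):
        # A failed server loses its in-memory checkpoints and leaves the view.
        self.ring.remove(server)
        del self.store[server]

    def recover(self, service):
        # Return the newest surviving (version, data) copy, or None.
        copies = [m[service] for m in self.store.values() if service in m]
        return max(copies, key=lambda c: c[0], default=None)


ring = CheckpointRing(["A", "B", "C"])
ring.save("B", "web", 1, "state-1")   # stored on B and on its predecessor A
ring.fail("B")                        # B's memory is lost
restored = ring.recover("web")        # the copy on A survives
```

In this sketch a checkpoint is lost only if the local server and its predecessor both fail before the data is backed up again, which is the risk the dual storage is meant to reduce.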