Abstract:
Secondary storage-based checkpointing for cluster services suffers from a relatively low performance-to-cost ratio. To overcome this defect, this paper presents a shared memory-based checkpointing mechanism for cluster services. The idea of the proposed mechanism is to keep checkpoints in shared memory, so as to reduce checkpointing and recovery latency compared with secondary storage-based checkpointing. To lower the risk posed by the non-persistent nature of shared memory, all checkpoint servers in the cluster are organized as a single-directed circle. For each cluster service, the checkpoint data is stored both on the local checkpoint server and on that server's predecessor in the single-directed checkpoint circle. A checkpoint management protocol is designed for the dual-stored checkpoint data to ensure that checkpoint updates are consistent. A group membership protocol is presented to guarantee that all members of the single-directed checkpoint circle hold a consistent group view, so that the checkpoint data is backed up correctly. The experimental results show that the shared memory-based checkpointing mechanism achieves lower checkpointing and recovery latency. The group membership protocol needs only one round of communication to achieve group view consistency among all checkpoint servers, and therefore incurs low communication overhead.
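The dual-storage idea can be illustrated with a minimal Python sketch. It is not the protocol from the paper: the class names (`CheckpointServer`, `CheckpointCircle`), the method names, and the use of an in-process dictionary as a stand-in for a shared memory segment are all assumptions made for illustration. The update-consistency and group membership protocols are not modelled; a failed server is simply skipped when looking for a predecessor.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointServer:
    """Hypothetical checkpoint server; `store` stands in for shared memory."""
    name: str
    alive: bool = True
    store: dict = field(default_factory=dict)


class CheckpointCircle:
    """Hypothetical single-directed circle of checkpoint servers."""

    def __init__(self, names):
        self.servers = [CheckpointServer(n) for n in names]

    def _index(self, name):
        return next(i for i, s in enumerate(self.servers) if s.name == name)

    def predecessor(self, name):
        # Walk backwards around the circle, skipping failed servers.
        i = self._index(name)
        n = len(self.servers)
        for step in range(1, n):
            candidate = self.servers[(i - step) % n]
            if candidate.alive:
                return candidate
        return None  # no other live server in the circle

    def save(self, name, service, data):
        # Store the checkpoint locally and on the predecessor (dual storage).
        local = self.servers[self._index(name)]
        local.store[service] = data
        backup = self.predecessor(name)
        if backup is not None:
            backup.store[service] = data

    def fail(self, name):
        # Simulate a crash: non-persistent memory contents are lost.
        server = self.servers[self._index(name)]
        server.alive = False
        server.store.clear()

    def recover(self, name, service):
        # Prefer the local copy; fall back to the predecessor's copy.
        local = self.servers[self._index(name)]
        if local.alive and service in local.store:
            return local.store[service]
        backup = self.predecessor(name)
        return backup.store.get(service) if backup is not None else None


circle = CheckpointCircle(["A", "B", "C"])
circle.save("B", "web", b"state-1")   # stored on B and on its predecessor A
circle.fail("B")                      # B loses its in-memory checkpoint
restored = circle.recover("B", "web") # served from A's copy
```

In this sketch the checkpoint for a service on server `B` survives the loss of `B` because a second copy sits on `A`; a real implementation would also need the consistent group view described above so that every server agrees on who its predecessor is.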