Abstract:
High availability and fault tolerance are among the most important criteria for evaluating a cluster system. As cluster systems grow ever larger, however, implementing system software for fault-tolerant management in a cluster becomes a difficult technical problem. This paper proposes the group services method to address the scalability and availability requirements of fault-tolerant management software. The main idea of group services is to divide the cluster system into several small partitions and make each partition fault-tolerant, so that on this basis the whole system can be fault-tolerant. Using group services technology together with real-time event service technology, the fault-tolerant management system software, named DCFT-Kernel, is implemented in the DAWNING-4000A cluster system. The paper focuses on the group services technology, but also provides an introduction to DCFT-Kernel. Furthermore, some performance evaluations are presented.