
    DCFT-Kernel: Design and Implementation of a Group-Service-Based Fault-Tolerant Management System for Clusters

    DCFT-Kernel: A Fault-Tolerant Cluster Middleware Based on Group Service

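    • Abstract (from the Chinese original): High availability and fault tolerance have become important criteria for evaluating cluster systems (clusters for short). As clusters grow ever larger, implementing fault-tolerant management software for large-scale clusters has become a technical difficulty. Building on the group communication technology of traditional distributed systems, and adopting a "divide and conquer" approach to complex systems, the paper proposes a group service technology that addresses the scalability and high availability of fault-tolerant management software. On top of group services, and in combination with real-time event service technology, it implements DCFT-Kernel, a fault-tolerant management system for large-scale clusters. The paper describes the main technical issues in implementing group services and DCFT-Kernel, and analyzes the performance of DCFT-Kernel.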

       

      Abstract: High availability and fault tolerance are among the most important criteria for evaluating a cluster system. As clusters grow ever larger, implementing system software for fault-tolerant management becomes a difficult technical problem. This paper puts forward the group services method to address the scalability and availability of fault-tolerant management software. The main idea of group services is to divide the cluster into several small partitions and make each partition fault-tolerant, so that the whole system becomes fault-tolerant. Using group services together with real-time event service technology, a fault-tolerant management system named DCFT-Kernel is implemented on the DAWNING-4000A cluster system. The paper concentrates on the group services technology, and also introduces DCFT-Kernel and presents some performance evaluations.

       
