    Citation: Mao Anqi, Tang Xiaochun, Ding Zhao, Li Zhanhuai. Scalability for Monolithic Schedulers of Cluster Resource Management Framework[J]. Journal of Computer Research and Development, 2021, 58(3): 497-512. DOI: 10.7544/issn1000-1239.2021.20200501

    Scalability for Monolithic Schedulers of Cluster Resource Management Framework

    Abstract: Monolithic (centralized) cluster resource management systems keep the global resource state consistent and support multiple scheduling models, so they are widely used in practice. However, because a single node maintains the global resource state, the load on the resource manager rises sharply when it receives and processes large volumes of periodic heartbeat messages. Its scheduling capacity drops, which limits the scalability of the cluster. To address this problem, this paper proposes the idea of "no change, no update" to replace the periodic update mechanism of the resource manager. First, we introduce a differential heartbeat processing model on the computing nodes: a node whose resource status has not changed does not send a heartbeat message, which reduces both the size and the number of messages. Second, for node failure detection, we propose a ring-based monitoring model in which computing nodes monitor one another for failures, so that the periodic monitoring load shifts from the resource manager to the computing nodes. Finally, we implement both models on YARN, a monolithic resource management system, and evaluate the system before and after the change. The experiments show that, for a cluster of 10 000 nodes with a heartbeat interval of 3 s, the improved YARN raises heartbeat processing efficiency and resource update efficiency by about 40% compared with the original YARN. In addition, the scale of the cluster managed by the improved YARN is more than 1.88 times that of the original YARN.
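The two models described in the abstract can be illustrated with a minimal sketch. It is not the authors' YARN implementation: the class and function names, the fields of `ResourceStatus`, and the choice of ordering the ring by sorted node ID are all assumptions made for illustration. The sketch also ignores the case where several adjacent nodes in the ring fail at once.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class ResourceStatus:
    """Hypothetical per-node resource snapshot (fields are illustrative)."""
    free_memory_mb: int
    free_vcores: int


class DifferentialHeartbeatSender:
    """Emits a status message only when the status differs from the last one sent."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self._last_sent: Optional[ResourceStatus] = None

    def tick(self, current: ResourceStatus) -> Optional[Tuple[str, ResourceStatus]]:
        """Called once per heartbeat interval; returns a message, or None if suppressed."""
        if current == self._last_sent:
            return None  # "no change, no update": nothing is sent this interval
        self._last_sent = current
        return (self.node_id, current)


def ring_watch_targets(node_ids: List[str]) -> Dict[str, str]:
    """Map each node to the single successor it watches, forming one closed ring.

    Ordering by sorted node ID is an assumption of this sketch.
    """
    ordered = sorted(node_ids)
    n = len(ordered)
    if n < 2:
        return {}  # a single node has no peer to watch
    return {ordered[i]: ordered[(i + 1) % n] for i in range(n)}


def detect_failures(watch: Dict[str, str], alive: Set[str]) -> List[Tuple[str, str]]:
    """Return (reporter, failed_node) pairs that live watchers would report."""
    return [
        (watcher, target)
        for watcher, target in sorted(watch.items())
        if watcher in alive and target not in alive
    ]
```

In this sketch, a node that reports the statuses idle, idle, busy, busy, idle over five intervals sends three messages instead of five, and in a ring of `n1`, `n2`, `n3` the failure of `n2` is reported by its watcher `n1` rather than discovered by the resource manager through a missed heartbeat.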

       
