LUNF: A Cluster Job Scheduling Strategy Using Characterization of Nodes' Failure

武林平; 孟丹; 梁毅; 涂碧波; 王磊

Abstract

Abstract: Good scalability lets a cluster system reach the required computing power simply by adding nodes. As the number of nodes grows, however, node failures become commonplace, and their impact on system performance is an unavoidable problem when operating large-scale clusters, so new ways of accommodating node failures are needed. Job scheduling is an important part of cluster operating system software and is responsible for efficient resource management and sensible job scheduling. Functionally, a cluster job scheduling system can be divided into two parts: a job selection policy and a node allocation policy. Drawing on the characteristics of node failures in cluster systems, this paper proposes the longest uptime node first (LUNF) node allocation policy. Simulation results show that, compared with random node allocation, LUNF reduces both the average job response time and the average job slowdown by about 10%.
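The abstract names the allocation rule but gives no implementation details. As a rough illustration only, the sketch below assumes that each idle node reports its current uptime and that the allocator simply ranks idle nodes by that value. The `Node` class, the `uptime_hours` field, and the `allocate_lunf` function are hypothetical names introduced here, not taken from the paper; tie-breaking and the behavior when too few nodes are idle are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    uptime_hours: float  # hypothetical field: time since the node last came up


def allocate_lunf(idle_nodes, requested):
    """Pick `requested` idle nodes, longest current uptime first.

    Returns None when there are not enough idle nodes (assumed behavior:
    the job simply keeps waiting in the queue).
    """
    if requested > len(idle_nodes):
        return None
    # Rank idle nodes by uptime, largest first, and take the top `requested`.
    ranked = sorted(idle_nodes, key=lambda n: n.uptime_hours, reverse=True)
    return ranked[:requested]


# Made-up example data, for illustration only.
idle = [Node("n1", 12.0), Node("n2", 340.5), Node("n3", 96.0), Node("n4", 2.5)]
chosen = allocate_lunf(idle, 2)
print([n.name for n in chosen])  # prints ['n2', 'n3']
```

A random allocation baseline, which the abstract uses for comparison, would replace the `sorted` call with a random sample of the idle nodes; everything else in the scheduler can stay unchanged, which is what makes the node allocation policy separable from the job selection policy.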
