Abstract:
Owing to the outstanding scalability of cluster systems, the demand for high performance can be readily met by increasing the number of nodes. However, as cluster systems grow in scale, node failures become commonplace, and new ways are needed to accommodate them. As an important part of cluster operating system software, the job scheduler is responsible for efficient resource management and reasonable scheduling of jobs. Job scheduling in a cluster system is divided into two sub-processes: the job selection strategy and the node allocation policy. In this paper, the LUNF (longest uptime node first) node allocation policy is introduced, based on a characterization of node failures. Simulation results show that the LUNF policy yields better system performance than a random node allocation policy.
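
The abstract does not spell out the mechanics of either policy, so the following is only a minimal sketch of what the name "longest uptime node first" suggests: among the idle nodes, prefer those that have been up the longest. The `Node` type, its fields, the hour unit for uptime, and the function names `allocate_lunf` and `allocate_random` are all assumptions made for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str          # hypothetical node identifier
    uptime: float      # time since the node last came up (hours, assumed unit)
    idle: bool = True  # whether the node is free to accept a job


def allocate_lunf(nodes, count):
    """Pick `count` idle nodes, longest uptime first (assumed reading of LUNF)."""
    idle = [n for n in nodes if n.idle]
    if len(idle) < count:
        return None  # not enough free nodes; the job has to wait
    return sorted(idle, key=lambda n: n.uptime, reverse=True)[:count]


def allocate_random(nodes, count, rng=random):
    """Baseline: pick `count` idle nodes uniformly at random."""
    idle = [n for n in nodes if n.idle]
    if len(idle) < count:
        return None
    return rng.sample(idle, count)


# Made-up cluster state, for illustration only.
cluster = [
    Node("n1", uptime=12.0),
    Node("n2", uptime=340.5),
    Node("n3", uptime=96.0),
    Node("n4", uptime=710.0, idle=False),  # busy, so never selected
    Node("n5", uptime=5.5),
]

chosen = allocate_lunf(cluster, 2)
print([n.name for n in chosen])  # ['n2', 'n3']
```

In this sketch the two policies differ only in how they order the idle nodes, which keeps the node allocation step independent of the job selection strategy, in line with the two sub-processes named above.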