Dynamic Task Scheduling Model and Fault Tolerance Based on Queuing Theory

He Wangquan; Wei Di; Quan Jianxiao; Wu Wei; Qi Fengbin

doi:10.7544/issn1000-1239.2016.20148445

Dynamic Task Scheduling Model and Fault-Tolerant via Queuing Theory

Abstract

Abstract: Efficient dynamic task scheduling and fault tolerance are among the challenges facing high-performance computing, and most existing methods are difficult to scale efficiently to large systems. To address this problem, we propose a highly scalable dynamic task scheduling model based on N-level queuing theory. The model gives programmers a concise parallel programming framework, which reduces the programming burden. We use Poisson process theory to analyze the average waiting time of task requests, and we decide how to divide the scheduler into levels by comparing that waiting time against a given threshold. Combined with a region-aware lightweight degradation model, the approach reduces the fault-tolerance overhead of large-scale parallel applications and improves system availability. Micro Benchmark tests on 32768 cores of the Sunway BlueLight system show that, for short tasks with an average execution time of 3.4 s, the N-level model scales well and its scheduling overhead is 7.2% of that of the traditional model. On 16384 cores, the overall performance of the pharmaceutical application DOCK improves by 34.3% over the software's original task scheduling. The region-aware lightweight degradation model loses little work after a fault: in the DOCK tests it reduces execution time by 3.75% to 5.13% compared with traditional fault-tolerance methods.
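The abstract does not state which queuing model is used, so the following is only an illustrative sketch of the idea of choosing the number of scheduler levels from a waiting-time threshold. It assumes each scheduler behaves like an M/M/1 queue (Poisson arrivals, exponential service), that every requester asks for one task per average task time, and that every level has the same fan-out. The function names, the service rate of 1000 requests per second, and the 2 ms threshold are hypothetical; only the 3.4 s task time and the 32768 cores come from the abstract.

```python
import math


def mm1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends waiting in an M/M/1 queue (service excluded)."""
    if arrival_rate >= service_rate:
        return math.inf  # unstable queue: the backlog grows without bound
    return arrival_rate / (service_rate * (service_rate - arrival_rate))


def max_group_size(task_time: float, service_rate: float, threshold: float) -> int:
    """Largest number of requesters one scheduler can serve within the threshold.

    Assumption: each requester issues one request per `task_time` seconds,
    so a group of n requesters produces an arrival rate of n / task_time.
    `threshold` must be finite, otherwise the search would never stop.
    """
    if not math.isfinite(threshold):
        raise ValueError("threshold must be finite")
    size = 0
    while mm1_mean_wait((size + 1) / task_time, service_rate) <= threshold:
        size += 1
    return size


def levels_needed(num_workers: int, group_size: int) -> int:
    """Smallest number of levels whose uniform fan-out covers all workers."""
    if group_size < 2:
        raise ValueError("each scheduler must serve at least two requesters")
    levels, capacity = 1, group_size
    while capacity < num_workers:
        capacity *= group_size
        levels += 1
    return levels


if __name__ == "__main__":
    # Hypothetical parameters: 1000 requests/s per scheduler, 2 ms threshold.
    group = max_group_size(task_time=3.4, service_rate=1000.0, threshold=0.002)
    print("group size:", group)
    print("levels for 32768 workers:", levels_needed(32768, group))
```

Under these assumptions a single scheduler saturates well before 32768 requesters, which is why a second level is needed. The paper's own analysis may use a different queuing model and a different rule for dividing levels.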
