ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (6): 1271-1280. doi: 10.7544/issn1000-1239.2016.20148445

• Software Technology •

Dynamic Task Scheduling Model and Fault-Tolerant via Queuing Theory

He Wangquan1, Wei Di1, Quan Jianxiao1, Wu Wei1, Qi Fengbin2

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083
  2. National Research Center of Parallel Computer Engineering & Technology, Beijing 100080
  (wangquan_he@163.com)
  • Published: 2016-06-01
  • Online: 2016-06-01
  • Supported by: 
    the National High Technology Research and Development Program of China (863 Program) (2012AA010903); the State Key Laboratory of Computer Architecture Foundation (CARCH201403)

Abstract: Efficient dynamic task scheduling and fault tolerance are among the key challenges in high-performance computing, and existing methods scale poorly to large systems. To address this, we propose a highly scalable dynamic task scheduling model based on N-level queuing theory, which gives programmers a concise parallel programming framework and reduces the programming burden. We use Poisson process theory to analyze the average waiting time of task requests and decide how to layer the scheduler according to a given threshold. Combined with a region-aware light-weight degradation model, the approach lowers the fault-tolerance overhead of large-scale parallel applications and improves system availability. Micro Benchmark tests on 32768 cores of the Sunway BlueLight system show that, for short tasks with an average execution time of 3.4 s, the model scales well and its scheduling overhead is 7.2% of that of the traditional model. On 16384 cores, the pharmacological application DOCK achieves a 34.3% overall performance improvement over its original task scheduling. The region-aware light-weight degradation model loses little work after a failure: in the DOCK tests it reduces execution time by 3.75% to 5.13% compared with traditional fault-tolerance methods.

Key words: queuing theory, dynamic task scheduling, programming framework, fault tolerance, light-weight degradation
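
The abstract does not give the paper's queuing model, so the following is only a rough illustration of how an average waiting time and a threshold could drive a layering decision. It assumes each scheduler behaves as a textbook M/M/1 queue (Poisson arrivals, exponential service) and that every level of the scheduler tree has the same fan-out. The function names, the per-request service time, and the threshold are hypothetical and are not taken from the paper.

```python
import math


def mm1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits in queue (excluding service) in an M/M/1 queue.

    Standard result: W_q = lambda / (mu * (mu - lambda)) for lambda < mu.
    """
    if arrival_rate >= service_rate:
        return math.inf  # queue is unstable: waiting time grows without bound
    return arrival_rate / (service_rate * (service_rate - arrival_rate))


def max_fanout(mean_task_time: float, service_time: float,
               wait_threshold: float, limit: int) -> int:
    """Largest number of children (at most `limit`) one scheduler can serve
    while keeping the mean wait at or below `wait_threshold`.

    Assumption: each child requests a new task once per `mean_task_time`,
    so a scheduler with c children sees arrival rate c / mean_task_time.
    """
    service_rate = 1.0 / service_time
    best = 0
    for children in range(1, limit + 1):
        arrival_rate = children / mean_task_time
        if mm1_mean_wait(arrival_rate, service_rate) > wait_threshold:
            break  # the wait only grows with more children, so stop here
        best = children
    return best


def levels_needed(workers: int, fanout: int) -> int:
    """Number of scheduler levels in a uniform tree that covers all workers."""
    if fanout < 2:
        raise ValueError("fan-out must be at least 2 to build a tree")
    levels, capacity = 1, fanout
    while capacity < workers:
        capacity *= fanout
        levels += 1
    return levels
```

Under these assumptions, a caller would first compute `max_fanout` for the measured task and service times, then pass the result to `levels_needed` together with the core count to obtain the depth of the scheduler tree.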
