A Redundant-Multithreading-Based Multi-Copy GPU Fault-Tolerance Technique
Abstract: As the performance of general-purpose computation on graphics processing units (GPGPU) continues to improve, heterogeneous systems built from CPUs and GPUs have become a research focus in high-performance computing. As parallel computing systems grow, however, their reliability declines, and this has become a significant constraint on scaling parallel computing to large sizes. Because commercial GPGPUs offer only weak fault tolerance, reliability is an especially acute problem for large-scale CPU-GPU heterogeneous parallel systems, which still lack practical fault-tolerance mechanisms. To address this problem, this paper proposes a redundant-multithreading-based multi-copy GPU fault-tolerance technique, RB-TMR (Rollback TMR). Based on the programming model of heterogeneous systems and the characteristics of their programs, the design and implementation of this fault-tolerance mechanism and its compilation framework are analyzed and described in detail. Finally, the technique is implemented on 10 cases and its performance is evaluated. The work offers a new direction for research on fault-tolerance techniques for heterogeneous systems.