    Jia Jia, Yang Xuejun, Li Zhiling. A Redundancy-Multithread-Based Multiple GPU Copies Fault-Tolerance Technique[J]. Journal of Computer Research and Development, 2013, 50(7): 1551-1562.

    A Redundancy-Multithread-Based Multiple GPU Copies Fault-Tolerance Technique

    As GPGPU performance increases, heterogeneous systems consisting of CPUs and GPUs have become a research hotspot in high-performance computing. However, as performance rises, lower reliability becomes the bottleneck for parallel computing systems that scale up to large sizes. Because commercial GPGPUs have little fault-tolerance capability, the reliability problem is acute in CPU-GPU heterogeneous systems, and practical fault-tolerance solutions are lacking. To address this problem, this paper proposes RB-TMR, a redundancy-multithread-based multiple-GPU-copies fault-tolerance technique. Based on the programming model of heterogeneous systems and the characteristics of applications, the paper gives the detailed implementation and the compiling framework of this technique. In experiments, 10 cases are run to evaluate the technique's performance, and the results demonstrate that it offers a novel direction for studying fault-tolerance techniques in heterogeneous systems.
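
    The abstract does not describe how RB-TMR is implemented, so the following is only a generic sketch of the underlying idea, assuming the "TMR" in the name refers to triple modular redundancy: the same computation runs on several redundant copies and a majority vote masks a single faulty result. CPU threads stand in for GPU copies here, and the names `run_with_tmr` and `make_faulty_sum` are hypothetical and not taken from the paper.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_with_tmr(kernel, data, copies=3):
    """Run `kernel` on `copies` redundant threads and majority-vote the results."""
    # Each redundant thread stands in for one GPU copy of the computation
    # (an assumption for illustration; no real GPU is involved).
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(kernel, replica_id, data)
                   for replica_id in range(copies)]
        results = [f.result() for f in futures]

    # Majority vote: accept the value produced by more than half of the copies.
    value, votes = Counter(results).most_common(1)[0]
    if votes <= copies // 2:
        raise RuntimeError("no majority among redundant copies")
    return value


def make_faulty_sum(faulty_replica):
    """Hypothetical kernel: sums the input; one replica returns a corrupted value."""
    def kernel(replica_id, data):
        total = sum(data)
        # Simulate a fault on exactly one replica by perturbing its result.
        return total + 1 if replica_id == faulty_replica else total
    return kernel


if __name__ == "__main__":
    # Replica 1 is corrupted, but the other two copies outvote it.
    print(run_with_tmr(make_faulty_sum(1), [1, 2, 3]))
```

    With three copies, one corrupted result is outvoted by the two correct ones; if all copies disagree, the sketch raises an error instead of guessing.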
