Lightweight Error Recovery Techniques of Many-Core Processor in High Performance Computing
-
Graphical Abstract
-
Abstract
Due to the advances in semiconductor techniques, many-core processors with a large number of cores have been widely used in high-performance computing. Compared with multi-core processors, many-core processors can provide higher computing density and ratio of computation to power consumption. However, many-core processors must design more efficient fault tolerance mechanism to solve the serious reliability problem and alleviate performance degradation, while the cost of chip area and power must be low. In this paper, we present a prototype of home-grown many-core processor DFMC(deeply fused and heterogeneous many-core). Referring to the processor’s architecture and the applications related to the characters among cores, independent and coordinated lightweight error recovery techniques are proposed. When errors are detected, the related cores can roll back to consistent recovery line quickly by coordinated error recovery technique which is controlled by centralized unit and connected by coordinated recovery bus. To guarantee the applications’ performance, error recovery techniques are performed by instructions and recovery states are saved in cores. Our experimental results show that the effect of the techniques is significant, and the transient errors can be corrected by 80% with the chip area increased by 1.257%. The influences of lightweight error recovery techniques on applications performance, chip frequency and chip power consumption are very little. The techniques can improve the fault tolerant ability of the many-core processor.
-
-