ISSN 1000-1239 CN 11-1777/TP

• Paper •

Error-Correcting Techniques for High-Performance Processors

Wang Zhen, Jiang Jianhui, and Yuan Chunxin   

  1. (Ministry of Education Key Laboratory of Embedded Systems and Service Computing, Tongji University, Shanghai 201804) (Department of Computer Science and Technology, Tongji University, Shanghai 201804) (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080)
  • Online: 2008-02-15

Abstract: Downscaling the feature size of CMOS technology yields faster transistors and lower supply voltages. This trend improves the overall performance of integrated circuits, but it also poses growing challenges to the reliability of complex circuits such as microprocessors. Accordingly, fault-tolerant design of high-performance processors is becoming increasingly important. Much work has been done to date on error detection and correction in processors, and several novel fault-tolerant microprocessor architectures have been proposed recently, such as the simultaneously and redundantly threaded processor with recovery. This paper gives a comprehensive survey of conventional and up-to-date error-correction techniques for high-performance processors. A novel taxonomy is presented that categorizes fault-tolerance techniques for processors into clock-level error recovery, instruction-level error recovery, thread-level error recovery, and reconfiguration. Many microarchitecture schemes, prototype systems, and industrial products are analyzed, and detailed fault-tolerance strategies and scheduling algorithms are compared. The survey shows that for modern processors characterized by chip multiprocessing and/or simultaneous multithreading, reliability is improved mostly by fault-tolerance techniques that exploit the inherent replicated hardware resources originally designed to improve performance.

Key words: high-performance processor, error correction, error control code, redundancy, reconfiguration