Abstract:
The downscaling of CMOS feature size results in faster transistors and lower supply voltages. This trend improves the overall performance of integrated circuits, but it also makes it harder to ensure the reliability of complex circuits such as microprocessors. Accordingly, fault-tolerant design of high-performance processors is becoming increasingly important. Much work has been done to date on error detection and correction in processors, and several novel fault-tolerant microprocessor architectures have been proposed recently, such as the simultaneously and redundantly threaded processor with recovery. This paper gives a comprehensive survey of conventional and recent error correction techniques for high-performance processors. A novel taxonomy is presented that categorizes fault-tolerant techniques for processors into clock-level error recovery, instruction-level error recovery, thread-level error recovery, and reconfiguration. Many microarchitecture schemes, prototype systems, and industrial products are analyzed, and their detailed fault-tolerance strategies and scheduling algorithms are compared. The survey shows that for modern processors characterized by chip multiprocessing and/or simultaneous multithreading, reliability is improved mainly by fault-tolerance techniques that exploit the inherently replicated hardware resources originally designed to improve performance.