Abstract:
As the performance of high-performance computers (HPC) rises and their hardware scale keeps growing, achieving highly reliable system operation has become a major challenge in the research and development of tera-scale and peta-scale HPC. Starting from the reliability requirements of HPC, the authors give a comprehensive overview of the reliability technologies currently used in HPC hardware design: fault avoidance, static redundancy, dynamic redundancy, and online replacement. Static redundancy comprises fault-masking techniques such as component redundancy, data path redundancy, and information redundancy; dynamic redundancy comprises fault detection and diagnosis, reconfiguration, and recovery. Combined with online replacement, redundancy can greatly improve system RAS (reliability, availability, serviceability). The application of these technologies in typical IBM, HP, and Cray systems is analyzed in detail. Finally, future trends in reliability technology for peta-scale HPC are discussed. The authors suggest that the development of peta-scale high-performance computers should focus on reliability design for multi-core processors and on comprehensive memory protection, and they point out that blade architecture facilitates modular redundancy and online replacement of components.