Current Situation and Trend of Reliability Technology in High Performance Computers

Huang Yongqin, Jin Lifeng, and Liu Yao

Abstract

As the performance of high performance computers (HPC) keeps rising and their hardware scale keeps growing, achieving reliable system operation has become a major technical challenge in HPC research and development, at tera-scale and especially at peta-scale. Starting from the reliability requirements of HPC, the authors survey the reliability technologies currently used in HPC hardware design: fault avoidance, static redundancy, dynamic redundancy, and online replacement. Static redundancy covers fault-masking techniques such as component redundancy, data path redundancy, and information redundancy; dynamic redundancy covers fault detection and diagnosis, reconfiguration, and recovery. Combined with online replacement, redundancy can greatly improve system RAS (reliability, availability, serviceability). The application of these technologies in typical IBM, HP, and Cray systems is analyzed in detail. Finally, the future trend of reliability technology in peta-scale HPC is discussed: development work should focus on the reliability design of multi-core processors and on all-round memory protection, and a blade architecture facilitates modular redundancy and online replacement of components.
