Design and Analysis of Reliability and Availability on Sunway TaihuLight

Gao Jiangang; Hu Jin; Gong Daoyong; Fang Yanfei; Liu Xiao; He Wangquan; Jin Lifeng; Zheng Fang; Li Hongliang

doi:10.7544/issn1000-1239.2021.20200967

Abstract

With the rapid growth of system scale and integration density, reliability and availability have become major challenges in building exascale computer systems. This paper presents a comprehensive analysis of the design and implementation of reliability and availability on Sunway TaihuLight, a leadership-class supercomputer. First, the architecture of Sunway TaihuLight is briefly described. Second, the paper presents the system's reliability enhancement techniques together with its active and passive fault tolerance techniques, including fault prediction, active migration, and job local degradation, which together form a multi-level fault tolerance system in which active and passive mechanisms cooperate. Third, based on system failure statistics, the failure distribution and the main sources of failure are analyzed, and the distribution of the time between failures on Sunway TaihuLight is fitted against three typical lifetime distributions: exponential, lognormal, and Weibull. Maximum likelihood estimation and the K-S (Kolmogorov-Smirnov) test indicate that the lognormal distribution fits the empirical failure data best; on this basis, a failure distribution model of Sunway TaihuLight is established and the system's mean time between failures is calculated. In addition, using system operation statistics and tests with real applications, the paper analyzes the accuracy of fault prediction as well as the time overhead and effectiveness of fault tolerance techniques such as active migration and job local degradation. Finally, building on this analysis, the paper offers recommendations for developing highly reliable and highly available exascale computer systems.
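The fitting procedure named in the abstract (maximum likelihood estimation for the exponential, lognormal, and Weibull distributions, followed by a K-S comparison and a mean-time-between-failures estimate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the failure intervals are synthetic values drawn from a lognormal distribution with arbitrary parameters, and only the Python standard library is used. Note also that K-S statistics computed with parameters estimated from the same data are suitable for ranking candidate models but not for reading p-values off the standard K-S table.

```python
import math
import random


def fit_exponential(xs):
    """MLE for the exponential distribution: rate = 1 / sample mean."""
    rate = len(xs) / sum(xs)
    return {"rate": rate}, lambda x: 1.0 - math.exp(-rate * x)


def fit_lognormal(xs):
    """MLE for the lognormal: mean and (biased) std of the log-intervals."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

    def cdf(x):
        z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + math.erf(z))

    return {"mu": mu, "sigma": sigma}, cdf


def fit_weibull(xs, lo=0.01, hi=20.0, iters=60):
    """MLE for the two-parameter Weibull; the shape is found by bisection.

    Assumes the MLE shape lies inside the bracket [lo, hi].
    """
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Profile likelihood equation for the shape; increasing in k.
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = (sum(x ** shape for x in xs) / len(xs)) ** (1.0 / shape)
    return ({"shape": shape, "scale": scale},
            lambda x: 1.0 - math.exp(-((x / scale) ** shape)))


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def compare_fits(intervals):
    """Fit all three candidates and return {name: (ks_distance, params)}."""
    fitters = {
        "exponential": fit_exponential,
        "lognormal": fit_lognormal,
        "weibull": fit_weibull,
    }
    results = {}
    for name, fit in fitters.items():
        params, cdf = fit(intervals)
        results[name] = (ks_statistic(intervals, cdf), params)
    return results


# Synthetic time-between-failures sample (illustrative only, not system data).
random.seed(0)
sample = [random.lognormvariate(3.0, 1.0) for _ in range(2000)]

results = compare_fits(sample)
best = min(results, key=lambda name: results[name][0])

# Mean of a lognormal distribution: exp(mu + sigma^2 / 2).
logn = results["lognormal"][1]
mtbf = math.exp(logn["mu"] + logn["sigma"] ** 2 / 2.0)

for name, (distance, params) in results.items():
    print(name, round(distance, 4), params)
print("best fit:", best, "| lognormal MTBF estimate:", round(mtbf, 2))
```

The candidate with the smallest K-S distance is taken as the best fit, and the mean time between failures then follows from that model's parameters. For the lognormal case this is the closed form exp(mu + sigma^2 / 2), in whatever time unit the intervals were recorded in.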
