Abstract:
With the rapid growth in system size and integration, reliability and availability have become major challenges in developing exascale computer systems. This paper thoroughly analyzes the design and implementation of reliability and availability on Sunway TaihuLight, a leadership-class supercomputer. First, the architecture of the Sunway TaihuLight supercomputer is briefly described. Second, the reliability improvement techniques and the active and passive fault tolerance techniques, including fault prediction, active migration, and job local degradation, are presented, together with the multi-level fault tolerance system established on Sunway TaihuLight, in which active and passive techniques collaborate. Third, the overall failure distribution and the main sources of failures are analyzed on the basis of system failure statistics. Specifically, the paper fits the failure interval distribution of Sunway TaihuLight to three typical lifetime distributions: the exponential, the lognormal, and the Weibull. The maximum likelihood estimation and Kolmogorov-Smirnov (K-S) test results indicate that the lognormal distribution best fits the empirical failure data. A failure distribution model of Sunway TaihuLight is established, and the mean time between failures of the system is calculated. Furthermore, the accuracy of fault prediction is studied, and the performance and time overhead of the fault tolerance techniques, such as active migration and job local degradation, are analyzed according to system statistics and application tests. Finally, several proposals for enhancing the reliability and availability of future exascale supercomputers are put forward based on this analysis.
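The distribution-fitting step described above can be illustrated with a minimal sketch. It fits the exponential, lognormal, and Weibull distributions to a list of failure intervals by maximum likelihood and compares them by their K-S statistic. All function names are hypothetical, the data in the usage example is synthetic rather than taken from Sunway TaihuLight, and the paper's actual fitting procedure is not given in this abstract. The Weibull shape is solved by bisection and is assumed to lie between 0.001 and 100. The helper `lognormal_mean` shows one way a mean time between failures could be derived from a fitted lognormal model; whether the paper computes it this way is an assumption.

```python
import math
import random


def fit_exponential(xs):
    """Closed-form MLE: rate = n / sum(x). Returns (cdf, params)."""
    rate = len(xs) / sum(xs)
    return (lambda x: 1.0 - math.exp(-rate * x)), {"rate": rate}


def fit_lognormal(xs):
    """Closed-form MLE: mean and (biased) std of the log-intervals."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

    def cdf(x):
        return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

    return cdf, {"mu": mu, "sigma": sigma}


def fit_weibull(xs, iters=100):
    """MLE for the two-parameter Weibull; shape k found by bisection."""
    logs = [math.log(x) for x in xs]
    n, lmax = len(logs), max(logs)
    mean_log = sum(logs) / n

    def score(k):
        # Likelihood equation for k; x^k is rescaled by max(x)^k to avoid overflow.
        w = [math.exp(k * (v - lmax)) for v in logs]
        return sum(wi * v for wi, v in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    lo, hi = 1e-3, 100.0  # assumed search bracket for the shape parameter
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    mean_w = sum(math.exp(k * (v - lmax)) for v in logs) / n
    scale = math.exp(lmax) * mean_w ** (1.0 / k)
    return (lambda x: 1.0 - math.exp(-((x / scale) ** k))), {"shape": k, "scale": scale}


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    s = sorted(xs)
    n = len(s)
    return max(max(cdf(x) - i / n, (i + 1) / n - cdf(x)) for i, x in enumerate(s))


def best_fit(xs):
    """Fit all three candidates and return the name with the smallest K-S statistic."""
    fits = {
        "exponential": fit_exponential(xs),
        "lognormal": fit_lognormal(xs),
        "weibull": fit_weibull(xs),
    }
    results = {name: (ks_statistic(xs, cdf), params) for name, (cdf, params) in fits.items()}
    best = min(results, key=lambda name: results[name][0])
    return best, results


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution, usable as an MTBF estimate."""
    return math.exp(mu + 0.5 * sigma ** 2)


if __name__ == "__main__":
    # Synthetic failure intervals (arbitrary time units), not measured data.
    rng = random.Random(0)
    intervals = [rng.lognormvariate(3.0, 1.0) for _ in range(5000)]
    best, results = best_fit(intervals)
    for name, (d, params) in results.items():
        print(name, round(d, 4), params)
    print("best:", best)
```

A smaller K-S statistic means the fitted curve stays closer to the empirical distribution everywhere. A full K-S test would also convert the statistic into a p-value, which this sketch omits.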