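• China's Outstanding Sci-Tech Journal
  • CCF-recommended Class A Chinese-language journal
  • T1-class high-quality sci-tech journal in the computing field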
Gao Jiangang, Hu Jin, Gong Daoyong, Fang Yanfei, Liu Xiao, He Wangquan, Jin Lifeng, Zheng Fang, Li Hongliang. Design and Analysis of Reliability and Availability on Sunway TaihuLight[J]. Journal of Computer Research and Development, 2021, 58(12): 2696-2707. DOI: 10.7544/issn1000-1239.2021.20200967

Design and Analysis of Reliability and Availability on Sunway TaihuLight

  • Published Date: November 30, 2021
  • As system scale and integration density grow rapidly, reliability and availability have become major challenges in building exascale computer systems. This paper analyzes in detail the design and implementation of reliability and availability on Sunway TaihuLight, a leadership-class supercomputer. First, the architecture of the Sunway TaihuLight supercomputer is briefly described. Second, the reliability improvement techniques and the active and passive fault-tolerance techniques, including fault prediction, active migration, and job local degradation, are presented, and a multi-level fault-tolerance system in which active and passive techniques collaborate is established on Sunway TaihuLight. Third, the overall failure distribution and the main sources of failures are analyzed on the basis of system failure statistics. Specifically, using three typical lifetime distributions (exponential, lognormal, and Weibull), the paper fits the distribution of failure intervals on Sunway TaihuLight. Maximum likelihood estimation and Kolmogorov-Smirnov (K-S) test results indicate that the lognormal distribution fits the empirical failure data best. A failure distribution model of Sunway TaihuLight is established, and the system's mean time between failures is calculated. Furthermore, the accuracy of fault prediction is studied, and the performance and time overhead of fault-tolerance techniques such as active migration and job local degradation are analyzed using system statistics and application tests. Finally, based on this analysis, several proposals are put forward for enhancing the reliability and availability of future exascale supercomputers.
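
The distribution-fitting step described in the abstract (maximum likelihood fits of the exponential, lognormal, and Weibull distributions to failure intervals, compared with the K-S statistic) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the bisection solver for the Weibull shape, and the synthetic data are all assumptions made for illustration, and the MTBF here is simply taken as the mean of the fitted lognormal.

```python
import math
import random
from statistics import NormalDist


def fit_exponential(xs):
    """MLE for the exponential: rate = 1 / sample mean. Returns the fitted CDF."""
    rate = len(xs) / sum(xs)
    return lambda x: 1.0 - math.exp(-rate * x)


def fit_lognormal(xs):
    """MLE for the lognormal: mean and (biased) std of the log-intervals."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    norm = NormalDist(mu, sigma)
    return (lambda x: norm.cdf(math.log(x))), mu, sigma


def fit_weibull(xs, lo=0.01, hi=50.0, iters=100):
    """MLE for the two-parameter Weibull; the shape k is found by bisection."""
    logs = [math.log(x) for x in xs]
    mean_log, max_log = sum(logs) / len(logs), max(logs)

    def weights(k):
        # x**k rescaled by max(x)**k so the powers cannot overflow
        return [math.exp(k * (v - max_log)) for v in logs]

    def score(k):
        # Likelihood equation for k; increasing in k and zero at the MLE
        w = weights(k)
        return sum(a * b for a, b in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    # scale**k equals the sample mean of x**k at the MLE
    scale = math.exp(max_log) * (sum(weights(k)) / len(xs)) ** (1.0 / k)
    return lambda x: 1.0 - math.exp(-((x / scale) ** k))


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def compare_fits(xs):
    """Fit all three models; return the K-S distance per model and a lognormal MTBF."""
    lognormal_cdf, mu, sigma = fit_lognormal(xs)
    ks = {
        "exponential": ks_statistic(xs, fit_exponential(xs)),
        "lognormal": ks_statistic(xs, lognormal_cdf),
        "weibull": ks_statistic(xs, fit_weibull(xs)),
    }
    mtbf = math.exp(mu + sigma ** 2 / 2)  # mean of the fitted lognormal (assumption)
    return ks, mtbf


# Synthetic failure intervals in hours (hypothetical, NOT Sunway TaihuLight data).
rng = random.Random(0)
intervals = [rng.lognormvariate(3.0, 1.2) for _ in range(2000)]
ks, mtbf = compare_fits(intervals)
best = min(ks, key=ks.get)
print(ks, best, round(mtbf, 1))
```

A smaller K-S distance means the fitted CDF tracks the empirical CDF more closely, so the model with the smallest value is the preferred one. Because the parameters are estimated from the same sample, the standard K-S critical values are only approximate; the sketch compares distances rather than reporting p-values.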
