Gao Jiangang, Zheng Yan, Yu Kang, Peng Dajia, Li Hongliang, Liu Yong, He Wangquan, Chen Dexun, Wang Fei. Runtime Fault Location Method for Sunway Supercomputer[J]. Journal of Computer Research and Development, 2024, 61(1): 86-97. DOI: 10.7544/issn1000-1239.202220821

Runtime Fault Location Method for Sunway Supercomputer

  • Author Bio:

    Gao Jiangang: born in 1963. Master, professor. Senior member of CCF. His main research interests include high performance computing and computer architecture.

    Zheng Yan: born in 1977. Bachelor, professor. His main research interests include high performance computing and operating systems.

    Yu Kang: born in 1987. PhD, assistant professor. His main research interests include parallel computing and runtime systems.

    Peng Dajia: born in 1992. PhD candidate, engineer. His main research interests include parallel computing and debugging.

    Li Hongliang: born in 1975. PhD, professor. His main research interests include high performance computing and computer architecture.

    Liu Yong: born in 1981. PhD, associate professor. His main research interests include parallel algorithms.

    He Wangquan: born in 1981. PhD, professor. His main research interests include high performance computing software and architecture.

    Chen Dexun: born in 1973. PhD, professor. His main research interests include high performance computing software and architecture.

    Wang Fei: born in 1981. Master, professor. His main research interests include high performance computing and compilers.

  • Received Date: September 22, 2022
  • Revised Date: April 26, 2023
  • Available Online: November 27, 2023
  • Abstract: To meet the needs of scientific research and engineering applications, the performance and scale of high-performance computers keep growing, and the error rate of systems and applications inevitably grows with them. Quickly detecting and locating system-level and application-level errors, and thereby providing high-quality service to users, has become an urgent concern in the design and development of supercomputer systems. Hardware failures and exceptions, software errors, and similar faults in a supercomputer system can cause large-scale parallel applications to hang or exit abnormally. Locating the fault site quickly and accurately, so that administrators and users can inspect the fault and diagnose it precisely and efficiently, is an important basis for maintaining the reliability of high-performance computing systems. Based on the architecture of the “New Generation Sunway Supercomputer” and the characteristics of the SW26010-Pro many-core processor, a runtime fault location method is proposed. Its key technologies include fault correlation analysis based on message passing, online comprehensive analysis and diagnosis based on globally aggregated information, and an abnormal thread filtering method for Sunway many-core processors. The paper describes how to effectively detect, collect, and process abnormal information from a large number of system resources and parallel processes, in order to address efficient fault location in future ultra-large-scale high-performance computing systems.
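
The abstract names fault correlation analysis based on message passing but does not give the algorithm. The following is only an illustrative sketch of the general idea, not the authors' method: it assumes each process reports a state and, if blocked, the peer rank it is waiting on, and it follows these wait-for chains to the rank where the chain ends. The function name `locate_root_suspects`, the report fields `state` and `waiting_on`, and the state labels are all hypothetical.

```python
def locate_root_suspects(reports):
    """Return ranks that blocked processes ultimately depend on.

    `reports` maps rank -> {"state": "running" | "blocked" | "failed",
                            "waiting_on": peer rank or None}.
    This layout is an assumption made for illustration only.
    """
    suspects = set()
    for rank, info in reports.items():
        if info["state"] != "blocked":
            continue
        seen = {rank}
        cur = info["waiting_on"]
        # Follow the wait-for chain until it reaches a rank that is not
        # blocked (failed, running, or missing a report) or closes a cycle.
        while cur is not None and cur not in seen:
            seen.add(cur)
            peer = reports.get(cur)
            if peer is None or peer["state"] != "blocked":
                break
            cur = peer["waiting_on"]
        if cur is not None:
            suspects.add(cur)
    return suspects


# Hypothetical aggregated reports: ranks 0 and 1 hang because rank 2 failed.
example = {
    0: {"state": "blocked", "waiting_on": 1},
    1: {"state": "blocked", "waiting_on": 2},
    2: {"state": "failed", "waiting_on": None},
    3: {"state": "running", "waiting_on": None},
}
print(sorted(locate_root_suspects(example)))  # [2]
```

In this sketch, ranks 0 and 1 are treated as secondary symptoms and only rank 2 is reported, which mirrors the goal of separating the original fault from the processes that merely hang behind it. A rank with no report at all is also returned as a suspect, on the assumption that a missing report may indicate a node that is down.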
