    Gao Jiangang, Zheng Yan, Yu Kang, Peng Dajia, Li Hongliang, Liu Yong, He Wangquan, Chen Dexun, Wang Fei. Runtime Fault Location Method for Sunway Supercomputer[J]. Journal of Computer Research and Development, 2024, 61(1): 86-97. DOI: 10.7544/issn1000-1239.202220821

    Runtime Fault Location Method for Sunway Supercomputer

    • To meet the needs of scientific research and engineering applications, the performance and scale of high-performance computers have grown continuously, and the error rate of systems and applications has inevitably grown with them. Quickly discovering and locating system-level and application-level errors, and thereby providing high-quality service to users, has become an urgent concern in the design and development of supercomputer systems. Hardware failures and exceptions, software errors, and similar faults in a supercomputer system cause users' large-scale parallel applications to hang or exit. Quickly and accurately locating the fault site, so that administrators or users can inspect the abnormal state and diagnose it precisely and efficiently, is an important basis for maintaining the reliability of high-performance computing systems. Based on the architecture of the “New Generation Sunway Supercomputer” and the characteristics of the SW26010-Pro many-core processor, a runtime fault location method is proposed. Its key technologies include fault correlation analysis based on message passing, online comprehensive analysis and diagnosis based on globally aggregated information, and an abnormal-thread filtering method for Shenwei many-core processors. The paper describes how these technologies effectively detect, collect, and process abnormal information from a large number of system resources and parallel processes, in order to address the problem of efficient fault location in future ultra-large-scale high-performance computing systems.