Abstract:
To meet the needs of scientific research and engineering applications, the performance of high-performance computers has improved continuously and system scale has kept growing; as a result, the error rate of systems and applications has inevitably risen as well. Quickly detecting and locating system-level and application-level errors, and providing high-quality service to users, have become urgent considerations in the design and development of supercomputer systems. Hardware failures and exceptions, software errors, and similar faults in a supercomputer system can cause large-scale parallel applications to hang or exit abnormally. Locating the fault site quickly and accurately, so that administrators or users can inspect the anomaly and diagnose it precisely and efficiently, is an important basis for maintaining the reliability of high-performance computing systems. Based on the architecture of the “New Generation Sunway Supercomputer” and the characteristics of the SW26010-Pro many-core processor, a runtime fault location method is proposed. Its key technologies include fault correlation analysis based on message passing, online comprehensive analysis and diagnosis based on globally aggregated information, and an abnormal-thread filtering method for Shenwei many-core processors. The paper describes how to effectively detect, collect, and process abnormal information from a large number of system resources and parallel processes, in order to address the problem of efficient fault location in future ultra-large-scale high-performance computing systems.