Abstract:
To meet the needs of scientific research and engineering applications, the performance and scale of high-performance computers keep growing, and the error rates of systems and applications inevitably grow with them. Quickly detecting and locating system-level and application-level errors, and thereby providing high-quality service to users, has become an urgent concern in the design and development of supercomputer systems. Hardware faults and exceptions, as well as software errors, can cause users' large-scale parallel applications to produce errors, hang, or exit. Locating the fault site quickly and accurately, so that administrators or users can diagnose the fault with high precision and efficiency, is an important foundation for maintaining the reliability of high-performance computing systems. Traditional fault location in high-performance computers relies mainly on hardware exception tracing, system log analysis, and active probing by programs; it lacks a means of locating program hangs that leave no log information and show no obvious fault symptoms, and its scalability is also under pressure. Based on the architecture of the "New Generation Sunway Supercomputer" and the characteristics of the SW26010-Pro many-core processor, a runtime fault location method is proposed. Its key technologies include fault correlation analysis based on message passing, online comprehensive analysis and diagnosis based on globally aggregated information, and an abnormal-thread filtering method for Sunway many-core processors. The paper describes how to effectively detect, collect, and process abnormal information from a large number of system resources and parallel processes, providing support for efficient fault location in future ultra-large-scale high-performance computing.
Keywords: exascale computer; reliability; fault location; runtime; many-core processor
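Table 1 Time Overhead Percentage of the NPB3.3.1 Benchmark Suite

Program    Time overhead (%)
FT         1.02
EP         0.06
CG         0.51
LU         0.38
MG         0.21
BT         1.27
SP         1.39

Table 2 Comparison of Error Location Time (unit: s)

Columns: test set; fault type; traditional error location (operating system analysis, maintenance system analysis, MPFL fault analysis); runtime fault location (error detection analysis based on the runtime library, comprehensive diagnosis based on global aggregated information, abnormal thread filtering method). Each row lists its reported times in column order; blank cells are omitted.

Test set   Fault type             Reported times (s)
MsgSet1    Management core        1.7, 2.5, 18.4, 2.4
MsgSet1    Computing core         2.3, 3.2, 18.4, 3.3
MsgSet1    Memory                 1.8, 1.8, 18.9, 2.7
MsgSet1    Network interface      2.6, 3.5, 2.6, 1.6, 19.3
MsgSet1    Operating system       1.6, 2.8, 18.7, 2.4
MsgSet2    Program hang           2.3
MsgSet2    Incorrect result       1.7
MsgSet2    Performance anomaly    18.7
MsgSet2    Message packet loss    2.8, 1.8, 19.5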
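The overhead figures in Table 1 are percentages. The source does not state the formula behind them, so the following is only a minimal sketch that assumes the usual definition: the relative increase in wall-clock time of a run with runtime fault location enabled over a baseline run without it. The function name `overhead_percent` and the timings are hypothetical and are not taken from the paper.

```python
def overhead_percent(baseline_s: float, instrumented_s: float) -> float:
    """Relative slowdown of the instrumented run over the baseline, in percent.

    Assumed definition (not confirmed by the source):
        (instrumented - baseline) / baseline * 100
    """
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (instrumented_s - baseline_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds, not measurements from the paper.
    baseline = 200.0
    instrumented = 202.0
    print(f"{overhead_percent(baseline, instrumented):.2f}%")  # prints 1.00%
```

Under this assumed definition, the 1.39% reported for SP would correspond to roughly 1.39 extra seconds per 100 seconds of baseline run time.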