
    Performance Modeling and Optimization for Large-Scale Heterogeneous Consistency Integrated Computing System

    Abstract: As large-scale artificial intelligence applications spread, demand for AI computing power from both industry and academia keeps growing. Heterogeneous consistency integrated computing systems, which combine heterogeneous computing with cache coherence technology, are emerging as an important option for building future intelligent computing centers. However, because heterogeneous computing and cache-coherent interconnect technologies are still immature, existing work struggles to model the performance of such systems, so researchers cannot cheaply evaluate construction schemes, predict computing performance, or assess system optimization methods. This study proposes HCSim, a performance modeling tool for heterogeneous consistency integrated computing systems. HCSim addresses two problems in existing modeling and simulation research: the difficulty of modeling the system's topology and the inaccuracy of workload modeling in consistency systems. It gives researchers a flexible, low-cost, and efficient tool for modeling and evaluating interconnect topologies and AI computing tasks. Using HCSim, this study models a heterogeneous consistency integrated computing system on the order of a thousand interconnected accelerator cards and simulates data-parallel distributed training of the LLAMA2-13B large language model on it, examining how heterogeneous computing power distribution, bandwidth, latency, and task scale affect system performance and task execution efficiency. The study also designs optimization schemes for the communication problems of such systems and validates them with HCSim. The simulation results show that HCSim not only meets the performance modeling needs of heterogeneous consistency integrated computing systems but can also be used to evaluate and verify optimization schemes for them.

       
