Citation: Li Rengang, Tang Yinan, Guo Zhenhua, Wang Li, Zong Zan, Yang Guangwen. Performance Modeling and Optimization for Large-Scale Heterogeneous Consistency Integrated Computing System[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550120

Performance Modeling and Optimization for Large-Scale Heterogeneous Consistency Integrated Computing System

More Information
  • Author Bio:

    Li Rengang: born in 1980. PhD candidate, senior engineer. His main research interest is heterogeneous computing

    Tang Yinan: born in 1993. PhD. His main research interests include heterogeneous computing and simulation systems

    Guo Zhenhua: born in 1988. PhD. His main research interests include computer system architecture and heterogeneous computing

    Wang Li: born in 1989. Master. Her main research interests include heterogeneous computing and artificial intelligence

    Zong Zan: born in 1994. PhD. His main research interests include performance acceleration of distributed deep learning systems and large-scale data processing systems

    Yang Guangwen: born in 1963. PhD, professor, PhD supervisor. His main research interest is high performance computing

  • Received Date: February 28, 2025
  • Revised Date: April 07, 2025
  • Available Online: April 17, 2025
  • Abstract: With the widespread adoption of large-scale artificial intelligence applications, the demand for computing power from both industry and academia keeps growing. Heterogeneous consistency integrated computing systems, which combine heterogeneous computing with cache-consistency interconnect technology, are becoming an important approach to building future intelligent computing centers. However, because heterogeneous computing and consistency interconnect technologies are still maturing, existing research struggles to model the performance of such systems, making it difficult for researchers to evaluate construction schemes, predict computing performance, and assess system optimization methods at low cost. We propose HCSim, a performance modeling tool for heterogeneous consistency integrated computing systems that addresses the challenges of modeling system topology and the inaccuracy of workload modeling in consistency systems. HCSim provides researchers with a flexible, low-cost, and efficient modeling and simulation tool for evaluating interconnect topologies and AI computing tasks. Using HCSim, we model a heterogeneous consistency integrated computing system with thousands of accelerators and simulate data-parallel distributed training of the LLAMA2-13B large language model (LLM) on this system, exploring how heterogeneous computing power distribution, bandwidth, latency, and task scale affect system performance and task execution efficiency. We further design optimization strategies for the communication issues in heterogeneous consistency integrated computing systems and validate their effectiveness using HCSim. The simulation results show that HCSim not only meets the performance modeling needs of heterogeneous consistency integrated computing systems, but can also be applied to evaluate and verify optimization strategies for such systems.
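The paper's modeling details are not reproduced on this page, but the topology-level view described in the abstract (accelerators connected by links with given bandwidth and latency, exercised by data-parallel communication) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the HCSim implementation: it builds a ring interconnect as a NetworkX graph (the paper cites NetworkX [45]) and applies a simple alpha-beta cost model to estimate the per-step ring all-reduce time for the gradients of a 13B-parameter model. The function names, link parameters, and cost formula are all hypothetical.

```python
# Illustrative sketch only -- NOT the HCSim implementation described in the paper.
# Models an interconnect topology as a graph and applies an alpha-beta cost model
# to a ring all-reduce, the dominant communication pattern in data-parallel training.
# All parameter values below are assumptions.
import networkx as nx

def build_ring_topology(num_accel, bandwidth_gbps, latency_us):
    """Connect accelerators in a ring; each edge carries bandwidth/latency attributes."""
    g = nx.Graph()
    for i in range(num_accel):
        j = (i + 1) % num_accel
        g.add_edge(i, j,
                   bandwidth=bandwidth_gbps * 1e9 / 8,  # bytes per second
                   latency=latency_us * 1e-6)           # seconds
    return g

def ring_allreduce_time(g, message_bytes):
    """Alpha-beta estimate: 2(p-1) steps, each moving message_bytes/p over the slowest link."""
    p = g.number_of_nodes()
    worst_bw = min(d["bandwidth"] for _, _, d in g.edges(data=True))
    worst_lat = max(d["latency"] for _, _, d in g.edges(data=True))
    chunk = message_bytes / p
    steps = 2 * (p - 1)
    return steps * (worst_lat + chunk / worst_bw)

if __name__ == "__main__":
    # Hypothetical numbers: 1024 accelerators, 100 Gb/s links, 2 us link latency,
    # ~26 GB of fp16 gradients (13B parameters) exchanged per data-parallel step.
    topo = build_ring_topology(1024, bandwidth_gbps=100, latency_us=2)
    t = ring_allreduce_time(topo, 13e9 * 2)
    print(f"estimated all-reduce time per step: {t:.2f} s")
```

Such a graph-based cost model only sketches the style of analysis; a full simulator like the one the paper describes would additionally model heterogeneous compute rates, consistency traffic, and overlapping of computation with communication.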

  • [1]
    Guo Daya, Yang Dejian, Zhang Haowei, et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint, arXiv: 2501.12948, 2025
    [2]
    Huang Dawei, Yan Chuan, Li Qing, et al. From large language models to large multimodal models: A literature review[J]. Applied Sciences, 2024, 14(12): 5068 doi: 10.3390/app14125068
    [3]
    Jiang Ziheng, Lin Haibin, Zhong Yinmin, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation (NSDI 24). Berkeley, CA: USENIX Association, 2024: 745−760
    [4]
    Yang Zhuoping, Ji Shixin, Chen Xingzhen, et al. Challenges and opportunities to enable large-scale computing via heterogeneous chiplets[C]//Proc of the 29th Asia and South Pacific Design Automation Conf (ASP-DAC). Piscataway, NJ: IEEE, 2024: 765−770
    [5]
    Saghiri A M, Vahidipour S M, Jabbarpour M R, et al. A survey of artificial intelligence challenges: Analyzing the definitions, relationships, and evolutions[J]. Applied Sciences, 2022, 12(8): 4054 doi: 10.3390/app12084054
    [6]
    Rajbhandari S, Ruwase O, Rasley J, et al. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis 2021. New York: ACM, 2021: 1−14
    [7]
    Korthikanti V A, Casper J, Lym S, et al. Reducing activation recomputation in large transformer models[C]//Proc of Machine Learning and Systems 5, 2023: 341−353. https://proceedings.mlsys.org/paper_files/paper/2023/hash/80083951326cf5b35e5100260d64ed81-Abstract-mlsys2023.html
    [8]
    Li Shen, Zhao Yanli, Varma R, et al. PyTorch distributed: Experiences on accelerating data parallel training[J]. arXiv preprint, arXiv: 2006.15704, 2020
    [9]
    Ge Suyu, Zhang Yunan, Liu Liyuan, et al. Model tells you what to discard: Adaptive KV cache compression for LLMs[J]. arXiv preprint, arXiv: 2310.01801, 2023
    [10]
    Chen Zixiang, Deng Yihe, Wu Yue, et al. Towards understanding mixture of experts in deep learning[J]. arXiv preprint, arXiv: 2208.02813, 2022
    [11]
    Xu Yi, Mahar S, Liu Ziheng, et al. CXL shared memory programming: Barely distributed and almost persistent[J]. arXiv preprint, arXiv: 2405.19626, 2024
    [12]
    Schieffer G, Wahlgren J, Ren Jie, et al. Harnessing integrated CPU-GPU system memory for HPC: A first look into Grace Hopper[C]//Proc of the 53rd Int Conf on Parallel Processing. New York: ACM, 2024: 199−209
    [13]
    Xia Jing, Cheng Chuanning, Zhou Xiping, et al. Kunpeng 920: The first 7-nm chiplet-based 64-core ARM SoC for cloud services[J]. IEEE Micro, 2021, 41(5): 67−75 doi: 10.1109/MM.2021.3085578
    [14]
    Fusco L, Khalilov M, Chrapek M, et al. Understanding data movement in tightly coupled heterogeneous systems: A case study with the Grace Hopper superchip[J]. arXiv preprint, arXiv: 2408.11556, 2024
    [15]
    CXL Organization. CXL® Specification[EB/OL]. 2020[2024-12-01]. https://computeexpresslink.org/cxl-specification/.
    [16]
    Gholami A, Yao Zhewei, Kim S, et al. AI and memory wall[J]. IEEE Micro, 2024, 44(3): 33−39
    [17]
    Casanova H. SimGrid: A toolkit for the simulation of application scheduling[C]//Proc of the 1st IEEE/ACM Int Symp on Cluster Computing and the Grid. Piscataway, NJ: IEEE, 2001: 430−437
    [18]
    Casanova H, Giersch A, Legrand A, et al. Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid[J]. Parallel Computing, 2025: 103125. https://www.sciencedirect.com/science/article/pii/S0167819125000018
    [19]
    Saleh E, Shastry C. Simulation and modelling of task migration in distributed systems using SimGrid[C]//Proc of the Int Conf on Modeling, Simulation and Optimization. Singapore: Springer Nature Singapore, 2022: 475−486
    [20]
    Guo Zhenhua, Tang Yinan, Zhai Jidong, et al. A survey on performance modeling and prediction for distributed DNN training[J]. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(12): 2463−2478
    [21]
    Lowe-Power J, Ahmad A M, Akram A, et al. The Gem5 simulator: Version 20.0+[J]. arXiv preprint, arXiv: 2007.03152, 2020
    [22]
    Bellard F. QEMU, a fast and portable dynamic translator[C]//Proc of the 2005 USENIX Annual Technical Conf, FREENIX Track. Berkeley, CA: USENIX Association, 2005: 41−46
    [23]
    Bakhoda A, Yuan G L, Fung W, et al. Analyzing CUDA workloads using a detailed GPU simulator[C]//Proc of 2009 IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2009: 163−174
    [24]
    Li Shang, Yang Zhiyuan, Reddy D, et al. DRAMSim3: A cycle-accurate, thermal-capable DRAM simulator[J]. IEEE Computer Architecture Letters, 2020, 19(2): 106−109 doi: 10.1109/LCA.2020.2973991
    [25]
    Kim Y, Yang Weikun, Mutlu O. Ramulator: A fast and extensible DRAM simulator[J]. IEEE Computer Architecture Letters, 2015, 15(1): 45−49
    [26]
    Henderson T R, Lacage M, Riley G F, et al. Network simulations with the ns-3 simulator[C]//Proc of SIGCOMM Demonstration 2008. New York: ACM, 2008: 527
    [27]
    Varga A. OMNeT++[J]. Modeling and Tools for Network Simulation, 2010: 35−59
    [28]
    Khairy M, Shen Zhesheng, Aamodt T M, et al. Accel-Sim: An extensible simulation framework for validated GPU modeling[C]//Proc of 2020 ACM/IEEE 47th Annual Int Symp on Computer Architecture (ISCA). Piscataway, NJ: IEEE, 2020: 473−486
    [29]
    Won W, Heo T, Rashidi S, et al. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale[C]//Proc of 2023 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2023: 283−294
    [30]
    Moolchandani D, Kundu J, Ruelens F, et al. AMPeD: An analytical model for performance in distributed training of transformers[C]//Proc of 2023 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2023: 306−315
    [31]
    Isaev M, McDonald N, Dennison L, et al. Calculon: A methodology and tool for high-level co-design of systems and large language models[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2023: 1−14
    [32]
    Qi Hang, Sparks E R, Talwalkar A. Paleo: A performance model for deep neural networks[C]//Proc of Int Conf on Learning Representations. Toulon, France 2017: 1−10
    [33]
    Lu Wenyan, Yan Guihai, Li Jiajun, et al. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks[C]//Proc of 2017 IEEE Int Symp on High Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2017: 553−564
    [34]
    Zhu Hongyu, Phanishayee A, Pekhimenko G. Daydream: Accurately estimating the efficacy of optimizations for DNN training[C]//Proc of 2020 USENIX Annual Technical Conf (USENIX ATC 20) . Berkeley, CA: USENIX Association, 2020: 337−352
    [35]
    Hu Hanpeng, Jiang Chenyu, Zhong Yuchen, et al. dPRO: A generic performance diagnosis and optimization toolkit for expediting distributed DNN training[C]//Proc of Machine Learning and Systems. New York: ACM, 2022: 623−637
    [36]
    Lu Guandong, Chen Runzhe, Wang Yakai, et al. DistSim: A performance model of large-scale hybrid distributed DNN training[C]//Proc of the 20th ACM Int Conf on Computing Frontiers. New York: ACM, 2023: 112−122
    [37]
    Santhanam K, Krishna S, Tomioka R, et al. DistIR: An intermediate representation for optimizing distributed neural networks[C]//Proc of the 1st Workshop on Machine Learning and Systems. New York: ACM, 2021: 15−23
    [38]
    Lattner C, Amini M, Bondhugula U, et al. MLIR: Scaling compiler infrastructure for domain specific computation[C]//Proc of 2021 IEEE/ACM Int Symp on Code Generation and Optimization (CGO). Piscataway, NJ: IEEE, 2021: 2−14
    [39]
    Duan Jiangfei, Li Xiuhong, Xu Ping, et al. Proteus: Simulating the performance of distributed DNN training[J]. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(10): 1867−1878 doi: 10.1109/TPDS.2024.3443255
    [40]
    Zhang Shiwei, Yi Xiaodong, Diao Lansong, et al. Expediting distributed DNN training with device topology-aware graph deployment[J]. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(4): 1281−1293 doi: 10.1109/TPDS.2023.3243261
    [41]
    Wang Haoran, Tachon T, Li Chong, et al. SMSG: Profiling-free parallelism modeling for distributed training of DNN[J]. International Journal of Parallel Programming, 2023, 51(2): 109−127
    [42]
    Rashidi S, Sridharan S, Srinivasan S, et al. ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms[C]//Proc of 2020 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2020: 81−92
    [43]
    Samajdar A, Zhu Yuhao, Whatmough P, et al. Scale-sim: Systolic CNN accelerator simulator[J]. arXiv preprint, arXiv: 1811.02883, 2018
    [44]
    Liu Zhigang, Whatmough P N, Mattina M. Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference[J]. IEEE Computer Architecture Letters, 2020, 19(1): 34−37 doi: 10.1109/LCA.2020.2979965
    [45]
    Hagberg A, Swart P J, Schult D A. Exploring network structure, dynamics, and function using NetworkX (No. LA-UR-08-05495; LA-UR-08-5495)[R]. Los Alamos, NM: Los Alamos National Laboratory (LANL), 2008
    [46]
    The SimGrid Team. The SimGrid models[EB/OL]. 2002[2025-01-17]. https://simgrid.frama.io/simgrid/Models.html#cm02
    [47]
    IEIT Systems. meta brain® Artificial Intelligence Servers > AI > Servers > NF5468A5[EB/OL]. [2025-03-01]. https://en.ieisystem.com/product/ai/9573.html
    [48]
    Li Ang, Song S L, Chen Jieyang, et al. Evaluating modern GPU interconnect: PCIe, nvlink, nv-sli, NVSwitch and gpudirect[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 31(1): 94−110
    [49]
    Das Sharma D, Blankenship R, Berger D. An introduction to the compute express link (CXL) Interconnect[J]. ACM Computing Surveys, 2024, 56(11): 1−37
    [50]
    王彦伟,李仁刚,徐冉,等. 基于可重构架构的数据中心异构加速软硬件系统级平台[J]. 计算机研究与发展,2024,61(6):1388−1400 doi: 10.7544/issn1000-1239.202440055

    Wang Yanwei, Li Rengang, Xu Ran, et al. Data center heterogeneous acceleration software-hardware system-level platform based on reconfigurable architecture[J]. Journal of Computer Research and Development, 2024, 61(6): 1388−1400 (in Chinese) doi: 10.7544/issn1000-1239.202440055
    [51]
    葛旭冉,欧洋,王博,等. 大模型推理中的存储优化技术研究综述[J]. 计算机研究与发展,2025,62(3):545−562

    Ge Xuran, Ou Yang, Wang Bo, et al. A survey of storage optimization techniques in large language model inference[J]. Journal of Computer Research and Development, 2025, 62(3): 545−562 (in Chinese)
    [52]
    Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint, arXiv: 2307.09288, 2023
