Abstract:
As large-scale artificial intelligence applications are widely adopted, demand for AI computing power from both industry and academia keeps growing. Heterogeneous consistency integrated computing systems, which combine heterogeneous computing technology with cache consistency technology, are becoming an important option for building future intelligent computing centers. However, because heterogeneous computing and consistency interconnect technologies are still immature, existing research struggles to model the performance of such systems, which makes it difficult for researchers to evaluate construction schemes, predict computing performance, and assess system optimization methods at low cost. This study proposes HCSim, a performance modeling tool for heterogeneous consistency integrated computing systems that addresses two challenges: modeling system topology and modeling workloads accurately within consistency systems. HCSim gives researchers a flexible, low-cost, and efficient modeling and simulation tool for evaluating interconnect topologies and AI computing tasks. Using HCSim, this study models a heterogeneous consistency integrated computing system with thousands of accelerators and simulates data-parallel distributed training of the LLAMA2-13B large language model on it, exploring how heterogeneous computing power distribution, bandwidth, latency, and task scale affect system performance and task execution efficiency. The study also designs optimization strategies for the communication issues in heterogeneous consistency integrated computing systems and validates their effectiveness using HCSim.
The simulation results show that HCSim meets the performance modeling needs of heterogeneous consistency integrated computing systems and can also be used to evaluate and verify optimization strategies for such systems.