Citation: Li Rengang, Tang Yinan, Guo Zhenhua, Wang Li, Zong Zan, Yang Guangwen. Performance Modeling and Optimization for Large-Scale Heterogeneous Consistency Integrated Computing System[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550120

Performance Modeling and Optimization for Large-Scale Heterogeneous Consistency Integrated Computing System

More Information
  • Author Bio:

    Li Rengang: born in 1980. PhD candidate, senior engineer. His main research interest is heterogeneous computing

    Tang Yinan: born in 1993. PhD. His main research interests include heterogeneous computing and simulation systems

    Guo Zhenhua: born in 1988. PhD. His main research interests include computer system architecture and heterogeneous computing

    Wang Li: born in 1989. Master. Her main research interests include heterogeneous computing and artificial intelligence

    Zong Zan: born in 1994. PhD. His main research interests include performance acceleration of distributed deep learning systems and large-scale data processing systems

    Yang Guangwen: born in 1963. PhD, professor, PhD supervisor. His main research interest is high performance computing

  • Received Date: February 28, 2025
  • Revised Date: April 07, 2025
  • Available Online: April 17, 2025
  • Abstract: With the widespread adoption of large-scale artificial intelligence applications, the demand for computing power from both industry and academia keeps growing. Heterogeneous consistency integrated computing systems, which combine heterogeneous computing with cache-consistency interconnect technology, are becoming an important approach to building future intelligent computing centers. However, because heterogeneous computing and consistency interconnect technologies are still maturing, existing research struggles to model the performance of such systems, making it difficult for researchers to evaluate construction schemes, predict computing performance, and assess system optimization methods at low cost. We propose HCSim, a performance modeling tool for heterogeneous consistency integrated computing systems that addresses the challenges of modeling system topology and the inaccuracy of workload modeling in consistency systems. HCSim provides researchers with a flexible, low-cost, and efficient modeling and simulation tool for evaluating interconnect topologies and AI computing tasks. Using HCSim, we model a heterogeneous consistency integrated computing system with thousands of accelerators and simulate data-parallel distributed training of the LLAMA2-13B large language model (LLM) on this system, exploring how heterogeneous computing power distribution, bandwidth, latency, and task scale affect system performance and task execution efficiency. We further design optimization strategies for the communication issues in heterogeneous consistency integrated computing systems and validate their effectiveness using HCSim. The simulation results show that HCSim not only meets the performance modeling needs of heterogeneous consistency integrated computing systems, but can also be applied to evaluate and verify optimization strategies for such systems.
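The paper's modeling details are not reproduced on this page, but the topology-level view described in the abstract (accelerators connected by links with given bandwidth and latency, exercised by data-parallel communication) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the HCSim implementation: it builds a ring interconnect as a NetworkX graph (the paper cites NetworkX [45]) and applies a simple alpha-beta cost model to estimate the per-step ring all-reduce time for the gradients of a 13B-parameter model. The function names, link parameters, and cost formula are all hypothetical.

```python
# Illustrative sketch only -- NOT the HCSim implementation described in the paper.
# Models an interconnect topology as a graph and applies an alpha-beta cost model
# to a ring all-reduce, the dominant communication pattern in data-parallel training.
# All parameter values below are assumptions.
import networkx as nx

def build_ring_topology(num_accel, bandwidth_gbps, latency_us):
    """Connect accelerators in a ring; each edge carries bandwidth/latency attributes."""
    g = nx.Graph()
    for i in range(num_accel):
        j = (i + 1) % num_accel
        g.add_edge(i, j,
                   bandwidth=bandwidth_gbps * 1e9 / 8,  # bytes per second
                   latency=latency_us * 1e-6)           # seconds
    return g

def ring_allreduce_time(g, message_bytes):
    """Alpha-beta estimate: 2(p-1) steps, each moving message_bytes/p over the slowest link."""
    p = g.number_of_nodes()
    worst_bw = min(d["bandwidth"] for _, _, d in g.edges(data=True))
    worst_lat = max(d["latency"] for _, _, d in g.edges(data=True))
    chunk = message_bytes / p
    steps = 2 * (p - 1)
    return steps * (worst_lat + chunk / worst_bw)

if __name__ == "__main__":
    # Hypothetical numbers: 1024 accelerators, 100 Gb/s links, 2 us link latency,
    # ~26 GB of fp16 gradients (13B parameters) exchanged per data-parallel step.
    topo = build_ring_topology(1024, bandwidth_gbps=100, latency_us=2)
    t = ring_allreduce_time(topo, 13e9 * 2)
    print(f"estimated all-reduce time per step: {t:.2f} s")
```

Such a graph-based cost model only sketches the style of analysis; a full simulator like the one the paper describes would additionally model heterogeneous compute rates, consistency traffic, and overlapping of computation with communication.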

  • [1]
    Guo Daya, Yang Dejian, Zhang Haowei, et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint, arXiv: 2501.12948, 2025
    [2]
    Huang Dawei, Yan Chuan, Li Qing, et al. From large language models to large multimodal models: A literature review[J]. Applied Sciences, 2024, 14(12): 5068 doi: 10.3390/app14125068
    [3]
    Jiang Ziheng, Lin Haibin, Zhong Yinmin, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation (NSDI 24). Berkeley, CA: USENIX Association, 2024: 745−760
    [4]
    Yang Zhuoping, Ji Shixin, Chen Xingzhen, et al. Challenges and opportunities to enable large-scale computing via heterogeneous chiplets[C]//Proc of the 29th Asia and South Pacific Design Automation Conf (ASP-DAC). Piscataway, NJ: IEEE, 2024: 765−770
    [5]
    Saghiri A M, Vahidipour S M, Jabbarpour M R, et al. A survey of artificial intelligence challenges: Analyzing the definitions, relationships, and evolutions[J]. Applied Sciences, 2022, 12(8): 4054 doi: 10.3390/app12084054
    [6]
    Rajbhandari S, Ruwase O, Rasley J, et al. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis 2021. New York: ACM, 2021: 1−14
    [7]
    Korthikanti V A, Casper J, Lym S, et al. Reducing activation recomputation in large transformer models[C]//Proc of Machine Learning and Systems 5, 2023: 341−353. https://proceedings.mlsys.org/paper_files/paper/2023/hash/80083951326cf5b35e5100260d64ed81-Abstract-mlsys2023.html
    [8]
    Li Shen, Zhao Yanli, Varma R, et al. PyTorch distributed: Experiences on accelerating data parallel training[J]. arXiv preprint, arXiv: 2006.15704, 2020
    [9]
    Ge Suyu, Zhang Yunan, Liu Liyuan, et al. Model tells you what to discard: Adaptive KV cache compression for LLMs[J]. arXiv preprint, arXiv: 2310.01801, 2023
    [10]
    Chen Zixiang, Deng Yihe, Wu Yue, et al. Towards understanding mixture of experts in deep learning[J]. arXiv preprint, arXiv: 2208.02813, 2022
    [11]
    Xu Yi, Mahar S, Liu Ziheng, et al. CXL shared memory programming: Barely distributed and almost persistent[J]. arXiv preprint, arXiv: 2405.19626, 2024
    [12]
    Schieffer G, Wahlgren J, Ren Jie, et al. Harnessing integrated CPU-GPU system memory for HPC: A first look into Grace Hopper[C]//Proc of the 53rd Int Conf on Parallel Processing. New York: ACM, 2024: 199−209
    [13]
    Xia Jing, Cheng Chuanning, Zhou Xiping, et al. Kunpeng 920: The first 7-nm chiplet-based 64-core ARM SoC for cloud services[J]. IEEE Micro, 2021, 41(5): 67−75 doi: 10.1109/MM.2021.3085578
    [14]
    Fusco L, Khalilov M, Chrapek M, et al. Understanding data movement in tightly coupled heterogeneous systems: A case study with the Grace Hopper superchip[J]. arXiv preprint, arXiv: 2408.11556, 2024
    [15]
    CXL Organization. CXL® Specification[EB/OL]. 2020[2024-12-01]. https://computeexpresslink.org/cxl-specification/.
    [16]
    Gholami A, Yao Zhewei, Kim S, et al. AI and memory wall[J]. IEEE Micro, 2024, 44(3): 33−39
    [17]
    Casanova H. SimGrid: A toolkit for the simulation of application scheduling[C]//Proc of the 1st IEEE/ACM Int Symp on Cluster Computing and the Grid. Piscataway, NJ: IEEE, 2001: 430−437
    [18]
    Casanova H, Giersch A, Legrand A, et al. Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid[J]. Parallel Computing, 2025: 103125. https://www.sciencedirect.com/science/article/pii/S0167819125000018
    [19]
    Saleh E, Shastry C. Simulation and modelling of task migration in distributed systems using SimGrid[C]//Proc of the Int Conf on Modeling, Simulation and Optimization. Singapore: Springer Nature Singapore, 2022: 475−486
    [20]
    Guo Zhenhua, Tang Yinan, Zhai Jidong, et al. A survey on performance modeling and prediction for distributed DNN training[J]. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(12): 2463−2478
    [21]
    Lowe-Power J, Ahmad A M, Akram A, et al. The Gem5 simulator: Version 20.0+[J]. arXiv preprint, arXiv: 2007.03152, 2020
    [22]
    Bellard F. QEMU, a fast and portable dynamic translator[C]//Proc of the 2005 USENIX Annual Technical Conf, FREENIX Track. Berkeley, CA: USENIX Association, 2005: 41−46
    [23]
    Bakhoda A, Yuan G L, Fung W, et al. Analyzing CUDA workloads using a detailed GPU simulator[C]//Proc of 2009 IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2009: 163−174
    [24]
    Li Shang, Yang Zhiyuan, Reddy D, et al. DRAMSim3: A cycle-accurate, thermal-capable DRAM simulator[J]. IEEE Computer Architecture Letters, 2020, 19(2): 106−109 doi: 10.1109/LCA.2020.2973991
    [25]
    Kim Y, Yang Weikun, Mutlu O. Ramulator: A fast and extensible DRAM simulator[J]. IEEE Computer Architecture Letters, 2015, 15(1): 45−49
    [26]
    Henderson T R, Lacage M, Riley G F, et al. Network simulations with the ns-3 simulator[C]//Proc of SIGCOMM Demonstration 2008. New York: ACM, 2008: 527
    [27]
    Varga A. OMNeT++[J]. Modeling and Tools for Network Simulation, 2010: 35−59
    [28]
    Khairy M, Shen Zhesheng, Aamodt T M, et al. Accel-Sim: An extensible simulation framework for validated GPU modeling[C]//Proc of 2020 ACM/IEEE 47th Annual Int Symp on Computer Architecture (ISCA). Piscataway, NJ: IEEE, 2020: 473−486
    [29]
    Won W, Heo T, Rashidi S, et al. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale[C]//Proc of 2023 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2023: 283−294
    [30]
    Moolchandani D, Kundu J, Ruelens F, et al. AMPeD: An analytical model for performance in distributed training of transformers[C]//Proc of 2023 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2023: 306−315
    [31]
    Isaev M, McDonald N, Dennison L, et al. Calculon: A methodology and tool for high-level co-design of systems and large language models[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2023: 1−14
    [32]
    Qi Hang, Sparks E R, Talwalkar A. Paleo: A performance model for deep neural networks[C]//Proc of Int Conf on Learning Representations. Toulon, France 2017: 1−10
    [33]
    Lu Wenyan, Yan Guihai, Li Jiajun, et al. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks[C]//Proc of 2017 IEEE Int Symp on High Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2017: 553−564
    [34]
    Zhu Hongyu, Phanishayee A, Pekhimenko G. Daydream: Accurately estimating the efficacy of optimizations for DNN training[C]//Proc of 2020 USENIX Annual Technical Conf (USENIX ATC 20) . Berkeley, CA: USENIX Association, 2020: 337−352
    [35]
    Hu Hanpeng, Jiang Chenyu, Zhong Yuchen, et al. dPRO: A generic performance diagnosis and optimization toolkit for expediting distributed DNN training[C]//Proc of Machine Learning and Systems. New York: ACM, 2022: 623−637
    [36]
    Lu Guandong, Chen Runzhe, Wang Yakai, et al. DistSim: A performance model of large-scale hybrid distributed DNN training[C]//Proc of the 20th ACM Int Conf on Computing Frontiers. New York: ACM, 2023: 112−122
    [37]
    Santhanam K, Krishna S, Tomioka R, et al. DistIR: An intermediate representation for optimizing distributed neural networks[C]//Proc of the 1st Workshop on Machine Learning and Systems. New York: ACM, 2021: 15−23
    [38]
    Lattner C, Amini M, Bondhugula U, et al. MLIR: Scaling compiler infrastructure for domain specific computation[C]//Proc of 2021 IEEE/ACM Int Symp on Code Generation and Optimization (CGO). Piscataway, NJ: IEEE, 2021: 2−14
    [39]
    Duan Jiangfei, Li Xiuhong, Xu Ping, et al. Proteus: Simulating the performance of distributed DNN training[J]. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(10): 1867−1878 doi: 10.1109/TPDS.2024.3443255
    [40]
    Zhang Shiwei, Yi Xiaodong, Diao Lansong, et al. Expediting distributed DNN training with device topology-aware graph deployment[J]. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(4): 1281−1293 doi: 10.1109/TPDS.2023.3243261
    [41]
    Wang Haoran, Tachon T, Li Chong, et al. SMSG: Profiling-free parallelism modeling for distributed training of DNN[J]. International Journal of Parallel Programming, 2023, 51(2): 109−127
    [42]
    Rashidi S, Sridharan S, Srinivasan S, et al. ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms[C]//Proc of 2020 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2020: 81−92
    [43]
    Samajdar A, Zhu Yuhao, Whatmough P, et al. Scale-sim: Systolic CNN accelerator simulator[J]. arXiv preprint, arXiv: 1811.02883, 2018
    [44]
    Liu Zhigang, Whatmough P N, Mattina M. Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference[J]. IEEE Computer Architecture Letters, 2020, 19(1): 34−37 doi: 10.1109/LCA.2020.2979965
    [45]
    Hagberg A, Swart P J, Schult D A. Exploring network structure, dynamics, and function using NetworkX (No. LA-UR-08-05495; LA-UR-08-5495)[R]. Los Alamos, NM: Los Alamos National Laboratory (LANL), 2008
    [46]
    The SimGrid Team. The SimGrid models[EB/OL]. 2002[2025-01-17]. https://simgrid.frama.io/simgrid/Models.html#cm02
    [47]
    IEIT Systems. meta brain® Artificial Intelligence Servers > AI > Servers > NF5468A5[EB/OL]. [2025-03-01]. https://en.ieisystem.com/product/ai/9573.html
    [48]
    Li Ang, Song S L, Chen Jieyang, et al. Evaluating modern GPU interconnect: PCIe, nvlink, nv-sli, NVSwitch and gpudirect[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 31(1): 94−110
    [49]
    Das Sharma D, Blankenship R, Berger D. An introduction to the compute express link (CXL) Interconnect[J]. ACM Computing Surveys, 2024, 56(11): 1−37
    [50]
    王彦伟,李仁刚,徐冉,等. 基于可重构架构的数据中心异构加速软硬件系统级平台[J]. 计算机研究与发展,2024,61(6):1388−1400 doi: 10.7544/issn1000-1239.202440055

    Wang Yanwei, Li Rengang, Xu Ran, et al. Data center heterogeneous acceleration software-hardware system-level platform based on reconfigurable architecture[J]. Journal of Computer Research and Development, 2024, 61(6): 1388−1400 (in Chinese) doi: 10.7544/issn1000-1239.202440055
    [51]
    葛旭冉,欧洋,王博,等. 大模型推理中的存储优化技术研究综述[J]. 计算机研究与发展,2025,62(3):545−562

    Ge Xuran, Ou Yang, Wang Bo, et al. A survey of storage optimization techniques in large language model inference[J]. Journal of Computer Research and Development, 2025, 62(3): 545−562 (in Chinese)
    [52]
    Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint, arXiv: 2307.09288, 2023
