Citation: Zhao Yulong, Gu Yanqing, Tian Songtao, Wu Chunzhi, Tang Lingtao, Zhang Lufei, Qin Xiaojun, Liu Xin, Chen Zuoning. SW-IntraCC: A Collective Communication Mechanism for Sunway AI Acceleration Card Internals[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550143
The parameter counts of large language models are growing exponentially, placing higher demands on the compute density and communication efficiency of AI acceleration cards and driving the rapid development of new architectures such as single-card multi-core, multi-die, and multi-communication-entity designs. The Sunway AI acceleration card adopts a four-core-group on-chip Ring bus architecture, but during large-model training, where data communication volumes are large, the traditional Ring collective communication method faces core bottlenecks: the dual limitation of single-core-group memory capacity and transmission bandwidth, low collective communication efficiency, and the inability to overlap communication with computation. This paper proposes SW-IntraCC (Sunway-intra collective communication), an optimization framework that follows a software-hardware co-design approach to break through these limitations via a three-tier storage architecture. First, the three-tier storage architecture is built on the on-chip high-speed Ring network, expanding the memory capacity available to a single core group by up to four times and increasing host-to-accelerator transmission bandwidth by three times. Second, an intra-chip cross-shared communication (CSC) algorithm is designed with interleaved memory access patterns, implementing CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter) operators optimized for large-model training; benchmark results show that CSC achieves 2.15 times the communication efficiency of conventional collective primitives. Finally, a bidirectional operator fusion strategy is proposed to enable communication-computation overlap, yielding a 59% improvement in communication performance after optimization.
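To make the collective semantics named in the abstract concrete, the sketch below models a generic ReduceScatter followed by an AllGather across four core groups using plain Python lists. It illustrates only the standard collective primitives (the operations that CSC-RS and CSC-AG optimize); the chunk layout, buffer contents, and names such as NUM_GROUPS are illustrative assumptions, not the SW-IntraCC implementation.

```python
# Illustrative sketch: ReduceScatter followed by AllGather across four
# core groups. Layout and names are assumptions for illustration only,
# not the SW-IntraCC/CSC implementation.

NUM_GROUPS = 4          # four core groups on one acceleration card
CHUNK = 2               # elements per chunk; each group owns one chunk

# Each core group starts with a full-length local buffer (e.g. local gradients).
buffers = [[(g + 1) * 1.0] * (NUM_GROUPS * CHUNK) for g in range(NUM_GROUPS)]

def reduce_scatter(bufs):
    """Each group ends up with the element-wise sum of its own chunk."""
    out = []
    for g in range(len(bufs)):
        lo, hi = g * CHUNK, (g + 1) * CHUNK
        out.append([sum(b[i] for b in bufs) for i in range(lo, hi)])
    return out

def all_gather(chunks):
    """Every group receives the concatenation of all reduced chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]

reduced = reduce_scatter(buffers)   # per-group partial results
gathered = all_gather(reduced)      # every group holds the full reduced buffer

print(reduced)      # [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0], [10.0, 10.0]]
print(gathered[0])  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```

Chaining the two primitives is equivalent to an AllReduce; the paper's contribution lies in how these steps are mapped onto the four core groups and the three-tier storage hierarchy, which the sketch does not attempt to reproduce.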