Citation: Zhao Yulong, Gu Yanqing, Tian Songtao, Wu Chunzhi, Tang Lingtao, Zhang Lufei, Qin Xiaojun, Liu Xin, Chen Zuoning. SW-IntraCC: A Collective Communication Mechanism for Sunway AI Acceleration Card Internals[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550143
The parameter counts of large language models are growing exponentially, placing higher demands on the compute density and communication efficiency of AI acceleration cards and driving the rapid development of new architectures such as single-card multi-core, multi-die, and multi-communication-entity designs. The Sunway AI acceleration card adopts a four-core-group on-chip Ring bus architecture, but during large-model training, where data communication volumes are large, the traditional Ring collective communication method faces core bottlenecks: the dual limitation of single-core-group memory capacity and transmission bandwidth, low collective communication efficiency, and the inability to overlap communication with computation. This paper proposes SW-IntraCC (Sunway-intra collective communication), an optimization framework that follows a software-hardware co-design approach to break through these limitations via a three-tier storage architecture. First, the three-tier storage architecture is built on the on-chip high-speed Ring network, expanding the memory capacity available to a single core group by up to four times and increasing host-to-accelerator transmission bandwidth by three times. Second, an intra-chip cross-shared communication (CSC) algorithm is designed with interleaved memory access patterns, implementing CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter) operators optimized for large-model training; benchmark results show that CSC achieves 2.15 times the communication efficiency of conventional collective primitives. Finally, a bidirectional operator fusion strategy is proposed to enable communication-computation overlap, yielding a 59% improvement in communication performance after optimization.
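To make the collective semantics named in the abstract concrete, the sketch below models a generic ReduceScatter followed by an AllGather across four core groups using plain Python lists. It illustrates only the standard collective primitives (the operations that CSC-RS and CSC-AG optimize); the chunk layout, buffer contents, and names such as NUM_GROUPS are illustrative assumptions, not the SW-IntraCC implementation.

```python
# Illustrative sketch: ReduceScatter followed by AllGather across four
# core groups. Layout and names are assumptions for illustration only,
# not the SW-IntraCC/CSC implementation.

NUM_GROUPS = 4          # four core groups on one acceleration card
CHUNK = 2               # elements per chunk; each group owns one chunk

# Each core group starts with a full-length local buffer (e.g. local gradients).
buffers = [[(g + 1) * 1.0] * (NUM_GROUPS * CHUNK) for g in range(NUM_GROUPS)]

def reduce_scatter(bufs):
    """Each group ends up with the element-wise sum of its own chunk."""
    out = []
    for g in range(len(bufs)):
        lo, hi = g * CHUNK, (g + 1) * CHUNK
        out.append([sum(b[i] for b in bufs) for i in range(lo, hi)])
    return out

def all_gather(chunks):
    """Every group receives the concatenation of all reduced chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]

reduced = reduce_scatter(buffers)   # per-group partial results
gathered = all_gather(reduced)      # every group holds the full reduced buffer

print(reduced)      # [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0], [10.0, 10.0]]
print(gathered[0])  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```

Chaining the two primitives is equivalent to an AllReduce; the paper's contribution lies in how these steps are mapped onto the four core groups and the three-tier storage hierarchy, which the sketch does not attempt to reproduce.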