    Zhao Yulong, Gu Yanqing, Tian Songtao, Wu Chunzhi, Tang Lingtao, Zhang Lufei, Qin Xiaojun, Liu Xin, Chen Zuoning. SW-IntraCC: A Collective Communication Mechanism for Sunway AI Acceleration Card Internals[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550143

    SW-IntraCC: A Collective Communication Mechanism for Sunway AI Acceleration Card Internals

    • The parameter counts of large language models are growing exponentially, placing higher demands on the compute density and communication efficiency of AI acceleration cards and driving the rapid development of new architectures such as single-card multi-core-group, multi-chip, and multi-communication-entity designs. The Sunway AI acceleration card adopts an on-chip Ring bus architecture connecting four core groups, but in large-model training, where data communication volume is high, the traditional Ring collective communication method faces core bottlenecks: the dual limitation of single-core-group memory capacity and transmission bandwidth, low collective communication efficiency, and the inability to overlap communication with computation. This paper proposes the SW-IntraCC (Sunway intra-card collective communication) optimization framework, which applies software-hardware co-design to break through these limitations. First, a three-tier storage architecture is constructed on top of the on-chip high-speed Ring network, expanding the effective memory capacity of a single core group by up to four times and tripling the host-to-accelerator transmission bandwidth. Second, an intra-chip cross-shared communication (CSC) algorithm with interleaved memory access patterns is designed, implementing the CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter) operators optimized for large-model training; benchmark results show that CSC achieves 2.15 times the communication efficiency of conventional collective primitives. Finally, a bidirectional operator fusion strategy is proposed to enable communication-computation overlap, yielding a 59% improvement in communication performance after optimization.
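    To make the two collective operators named in the abstract concrete, the following is a minimal Python sketch of the *semantics* of AllGather and ReduceScatter across four core groups. This models only what the operators compute, not the SW-IntraCC implementation, its interleaved memory access, or the Ring bus transfers; the constant NUM_GROUPS and the list-based data layout are assumptions for illustration.

    ```python
    NUM_GROUPS = 4  # four core groups on the Sunway card's on-chip Ring bus

    def all_gather(shards):
        """AllGather semantics: each group contributes its shard,
        and every group ends up holding the full concatenated tensor."""
        full = [x for shard in shards for x in shard]
        return [list(full) for _ in shards]

    def reduce_scatter(tensors):
        """ReduceScatter semantics: element-wise sum across all groups,
        then each group keeps one contiguous shard of the result."""
        n = len(tensors)
        length = len(tensors[0])
        summed = [sum(t[i] for t in tensors) for i in range(length)]
        shard = length // n
        return [summed[g * shard:(g + 1) * shard] for g in range(n)]
    ```

    Composing the two (ReduceScatter followed by AllGather of the shards) yields AllReduce, which is why these primitives dominate communication cost in data-parallel large-model training.
    
    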
