SW-IntraCC: A Collective Communication Mechanism Within the Sunway AI Acceleration Card

赵玉龙; 顾燕卿; 田松涛; 吴春志; 汤凌韬; 张鲁飞; 秦晓军; 刘鑫; 陈左宁

doi:10.7544/issn1000-1239.202550143

SW-IntraCC: A Collective Communication Mechanism for the Internals of Sunway AI Acceleration

Abstract

Abstract: The parameter counts of large language models are growing exponentially, placing higher demands on the compute density and communication efficiency of accelerator cards and driving the rapid development of new architectures that place multiple chiplets, multiple chips, and multiple communication entities on a single card. The Sunway AI acceleration card connects four core groups through an on-chip ring network. In large-model training, however, the heavy data traffic and the traditional intra-card Ring collective communication scheme run into several core bottlenecks: the memory capacity and the transfer bandwidth of a single core group are both limited, intra-card collective communication is inefficient, and communication cannot be overlapped with computation. Following a software-hardware co-design approach, this paper proposes SW-IntraCC (Sunway-intra collective communication), an optimization framework that overcomes these limitations by building on a three-tier storage architecture. First, the three-tier storage architecture is constructed on the on-chip high-speed ring network, expanding the memory capacity available to a single core group to up to four times and increasing host-to-accelerator transfer bandwidth by 2.5 times. Second, an efficient on-chip CSC (cross shared communication) algorithm based on cross-shared memory access is designed, and the typical communication operators for large-model training, CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter), are implemented with it; their communication efficiency is 2.15 times that of the traditional approach. Finally, a communication-computation overlap method based on bidirectional operator fusion is proposed, which improves communication performance by 59% after optimization.
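For readers unfamiliar with the two operators named in the abstract, the following is a minimal sketch of the conventional Ring AllGather and Ring ReduceScatter that serve as the baseline, simulated in plain Python across four ranks standing in for the four core groups. It is not the CSC algorithm, whose details are not given in the abstract; the function names, the list-based buffers, and the chunk layout are assumptions made purely for illustration.

```python
from typing import List

NUM_GROUPS = 4  # the abstract describes four core groups on an on-chip ring


def ring_all_gather(shards: List[List[int]]) -> List[List[int]]:
    """Each rank starts with one shard; after n-1 ring steps every rank holds all shards."""
    n = len(shards)
    # buffers[r][i] is rank r's copy of shard i (None until it arrives).
    buffers = [[None] * n for _ in range(n)]
    for r in range(n):
        buffers[r][r] = list(shards[r])
    for step in range(n - 1):
        # Rank r forwards shard (r - step) mod n to its right-hand neighbour.
        sends = [((r + 1) % n, (r - step) % n, list(buffers[r][(r - step) % n]))
                 for r in range(n)]
        for dst, idx, data in sends:
            buffers[dst][idx] = data
    # Concatenate shards in index order on every rank.
    return [[x for shard in buf for x in shard] for buf in buffers]


def ring_reduce_scatter(inputs: List[List[int]]) -> List[List[int]]:
    """Each rank starts with a full vector; rank r ends with the summed chunk (r + 1) mod n."""
    n = len(inputs)
    chunk = len(inputs[0]) // n  # assumes the vector length is divisible by n
    # work[r][c] is rank r's running partial sum for chunk c.
    work = [[list(v[c * chunk:(c + 1) * chunk]) for c in range(n)] for v in inputs]
    for step in range(n - 1):
        # Rank r sends its partial sum of chunk (r - step) mod n to the right.
        sends = [((r + 1) % n, (r - step) % n, list(work[r][(r - step) % n]))
                 for r in range(n)]
        for dst, idx, data in sends:
            work[dst][idx] = [a + b for a, b in zip(work[dst][idx], data)]
    return [work[r][(r + 1) % n] for r in range(n)]
```

Both loops take n-1 sequential neighbour-to-neighbour steps, which is one way to see why a ring-based scheme can become a bottleneck inside a card; how CSC reorganizes this through shared memory access is described in the full paper rather than in this sketch.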
