
    SW-IntraCC: A Collective Communication Mechanism for Sunway AI Acceleration Card Internals

Abstract: The parameter counts of large language models are growing exponentially, which places higher demands on the compute density and communication efficiency of AI acceleration cards and drives the rapid development of new architectures such as single-card multi-chiplet, multi-chip, and multi-communication-entity designs. The Sunway AI acceleration card adopts an on-chip ring network connecting four core groups, but in large-model training, where communication volume is large, the traditional intra-card Ring collective communication scheme faces several core bottlenecks: the dual limitation of single-core-group memory capacity and transmission bandwidth, low intra-card collective communication efficiency, and the inability to overlap communication with computation. Following a hardware-software co-design approach, this paper proposes SW-IntraCC (Sunway-Intra Collective Communication), an optimization framework that breaks through these limitations with a three-tier storage architecture. First, the three-tier storage architecture is built on the high-speed on-chip ring network, expanding the memory capacity available to a single core group by up to 4 times and raising host-to-accelerator transmission bandwidth by 3 times. Second, an efficient intra-chip CSC (cross shared communication) algorithm based on cross-shared memory access is designed, implementing CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter), the collective operators typical of large-model training, and achieving 2.15 times the communication efficiency of the traditional approach. Finally, a bidirectional operator fusion method is proposed to overlap communication with computation, improving communication performance by 59% after optimization.
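As a point of reference for the two collective operators named above, the following sketch simulates their semantics across four core groups with NumPy. It is only an illustration of what an AllGather (CSC-AG) and a ReduceScatter (CSC-RS) must compute, not the cross-shared-memory implementation on the Sunway card; the names (NUM_CORE_GROUPS, all_gather, reduce_scatter) and sizes are illustrative assumptions.

import numpy as np

NUM_CORE_GROUPS = 4   # the card's four core groups on the on-chip ring network (assumed name)
SHARD_LEN = 8         # illustrative shard size

# Each core group starts with its own shard (e.g. a slice of parameters or gradients).
shards = [np.full(SHARD_LEN, g, dtype=np.float32) for g in range(NUM_CORE_GROUPS)]

def all_gather(local_shards):
    # AllGather semantics: every core group ends up with the concatenation of all shards.
    gathered = np.concatenate(local_shards)
    return [gathered.copy() for _ in local_shards]

def reduce_scatter(full_buffers):
    # ReduceScatter semantics: element-wise sum across core groups,
    # after which each core group keeps only its own slice of the result.
    total = np.sum(full_buffers, axis=0)
    return [s.copy() for s in np.array_split(total, len(full_buffers))]

# AllGather is typically used to reassemble sharded parameters before a layer's computation.
full_copies = all_gather(shards)
assert all(np.array_equal(c, full_copies[0]) for c in full_copies)

# ReduceScatter is typically used to sum gradients while leaving each core group its shard.
grads = [np.ones(SHARD_LEN * NUM_CORE_GROUPS, dtype=np.float32) for _ in range(NUM_CORE_GROUPS)]
grad_shards = reduce_scatter(grads)
assert all(np.allclose(s, NUM_CORE_GROUPS) for s in grad_shards)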

       
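The communication-computation overlap mentioned in the last contribution can also be illustrated generically: split the data into chunks and start transferring the next chunk while computing on the current one. The sketch below uses Python threads with sleep-based stand-ins for transfer and compute costs; it is a minimal illustration of overlap in general, not the paper's bidirectional operator fusion, and all names and timings are assumptions.

import threading
import time

CHUNKS = 4
COMM_COST = 0.05      # stand-in for transferring one chunk over the on-chip ring
COMPUTE_COST = 0.05   # stand-in for the computation that consumes one chunk

def communicate(chunk_id):
    time.sleep(COMM_COST)

def compute(chunk_id):
    time.sleep(COMPUTE_COST)

def serial():
    # Baseline: each chunk is first transferred, then computed on.
    for c in range(CHUNKS):
        communicate(c)
        compute(c)

def overlapped():
    # Prefetch chunk 0, then transfer chunk c+1 while computing on chunk c.
    inflight = threading.Thread(target=communicate, args=(0,))
    inflight.start()
    for c in range(CHUNKS):
        inflight.join()                     # wait until chunk c has arrived
        if c + 1 < CHUNKS:
            inflight = threading.Thread(target=communicate, args=(c + 1,))
            inflight.start()                # next transfer runs in the background
        compute(c)                          # overlaps with the in-flight transfer

for name, fn in (("serial", serial), ("overlapped", overlapped)):
    start = time.time()
    fn()
    print(f"{name}: {time.time() - start:.2f}s")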
