Abstract:
The number of parameters in large-scale language models is growing exponentially, placing higher demands on the compute density and communication efficiency of accelerator cards and driving the rapid development of new architectures such as single-card multi-core-group, multi-die, and multi-communication-entity designs. The Sunway AI accelerator card adopts a four-core-group on-chip ring-bus architecture, but in large-model training, where data communication volume is high, the conventional Ring collective-communication method faces core bottlenecks: the dual limits of single-core-group memory capacity and transfer bandwidth, low collective-communication efficiency, and the inability to overlap communication with computation. This paper proposes SW-IntraCC (Sunway-Intra Collective Communication), an optimization framework that follows a hardware-software co-design approach to break through these limitations. First, a three-tier storage architecture is constructed on top of the on-chip high-speed ring network, expanding the memory capacity available to a single core group by up to 4 times and tripling the host-to-accelerator transfer bandwidth. Second, an intra-chip Cross Shared Communication (CSC) algorithm is designed with interleaved memory-access patterns, implementing the CSC-AG (CSC-AllGather) and CSC-RS (CSC-ReduceScatter) operators optimized for large-model training; benchmark results demonstrate that CSC achieves 2.15 times higher communication efficiency than conventional collective primitives. Finally, a bidirectional operator-fusion strategy is proposed to enable communication-computation overlap, yielding a 59% improvement in communication performance after optimization.
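To make the interleaved-access idea behind CSC concrete, the following is a minimal sketch, not the SW-IntraCC implementation: it assumes four core groups sharing one on-chip address space and emulates them with threads. All names (csc_allgather, NGROUPS, SHARD) are illustrative. Each group gathers peer shards starting from a different offset, so at any step the four groups read four distinct shards, spreading load across the shared ring bus.

```c
/* Hypothetical sketch of an interleaved AllGather across four core
 * groups; core groups are emulated with POSIX threads, and the
 * shared on-chip memory with ordinary global arrays. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NGROUPS 4   /* core groups on one chip (per the abstract) */
#define SHARD   8   /* floats owned by each core group (arbitrary) */

static float shards[NGROUPS][SHARD];            /* each group's local shard */
static float gathered[NGROUPS][NGROUPS * SHARD];/* per-group AllGather result */

static void *csc_allgather(void *arg) {
    int me = (int)(long)arg;
    /* Interleaved schedule: in step s, group g reads shard (g+s)%4,
     * so no two groups touch the same shard in the same step. */
    for (int s = 0; s < NGROUPS; ++s) {
        int src = (me + s) % NGROUPS;
        memcpy(&gathered[me][src * SHARD], shards[src],
               SHARD * sizeof(float));
    }
    return NULL;
}

int main(void) {
    pthread_t t[NGROUPS];
    for (int g = 0; g < NGROUPS; ++g)
        for (int i = 0; i < SHARD; ++i)
            shards[g][i] = (float)(g * 100 + i);
    for (int g = 0; g < NGROUPS; ++g)
        pthread_create(&t[g], NULL, csc_allgather, (void *)(long)g);
    for (int g = 0; g < NGROUPS; ++g)
        pthread_join(t[g], NULL);
    printf("each group gathered %d floats\n", NGROUPS * SHARD);
    return 0;
}
```

A Ring AllGather would instead forward data hop by hop in NGROUPS-1 steps; the cross-shared variant sketched here exploits the shared address space to read every peer shard directly, which is the property the staggered start offsets turn into contention-free parallel traffic.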