SW-Inf: Efficient Transformer Inference Method for Sunway AI Accelerator Card
Abstract
The Transformer has fundamentally revolutionized natural language processing through its self-attention mechanism, which effectively models long-range contextual relationships. It overcomes traditional neural networks' limitations in handling long sequences and has become the mainstream architecture for modern large language models. Transformer inference systems, however, generally face a shortage of computing power and are commonly confronted during deployment with challenges such as substantial scheduling latency, difficulties with concurrent execution, and severe memory access limitations. The Sunway AI accelerator card, featuring a domestically developed next-generation intelligent processor, has successfully empowered diverse edge inference deployments. Yet conventional inference frameworks struggle to effectively exploit the architecture's potential in three critical dimensions: its robust system control capabilities, multi-level parallel computing, and rich memory hierarchy. This paper proposes SW-Inf, a high-efficiency Transformer inference method for the Sunway AI accelerator card. First, an on-chip management processing element (MPE)-orchestrated computation graph scheduling mechanism is proposed to minimize kernel launch latency and inter-process data transfer overhead through graph aggregation and MPE coordination. Second, an LDM (local data memory)-prioritized heuristic tiling strategy is introduced to enable load-balanced parallel computation across processing elements. Finally, a hardware-aware, tightly coupled attention computation implementation is developed that eliminates redundant intermediate data movement between registers and main memory. We implemented an end-to-end inference system on the Sunway AI accelerator card. Experimental results demonstrate that, compared with the vLLM inference system on an NVIDIA V100S, which has higher memory bandwidth than the Sunway AI accelerator card, SW-Inf achieves 15% higher throughput and 20% higher memory utilization.
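To make the LDM-prioritized tiling idea concrete, the following C sketch shows one plausible form such a heuristic could take; it is not the paper's implementation. The LDM capacity (256 KB per CPE), the CPE count (64 per core group), the fp16/fp32 operand sizes, and the starting tile shape are all illustrative assumptions: the heuristic shrinks a GEMM tile until its working set fits in LDM, then reports how the tile grid distributes across CPEs for load balance.

```c
/* Minimal sketch of an LDM-prioritized tiling heuristic.
 * Assumptions (not from the paper): 256 KB of LDM per CPE, 64 CPEs per
 * core group, fp16 inputs with fp32 accumulation, and a square starting
 * tile that is halved along one dimension until the working set fits. */
#include <stdio.h>
#include <stddef.h>

#define LDM_BYTES (256 * 1024)   /* assumed LDM capacity per CPE */
#define NUM_CPES  64             /* assumed CPEs per core group  */
#define ELEM_IN   2              /* fp16 input operand, bytes    */
#define ELEM_ACC  4              /* fp32 accumulator, bytes      */

/* Working set of one GEMM tile: A(tm x tk) + B(tk x tn) + C(tm x tn). */
static size_t tile_bytes(size_t tm, size_t tn, size_t tk)
{
    return tm * tk * ELEM_IN + tk * tn * ELEM_IN + tm * tn * ELEM_ACC;
}

/* Shrink the tile until it fits in LDM, preferring to cut the K dimension
 * first (it only affects reuse, not the output partitioning), then report
 * how the resulting tile grid maps onto the CPE array. */
static void choose_tiles(size_t M, size_t N, size_t K)
{
    size_t tm = 128, tn = 128, tk = 128;          /* optimistic start  */
    while (tile_bytes(tm, tn, tk) > LDM_BYTES) {
        if (tk >= tm && tk >= tn)      tk /= 2;
        else if (tm >= tn)             tm /= 2;
        else                           tn /= 2;
    }
    size_t tiles_m = (M + tm - 1) / tm;
    size_t tiles_n = (N + tn - 1) / tn;
    size_t total   = tiles_m * tiles_n;
    printf("tile %zux%zux%zu, %zu tiles, ~%zu tiles per CPE\n",
           tm, tn, tk, total, (total + NUM_CPES - 1) / NUM_CPES);
}

int main(void)
{
    choose_tiles(4096, 4096, 4096);   /* example GEMM shape */
    return 0;
}
```

A real implementation would additionally weigh double-buffering space for DMA transfers and the arithmetic intensity of each candidate tile; the sketch only captures the LDM-capacity constraint and the even distribution of tiles over processing elements.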