SW-Inf: Efficient Transformer Inference Method for Sunway AI Accelerator Card
Abstract
The Transformer has fundamentally revolutionized natural language processing through its self-attention mechanism, which effectively models long-range contextual relationships. It overcomes traditional neural networks' limitations in handling long sequences and has become the mainstream architecture for modern large language models. Transformer inference systems, however, generally face a shortage of computing power and are commonly confronted during deployment with challenges such as substantial scheduling latency, difficulties with concurrent execution, and severe memory access limitations. The Sunway AI accelerator card, featuring a domestically developed next-generation intelligent processor, has successfully empowered diverse edge inference deployments. Yet conventional inference frameworks struggle to effectively exploit the architecture's potential in three critical dimensions: its robust system control capabilities, multi-level parallel computing, and rich memory hierarchy. This paper proposes SW-Inf, a high-efficiency Transformer inference method for the Sunway AI accelerator card. First, an on-chip management processing element (MPE)-orchestrated computation graph scheduling mechanism is proposed to minimize kernel launch latency and inter-process data transfer overhead through graph aggregation and MPE coordination. Second, an LDM (local data memory)-prioritized heuristic tiling strategy is introduced to enable load-balanced parallel computation across processing elements. Finally, a hardware-aware, tightly coupled attention computation implementation is developed that eliminates redundant intermediate data movement between registers and main memory. We implemented an end-to-end inference system on the Sunway AI accelerator card. Experimental results demonstrate that, compared with the vLLM inference system on an NVIDIA V100S, which has higher memory bandwidth than the Sunway AI accelerator card, SW-Inf achieves 15% higher throughput and 20% higher memory utilization.
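To make the LDM-prioritized tiling idea concrete, the following C sketch shows one plausible form such a heuristic could take; it is not the paper's implementation. The LDM capacity (256 KB per CPE), the CPE count (64 per core group), the fp16/fp32 operand sizes, and the starting tile shape are all illustrative assumptions: the heuristic shrinks a GEMM tile until its working set fits in LDM, then reports how the tile grid distributes across CPEs for load balance.

```c
/* Minimal sketch of an LDM-prioritized tiling heuristic.
 * Assumptions (not from the paper): 256 KB of LDM per CPE, 64 CPEs per
 * core group, fp16 inputs with fp32 accumulation, and a square starting
 * tile that is halved along one dimension until the working set fits. */
#include <stdio.h>
#include <stddef.h>

#define LDM_BYTES (256 * 1024)   /* assumed LDM capacity per CPE */
#define NUM_CPES  64             /* assumed CPEs per core group  */
#define ELEM_IN   2              /* fp16 input operand, bytes    */
#define ELEM_ACC  4              /* fp32 accumulator, bytes      */

/* Working set of one GEMM tile: A(tm x tk) + B(tk x tn) + C(tm x tn). */
static size_t tile_bytes(size_t tm, size_t tn, size_t tk)
{
    return tm * tk * ELEM_IN + tk * tn * ELEM_IN + tm * tn * ELEM_ACC;
}

/* Shrink the tile until it fits in LDM, preferring to cut the K dimension
 * first (it only affects reuse, not the output partitioning), then report
 * how the resulting tile grid maps onto the CPE array. */
static void choose_tiles(size_t M, size_t N, size_t K)
{
    size_t tm = 128, tn = 128, tk = 128;          /* optimistic start  */
    while (tile_bytes(tm, tn, tk) > LDM_BYTES) {
        if (tk >= tm && tk >= tn)      tk /= 2;
        else if (tm >= tn)             tm /= 2;
        else                           tn /= 2;
    }
    size_t tiles_m = (M + tm - 1) / tm;
    size_t tiles_n = (N + tn - 1) / tn;
    size_t total   = tiles_m * tiles_n;
    printf("tile %zux%zux%zu, %zu tiles, ~%zu tiles per CPE\n",
           tm, tn, tk, total, (total + NUM_CPES - 1) / NUM_CPES);
}

int main(void)
{
    choose_tiles(4096, 4096, 4096);   /* example GEMM shape */
    return 0;
}
```

A real implementation would additionally weigh double-buffering space for DMA transfers and the arithmetic intensity of each candidate tile; the sketch only captures the LDM-capacity constraint and the even distribution of tiles over processing elements.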