
SW-Inf: Efficient Transformer Inference Method for Sunway AI Accelerator Card

Abstract: Transformer has revolutionized many fields of natural language processing through its self-attention mechanism, which efficiently captures global dependencies and overcomes the limitations of traditional neural networks in handling long sequences; it has become the core architecture of modern large language models. At present, Transformer inference applications generally operate under a shortage of computing power, and their deployment is often hampered by heavy scheduling overhead, difficulty in exploiting concurrency, and pronounced memory access bottlenecks. The Sunway AI accelerator card, built around a domestically developed next-generation intelligent processor, has effectively supported a wide range of edge-side Transformer inference applications; however, conventional inference frameworks struggle to exploit the distinguishing features of the Sunway heterogeneous many-core architecture, namely its comprehensive management capabilities, multi-level parallel computing, and rich memory hierarchy. This paper proposes SW-Inf, an efficient Transformer inference method for the Sunway AI accelerator card. First, a whole-graph sinking scheduling mechanism driven by the on-chip management processing element (MPE) is proposed: the computation graph is packed and scheduled by the MPE on chip, reducing both the host-device kernel launch overhead and the data transfer overhead between on-device stages. Second, an LDM-prioritized heuristic tiling and parallelization strategy is designed to balance load across the compute elements and improve hardware resource utilization. Finally, an architecture-coupled attention computation method is implemented that keeps intermediate inference results from being repeatedly written to and read from main memory and registers, alleviating the memory access bottleneck of edge-side inference. To validate the proposed method, an end-to-end inference system was implemented on the Sunway AI accelerator card. Experimental results show that, compared with the vLLM inference system on an NVIDIA V100S, which has higher memory bandwidth than the Sunway AI accelerator card, SW-Inf improves throughput and main memory utilization by 15% and 20%, respectively.
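To make the first mechanism concrete, below is a minimal Python sketch of whole-graph sinking under stated assumptions: the `Node`, `pack_graph`, and `mpe_run` names, the stand-in kernels, and the dictionary playing the role of device memory are all hypothetical and are not the paper's implementation. It only illustrates that the host serializes and uploads the computation graph once, after which an on-chip management element dispatches every operator locally, so there is no per-operator kernel launch from the host and intermediates never return to the host.

```python
# Illustrative sketch (all names hypothetical): "whole-graph sinking" replaces one
# host-side kernel launch per operator with a single upload of the packed graph,
# after which an on-chip management element (MPE) walks the graph and dispatches
# each node to the compute elements, keeping intermediates in device memory.
from dataclasses import dataclass

@dataclass
class Node:
    op: str          # operator name, e.g. "matmul", "softmax"
    args: tuple      # ids of device-resident input tensors
    out: str         # id of the device-resident output tensor

def pack_graph(nodes):
    """Host side: serialize the whole graph once (topological order assumed)."""
    return tuple(nodes)

def mpe_run(packed, kernels, tensors):
    """Device side: the MPE loops over the packed graph and calls local kernels.
    `tensors` stands in for device memory; nothing is returned to the host early."""
    for node in packed:
        tensors[node.out] = kernels[node.op](*(tensors[a] for a in node.args))
    return tensors

# Tiny demo with stand-in kernels.
kernels = {"add": lambda a, b: a + b, "scale": lambda a: 2 * a}
graph = pack_graph([Node("add", ("x", "y"), "t"), Node("scale", ("t",), "z")])
mem = mpe_run(graph, kernels, {"x": 1, "y": 2})
assert mem["z"] == 6
```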
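The LDM-prioritized tiling strategy can be read as a capacity-driven heuristic: choose the largest tile whose per-element working set fits in local data memory (LDM), then spread the output tiles evenly across the compute elements. The sketch below only illustrates that idea; the LDM capacity, the number of compute elements, the fp16 element size, and the `pick_tile`/`partition` helpers are assumptions for illustration, not the accelerator's actual parameters or the paper's heuristic.

```python
# Illustrative assumptions, not hardware specifications.
LDM_BYTES = 256 * 1024   # assumed per-element LDM capacity
NUM_CPES = 64            # assumed number of compute elements
ELEM_SIZE = 2            # bytes per value, e.g. fp16

def pick_tile(K, ldm_bytes=LDM_BYTES, elem=ELEM_SIZE):
    """Largest power-of-two tile T such that a TxK A-tile, a KxT B-tile and a
    TxT C-tile fit together in one compute element's LDM."""
    tile = 256
    while tile > 8 and (2 * tile * K + tile * tile) * elem > ldm_bytes:
        tile //= 2
    return tile

def partition(M, N, tile, num_cpes=NUM_CPES):
    """Assign output tiles round-robin so every compute element gets ~equal work."""
    tiles = [(i, j) for i in range(0, M, tile) for j in range(0, N, tile)]
    return {cpe: tiles[cpe::num_cpes] for cpe in range(num_cpes)}

if __name__ == "__main__":
    tile = pick_tile(K=128)                      # e.g. K = attention head dimension
    plan = partition(M=4096, N=4096, tile=tile)
    print(f"tile = {tile}x{tile}, tiles for element 0: {len(plan[0])}")
```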
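The attention method's general flavor is that of tiled attention with an online softmax: score tiles and the running softmax statistics stay in fast on-chip memory, and only the final output for each query block is written back. The numpy sketch below shows just that numerical pattern under an assumed block size; it is a generic illustration, not the paper's register- and LDM-coupled Sunway kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention computed block by block with an online softmax, so the full
    n x n score matrix is never materialized in main memory."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, n, block):
        q = Q[qs:qs + block] * scale
        m = np.full(q.shape[0], -np.inf)   # running row maxima
        l = np.zeros(q.shape[0])           # running softmax denominators
        acc = np.zeros((q.shape[0], d))    # running weighted sums of V rows
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T                 # score tile (kept on chip)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m - m_new)                   # rescale earlier partial sums
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Check against the naive formulation on random data.
rng = np.random.default_rng(0)
Q = rng.standard_normal((256, 64))
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
s = Q @ K.T / np.sqrt(64)
w = np.exp(s - s.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```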

       
