PetS: A Scalable Inference Serving System for Parameter-Efficient Transformer

魏学超; 周哲; 徐盈辉; 张洁靖; 谢源; 孙广宇

doi:10.7544/issn1000-1239.202440206


Abstract: Serving Transformer models under the conventional pre-train-then-fine-tune paradigm is difficult in multi-task settings, because the server must keep a full model copy for every downstream task, which quickly exhausts storage and GPU memory. Recent parameter-efficient Transformer (PET) algorithms mitigate this overhead by sharing the pre-trained model among tasks and fine-tuning only a small portion of task-specific parameters. However, existing serving systems neither offer flexible PET task management nor batch queries across different tasks efficiently, and in multi-GPU distributed deployments they also struggle to balance load across downstream tasks. We therefore propose PetS, a scalable framework for multi-task PET inference serving. Specifically, different PET tasks are abstracted into a unified representation, which enables flexible PET task management. Based on this representation, we design a specialized PET inference engine that batches queries from different tasks and executes them with task-agnostic shared operators and task-specific PET operators. With this engine, PetS supports more tasks on a single GPU device. To further improve system throughput, we propose a coordinated batching (CB) strategy that jointly considers query length, PET task type, and system load balancing. To improve load balancing in multi-GPU deployments, we also propose a load-balancing mechanism based on live PET migration. We evaluate PetS on edge, desktop, and server GPU platforms. Experiments show that PetS supports up to 26 times more concurrent tasks and improves serving throughput by 1.53 times and 1.63 times on desktop and server GPU nodes, respectively. In multi-GPU scenarios, the load-balancing strategy further improves throughput by up to 29%.
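
To make the split between task-agnostic shared operators and task-specific PET operators more concrete, the following is a minimal sketch and not the PetS implementation. It assumes each PET task can be written as the shared weight plus a small task-specific correction. The low-rank `down`/`up` pair, the `bias` term, and the names `matvec`, `serve_batch`, and `pet_params` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def serve_batch(shared_W, pet_params, batch):
    """Run one mixed-task batch through a single linear layer.

    batch is a list of (task_id, input_vector) pairs.
    pet_params maps task_id -> (down, up, bias); this layout is an
    assumption for illustration, not the representation used by PetS.
    """
    # Step 1 (task-agnostic): the shared weight is applied to every
    # query, regardless of which task the query belongs to.
    shared_out = [matvec(shared_W, x) for _, x in batch]

    # Step 2: group query indices by task so that each task's small
    # parameter set is looked up once per batch.
    by_task = defaultdict(list)
    for idx, (task_id, _) in enumerate(batch):
        by_task[task_id].append(idx)

    # Step 3 (task-specific): add each task's correction to its queries.
    outputs = list(shared_out)
    for task_id, indices in by_task.items():
        down, up, bias = pet_params[task_id]
        for idx in indices:
            x = batch[idx][1]
            delta = matvec(up, matvec(down, x))  # low-rank update
            outputs[idx] = [
                s + d + b for s, d, b in zip(shared_out[idx], delta, bias)
            ]
    return outputs


# Toy example: a 2x2 shared weight and two hypothetical PET tasks.
shared_W = [[1, 0], [0, 1]]
pet_params = {
    "a": ([[1, 1]], [[1], [0]], [0, 0]),  # low-rank delta, zero bias
    "b": ([[0, 0]], [[0], [0]], [1, 1]),  # zero delta, bias only
}
batch = [("a", [1, 2]), ("b", [3, 4]), ("a", [0, 1])]
result = serve_batch(shared_W, pet_params, batch)
```

In this sketch only the small per-task tuples differ between tasks, so adding a task costs a few extra parameters rather than a full model copy. A real engine would run both steps as batched GPU kernels instead of Python loops.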
