
PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers

Xuechao Wei, Zhe Zhou, Yinghui Xu, Jiejing Zhang, Yuan Xie, Guangyu Sun

Citation: Xuechao Wei, Zhe Zhou, Yinghui Xu, Jiejing Zhang, Yuan Xie, Guangyu Sun. PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440206


Funding: Key-Area Research and Development Program of Guangdong Province (2021B0101310002)

PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers

  • Abstract: Deploying Transformer models under the conventional pre-train-then-fine-tune paradigm is challenging for multi-task serving: a full model copy must be maintained for every downstream task, quickly exhausting the storage and GPU memory budget. Recent algorithmic advances in parameter-efficient Transformers (PETs) show great potential to mitigate this overhead: they share the pre-trained model among tasks and fine-tune only a small portion of task-specific parameters. Unfortunately, existing serving systems neither provide flexible PET task management nor can they efficiently batch queries to different tasks, and they lack effective load balancing across downstream tasks in multi-GPU deployments. We therefore propose PetS, a scalable framework for multi-task PET inference serving. Specifically, different PET tasks are expressed in a unified algorithmic representation, which enables flexible task management. Based on this representation, we design a specialized PET inference engine that batches queries from different tasks together and executes them with task-agnostic shared operators and task-specific PET operators; equipped with this engine, PetS can serve far more concurrent tasks on a single GPU. To further improve system throughput, we propose a coordinated batching strategy that jointly considers query length, PET task type, and system load balancing. To balance load across multiple GPUs, we further introduce a load-balancing mechanism based on live PET migration. We evaluate PetS on multiple platforms, including edge, desktop, and server GPUs. Comprehensive experiments demonstrate that PetS supports up to 26x more concurrent tasks and improves serving throughput by 1.53x and 1.63x on desktop and server GPU nodes, respectively. On multiple GPUs, our load-balancing strategy provides up to an additional 29% throughput improvement.
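The core batching idea in the abstract, one task-agnostic shared operator over the whole mixed batch plus small task-specific PET operators per task, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: LoRA-style low-rank PET deltas, the toy dimensions, and all variable names (`W`, `tasks`, `owner`) are assumptions made for illustration.

```python
# Illustrative sketch (not the PetS source code): queries from different
# PET tasks share one matmul against the frozen pre-trained weight, then
# each task applies its own small low-rank update to its slice of the batch.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy hidden size
W = rng.standard_normal((d, d))          # frozen pre-trained weight, shared by all tasks

# Two PET tasks, each with a rank-2 LoRA-style delta: W_t = W + A_t @ B_t
tasks = {
    "task0": (rng.standard_normal((d, 2)), rng.standard_normal((2, d))),
    "task1": (rng.standard_normal((d, 2)), rng.standard_normal((2, d))),
}

# A batch mixing queries from both tasks (3 queries for task0, 2 for task1)
X = rng.standard_normal((5, d))
owner = ["task0", "task0", "task0", "task1", "task1"]

# 1) Task-agnostic shared operator: one large matmul for the whole batch
Y = X @ W

# 2) Task-specific PET operators: cheap low-rank updates per task slice
for name, (A, B) in tasks.items():
    rows = [i for i, t in enumerate(owner) if t == name]
    Y[rows] += X[rows] @ A @ B

# The result matches running each query through its own fine-tuned weights
for i, t in enumerate(owner):
    A, B = tasks[t]
    assert np.allclose(Y[i], X[i] @ (W + A @ B))
```

The point of the decomposition is that step 1 amortizes the dominant dense computation across all tasks' queries, while step 2 touches only the tiny task-specific parameters, which is what lets a single GPU hold and serve many PET tasks at once.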
Publication history
  • Received: 2024-03-20
  • Published online: 2025-03-02
