Citation: Wei Xuechao, Zhou Zhe, Xu Yinghui, Zhang Jiejing, Xie Yuan, Sun Guangyu. PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440206

PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers

Funds: This work was supported by the National Natural Science Foundation of China (62032001) and the Higher Education Discipline Innovation Project (B18001).
More Information
  • Author Bio:

    Wei Xuechao: born in 1988. PhD. Joint postdoctoral researcher at Peking University and Alibaba DAMO Academy. His main research interests include computer architecture and electronic design automation

    Zhou Zhe: born in 1997. PhD. His main research interests include computer architecture, near-data processing, domain-specific accelerators, machine learning systems, and interconnection techniques

    Xu Yinghui: born in 1973. PhD. PhD supervisor. His main research interests include generative AI, logical reasoning, and large language models for life science

    Zhang Jiejing: born in 1986. Director of the heterogeneous computing team at Alibaba Tongyi Lab. His main research interest is deep learning inference

    Xie Yuan: born in 1974. PhD. IEEE Fellow, ACM Fellow, AAAS Fellow, chair professor. His main research interests include computer architecture, VLSI design, electronic design automation, and machine learning

    Sun Guangyu: born in 1981. PhD. Professor. PhD supervisor. His main research interests include electronic design automation for domain-specific architectures and new devices, as well as architecture/circuit/device co-design and automation

  • Received Date: March 20, 2024
  • Revised Date: February 17, 2025
  • Accepted Date: March 02, 2025
  • Available Online: March 02, 2025
  • Abstract: Deploying Transformer models under the conventional pre-train-then-fine-tune paradigm is challenging for multi-task serving, because a full model copy must be maintained for each downstream task, quickly exhausting the storage budget. Recent algorithmic advances in Parameter-Efficient Transformers (PETs) have shown enormous potential to mitigate this storage overhead: they share the pre-trained model among tasks and fine-tune only a small portion of task-specific parameters. Unfortunately, existing serving systems neither provide flexible PET task management nor can they efficiently serve queries to different tasks in batches. Therefore, we propose PetS, a unified framework for multi-task PET serving. Specifically, different PET tasks are expressed by a unified representation within the same framework, which enables flexible PET task management. Based on this unified representation, we design a specialized PET inference engine that batches different tasks' queries together and executes them with task-agnostic shared operators and task-specific PET operators. Equipped with the PET inference engine, PetS scales to far more concurrent tasks on a single GPU device. To further improve system throughput, we propose a coordinated batching strategy that takes query length, PET task type, and system load balancing into consideration. To improve throughput on multi-GPU instances, we also propose a PET-migration-based load balancing strategy. We evaluate PetS on single-GPU platforms, including edge, desktop, and server GPUs, as well as on multi-GPU nodes. Comprehensive experiments demonstrate that PetS supports up to 26x more concurrent tasks and improves serving throughput by 1.53x and 1.63x on desktop and server GPU nodes, respectively. On multiple GPUs, our load-balancing strategy also provides up to 29% speedup.
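
    The abstract describes PetS's core execution model: queries from many PET tasks are batched together, a task-agnostic shared operator runs once over the whole batch, and lightweight task-specific PET operators are then applied per task. The sketch below illustrates that split for a single linear layer, assuming PyTorch and a LoRA-style PET; the names (batched_pet_linear, lora_A, lora_B) are illustrative and do not come from PetS's actual implementation.

        # Minimal sketch of shared-operator + per-task PET-operator execution.
        # Assumption: each task's PET is a LoRA-style low-rank update; PetS
        # supports several PET types, and this only illustrates the split.
        import torch

        def batched_pet_linear(x, shared_weight, task_ids, lora_A, lora_B):
            """x:             (total_tokens, hidden) concatenated queries of all tasks
               shared_weight: (hidden, hidden) frozen pre-trained weight, shared by tasks
               task_ids:      (total_tokens,) PET task id of each row of x
               lora_A/lora_B: dicts mapping task id -> low-rank factors (hidden, r), (r, hidden)
            """
            # Task-agnostic shared operator: one dense GEMM over every task's queries.
            y = x @ shared_weight
            # Task-specific PET operators: a small low-rank correction per task.
            for tid in task_ids.unique().tolist():
                rows = task_ids == tid
                y[rows] += (x[rows] @ lora_A[tid]) @ lora_B[tid]
            return y

        # Example: 6 queries from 2 PET tasks, hidden size 8, LoRA rank 2.
        hidden, r = 8, 2
        x = torch.randn(6, hidden)
        task_ids = torch.tensor([0, 0, 1, 1, 1, 0])
        W = torch.randn(hidden, hidden)                      # frozen backbone weight
        lora_A = {t: torch.randn(hidden, r) for t in (0, 1)}
        lora_B = {t: torch.randn(r, hidden) for t in (0, 1)}
        out = batched_pet_linear(x, W, task_ids, lora_A, lora_B)  # shape (6, 8)

    The point of the split is that the expensive GEMM over the frozen pre-trained weight is amortized across all tasks' queries, while each task touches only its own small PET parameters, which is what allows a single GPU to serve many concurrent PET tasks.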

  • [1]
    Devlin J, Chang M, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding [C] //Conf of the 34th North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 4171−4186
    [2]
    Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners [EB/OL]. San Francisco, CA: OpenAI, 2019[2024-10-04]. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    [3]
    Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners [C] //Proc of the 34th Int Conf on Neural Information Processing Systems. New York: ACM, 2020: 1877−1901
    [4]
    Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training [EB/OL]. San Francisco, CA: OpenAI, 2018[2024-10-04]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
    [5]
    Liu Yinhan, Ott M, Goyal N, et al. Roberta: A robustly optimized BERT pretraining approach [J]. arXiv preprint, arXiv: 1907.11692, 2019
    [6]
    Yang Zhilin, Dai Zihang, Yang Yiming, et al. XLNet: Generalized autoregressive pretraining for language understanding [C] //Proc of the 33rd Int Conf on Neural Information Processing Systems. New York: ACM, 2019: 5753−5763
    [7]
    Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485−5551
    [8]
    Zhang Sunan, Roller S, Goyal N, et al. Opt: Open pre-trained transformer language models [J]. arXiv preprint, arXiv: 2205.01068, 2022
    [9]
    Liu Ze, Lin Yutong, Cao Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows [C] //Proc of the 35th IEEE/CVF Int Conf on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2021: 10012−10022
    [10]
    Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [J]. arXiv preprint, arXiv: 2010.11929, 2020
    [11]
    Hu Qinghao, Ye Zhisheng, Wang Zerui, et al. Characterization of large language model development in the datacenter [C] //Proc of the 21st USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2024: 709−729
    [12]
    Fang Jiarui, Yu Yang, Zhao Chengduo, et al. TurboTransformers: An efficient GPU serving system for transformer models [C] //Proc of the 26th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2021: 389−402
    [13]
    Crankshaw D, Wang Xin, Zhou Guilio, et al. Clipper: A low-latency online prediction serving system [C] //Proc of the 14th USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2017: 613−627
    [14]
    Gao Pin, Yu Lingfan, Wu Yongwei, et al. Low latency RNN inference with cellular batching [C/OL] //Proc of the 13th EuroSys Conf. New York: ACM, 2018[2025-01-18]. https://dl.acm.org/doi/10.1145/3190508.3190541
    [15]
    Guo Demi, Rush A, Kim Y. Parameter-efficient transfer learning with diff pruning [C] //Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2021: 4884−4896
    [16]
    Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP [C] //Proc of the 36th Int Conf on Machine Learning. New York: PMLR, 2019: 2790−2799
    [17]
    Zaken E B, Ravfogel S, Goldberg Y. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models [C/OL] //Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022[2025-01-18]. https://aclanthology.org/2022.acl-short.1.pdf
    [18]
    Zhao Mengjie, Lin Tao, Mi Fei, et al. Masking as an efficient alternative to finetuning for pretrained language models [C] //Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 2226−2241
    [19]
    Hu E, Shen Yelong, Wallis P, et al. LoRA: Low-rank adaptation of large language models [J]. arXiv preprint, 2021: arXiv: 2106.09685
    [20]
    NVIDIA. Fast Transformer [EB/OL]. 2021[2024-10-04]. https://github.com/NVIDIA/FasterTransformer
    [21]
    Wang Xiaohui, Xiong Ying, Wei Yang, et al. LightSeq: A high performance inference library for transformers [J]. arXiv preprint, arXiv: 2010.13887, 2020
    [22]
    Gururajan A K, Lopez-Cuena E, Bayarri-Planas J, et al. Aloe: A family of fine-tuned open healthcare LLMs [J]. arXiv preprint, arXiv: 2405.01886, 2024
    [23]
    Gupta A, Shirgaonkar A, Balaguer A D L, et al. RAG vs fine-tuning: Pipelines, tradeoffs, and a case study on agriculture [J]. arXiv preprint, arXiv: 2401.08406, 2024
    [24]
    Yang Hongyang, Liu X, Wang C D. Fingpt: Open-source financial large language models [J]. arXiv preprint, arXiv: 2306.06031, 2023
    [25]
    Romero F, Li Qian, Yadwadkar N J, et al. INFaaS: Automated model-less inference serving [C] //Proc of the 29th USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2021: 397−411
    [26]
    Shen Haichen, Chen Lequn, Jin Yuchen, et al. Nexus: A GPU cluster engine for accelerating DNN-based video analysis[C] //Proc of the 27th ACM Symp on Operating Systems Principles. New York: ACM, 2019: 322−337
    [27]
    Sidhu S, Wing J, Japi A. Rafiqi: A GPU-based deep learning model serving system [R]. Berkeley, CA: University of California, 2020
    [28]
    NVIDIA. Triton inference server [EB/OL]. 2018[2024-10-04]. https://developer.nvidia.com/nvidia-triton-inference-server
    [29]
    Google. TensorFlow serving [EB/OL]. 2016[2024-10-04]. https://github.com/tensorflow/serving
    [30]
    Prasanna S, Rogers A, Rumshisky A. When bert plays the lottery, all tickets are winning [J]. arXiv preprint, arXiv: 2005.00561, 2020
    [31]
    Mao Yuning, Mathias L, Hou Rui, et al. UniPELT: A unified framework for parameter-efficient language model tuning [C] //Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 6253−6264
    [32]
    Gale T, Zaharia M, Young C, et al. Sparse GPU kernels for deep learning [C/OL] //Proc of the 33rd Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2020[2025-01-18]. https://dl.acm.org/doi/10.5555/3433701.3433723
