Citation: Wei Xuechao, Zhou Zhe, Xu Yinghui, Zhang Jiejing, Xie Yuan, Sun Guangyu. PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440206

PetS: A Scalable Inference Serving System for Parameter-Efficient Transformers

Funds: This work was supported by the National Natural Science Foundation of China (62032001) and the Higher Education Discipline Innovation Project (B18001).
More Information
  • Author Bio:

    Wei Xuechao: born in 1988. PhD. Joint postdoctoral researcher at Peking University and Alibaba DAMO Academy. His main research interests include computer architecture and electronic design automation

    Zhou Zhe: born in 1997. PhD. His main research interests include computer architecture, near-data processing, domain-specific accelerators, machine learning systems, and interconnection techniques

    Xu Yinghui: born in 1973. PhD. PhD supervisor. His main research interests include generative AI, logical reasoning, and large language models for life science

    Zhang Jiejing: born in 1986. Director of the heterogeneous computing team at Alibaba Tongyi Lab. His main research interest is deep learning inference

    Xie Yuan: born in 1974. PhD. IEEE Fellow, ACM Fellow, AAAS Fellow, chair professor. His main research interests include computer architecture, VLSI design, electronic design automation, and machine learning

    Sun Guangyu: born in 1981. PhD. Professor. PhD supervisor. His main research interests include electronic design automation for domain-specific architectures and new devices, as well as architecture/circuit/device co-design and automation

  • Received Date: March 20, 2024
  • Revised Date: February 17, 2025
  • Accepted Date: March 02, 2025
  • Available Online: March 02, 2025
  • Abstract: Deploying Transformer models under the conventional pre-train-then-fine-tune paradigm is challenging for multi-task serving, because a full model copy must be maintained for each downstream task, quickly exhausting the storage budget. Recent algorithmic advances in Parameter-Efficient Transformers (PETs) have shown enormous potential to mitigate this storage overhead: they share the pre-trained model among tasks and fine-tune only a small portion of task-specific parameters. Unfortunately, existing serving systems neither provide flexible PET task management nor can they efficiently serve queries to different tasks in batches. Therefore, we propose PetS, a unified framework for multi-task PET serving. Specifically, different PET tasks are expressed by a unified representation within the same framework, which enables flexible PET task management. Based on this unified representation, we design a specialized PET inference engine that batches different tasks' queries together and executes them with task-agnostic shared operators and task-specific PET operators. Equipped with the PET inference engine, PetS scales to far more concurrent tasks on a single GPU device. To further improve system throughput, we propose a coordinated batching strategy that takes query length, PET task type, and system load balancing into consideration. To improve throughput on multi-GPU instances, we also propose a PET-migration-based load balancing strategy. We evaluate PetS on single-GPU platforms, including edge, desktop, and server GPUs, as well as on multi-GPU nodes. Comprehensive experiments demonstrate that PetS supports up to 26x more concurrent tasks and improves serving throughput by 1.53x and 1.63x on desktop and server GPU nodes, respectively. On multiple GPUs, our load-balancing strategy also provides up to 29% speedup.
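
    The abstract describes PetS's core execution model: queries from many PET tasks are batched together, a task-agnostic shared operator runs once over the whole batch, and lightweight task-specific PET operators are then applied per task. The sketch below illustrates that split for a single linear layer, assuming PyTorch and a LoRA-style PET; the names (batched_pet_linear, lora_A, lora_B) are illustrative and do not come from PetS's actual implementation.

        # Minimal sketch of shared-operator + per-task PET-operator execution.
        # Assumption: each task's PET is a LoRA-style low-rank update; PetS
        # supports several PET types, and this only illustrates the split.
        import torch

        def batched_pet_linear(x, shared_weight, task_ids, lora_A, lora_B):
            """x:             (total_tokens, hidden) concatenated queries of all tasks
               shared_weight: (hidden, hidden) frozen pre-trained weight, shared by tasks
               task_ids:      (total_tokens,) PET task id of each row of x
               lora_A/lora_B: dicts mapping task id -> low-rank factors (hidden, r), (r, hidden)
            """
            # Task-agnostic shared operator: one dense GEMM over every task's queries.
            y = x @ shared_weight
            # Task-specific PET operators: a small low-rank correction per task.
            for tid in task_ids.unique().tolist():
                rows = task_ids == tid
                y[rows] += (x[rows] @ lora_A[tid]) @ lora_B[tid]
            return y

        # Example: 6 queries from 2 PET tasks, hidden size 8, LoRA rank 2.
        hidden, r = 8, 2
        x = torch.randn(6, hidden)
        task_ids = torch.tensor([0, 0, 1, 1, 1, 0])
        W = torch.randn(hidden, hidden)                      # frozen backbone weight
        lora_A = {t: torch.randn(hidden, r) for t in (0, 1)}
        lora_B = {t: torch.randn(r, hidden) for t in (0, 1)}
        out = batched_pet_linear(x, W, task_ids, lora_A, lora_B)  # shape (6, 8)

    The point of the split is that the expensive GEMM over the frozen pre-trained weight is amortized across all tasks' queries, while each task touches only its own small PET parameters, which is what allows a single GPU to serve many concurrent PET tasks.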

  • [1]
    Devlin J, Chang M, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding [C] //Conf of the 34th North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 4171−4186
    [2]
    Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners [EB/OL]. San Francisco, CA: OpenAI, 2019[2024-10-04]. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    [3]
    Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners [C] //Proc of the 34th Int Conf on Neural Information Processing Systems. New York: ACM, 2020: 1877−1901
    [4]
    Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training [EB/OL]. San Francisco, CA: OpenAI, 2018[2024-10-04]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
    [5]
    Liu Yinhan, Ott M, Goyal N, et al. Roberta: A robustly optimized BERT pretraining approach [J]. arXiv preprint, arXiv: 1907.11692, 2019
    [6]
    Yang Zhilin, Dai Zihang, Yang Yiming, et al. XLNet: Generalized autoregressive pretraining for language understanding [C] //Proc of the 33rd Int Conf on Neural Information Processing Systems. New York: ACM, 2019: 5753−5763
    [7]
    Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485−5551
    [8]
    Zhang Sunan, Roller S, Goyal N, et al. Opt: Open pre-trained transformer language models [J]. arXiv preprint, arXiv: 2205.01068, 2022
    [9]
    Liu Ze, Lin Yutong, Cao Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows [C] //Proc of the 35th IEEE/CVF Int Conf on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2021: 10012−10022
    [10]
    Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [J]. arXiv preprint, arXiv: 2010.11929, 2020
    [11]
    Hu Qinghao, Ye Zhisheng, Wang Zerui, et al. Characterization of large language model development in the datacenter [C] //Proc of the 21st USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2024: 709−729
    [12]
    Fang Jiarui, Yu Yang, Zhao Chengduo, et al. TurboTransformers: An efficient GPU serving system for transformer models [C] //Proc of the 26th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2021: 389−402
    [13]
    Crankshaw D, Wang Xin, Zhou Guilio, et al. Clipper: A low-latency online prediction serving system [C] //Proc of the 14th USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2017: 613−627
    [14]
    Gao Pin, Yu Lingfan, Wu Yongwei, et al. Low latency RNN inference with cellular batching [C/OL] //Proc of the 13th EuroSys Conf. New York: ACM, 2018[2025-01-18]. https://dl.acm.org/doi/10.1145/3190508.3190541
    [15]
    Guo Demi, Rush A, Kim Y. Parameter-efficient transfer learning with diff pruning [C] //Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2021: 4884−4896
    [16]
    Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP [C] //Proc of the 36th Int Conf on Machine Learning. New York: PMLR, 2019: 2790−2799
    [17]
    Zaken E B, Ravfogel S, Goldberg Y. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models [C/OL] //Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022[2025-01-18]. https://aclanthology.org/2022.acl-short.1.pdf
    [18]
    Zhao Mengjie, Lin Tao, Mi Fei, et al. Masking as an efficient alternative to finetuning for pretrained language models [C] //Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 2226−2241
    [19]
    Hu E, Shen Yelong, Wallis P, et al. LoRA: Low-rank adaptation of large language models [J]. arXiv preprint, 2021: arXiv: 2106.09685
    [20]
    NVIDIA. Fast Transformer [EB/OL]. 2021[2024-10-04]. https://github.com/NVIDIA/FasterTransformer
    [21]
    Wang Xiaohui, Xiong Ying, Wei Yang, et al. LightSeq: A high performance inference library for transformers [J]. arXiv preprint, arXiv: 2010.13887, 2020
    [22]
    Gururajan A K, Lopez-Cuena E, Bayarri-Planas J, et al. Aloe: A family of fine-tuned open healthcare LLMs [J]. arXiv preprint, arXiv: 2405.01886, 2024
    [23]
    Gupta A, Shirgaonkar A, Balaguer A D L, et al. RAG vs fine-tuning: Pipelines, tradeoffs, and a case study on agriculture [J]. arXiv preprint, arXiv: 2401.08406, 2024
    [24]
    Yang Hongyang, Liu X, Wang C D. Fingpt: Open-source financial large language models [J]. arXiv preprint, arXiv: 2306.06031, 2023
    [25]
    Romero F, Li Qian, Yadwadkar N J, et al. INFaaS: Automated model-less inference serving [C] //Proc of the 29th USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2021: 397−411
    [26]
    Shen Haichen, Chen Lequn, Jin Yuchen, et al. Nexus: A GPU cluster engine for accelerating DNN-based video analysis[C] //Proc of the 27th ACM Symp on Operating Systems Principles. New York: ACM, 2019: 322−337
    [27]
    Sidhu S, Wing J, Japi A. Rafiqi: A GPU-based deep learning model serving system [R]. Berkeley, CA: University of California, 2020
    [28]
    NVIDIA. Triton inference server [EB/OL]. 2018[2024-10-04]. https://developer.nvidia.com/nvidia-triton-inference-server
    [29]
    Google. TensorFlow serving [EB/OL]. 2016[2024-10-04]. https://github.com/tensorflow/serving
    [30]
    Prasanna S, Rogers A, Rumshisky A. When bert plays the lottery, all tickets are winning [J]. arXiv preprint, arXiv: 2005.00561, 2020
    [31]
    Mao Yuning, Mathias L, Hou Rui, et al. UniPELT: A unified framework for parameter-efficient language model tuning [C] //Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 6253−6264
    [32]
    Gale T, Zaharia M, Young C, et al. Sparse GPU kernels for deep learning [C/OL] //Proc of the 33rd Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2020[2025-01-18]. https://dl.acm.org/doi/10.5555/3433701.3433723
