• China Premium Science and Technology Journal
  • CCF Recommended Class A Chinese Journal
  • T1-Class High-Quality Science and Technology Journal in the Computing Field
Wang Rui, Zhang Liuyang, Gao Zhiyong, Jiang Tongyun. Research Progress on Large Models for Edge Intelligence[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440385

Research Progress on Large Models for Edge Intelligence

More Information
  • Author Bio:

    Wang Rui: born in 1975. PhD, professor. Senior member of CCF. His main research interests include IoT, edge intelligence, and smart healthcare

    Zhang Liuyang: born in 2000. Master candidate. His main research interests include federated learning and large model training

    Gao Zhiyong: born in 2000. Master candidate. His main research interests include large model inference and machine learning

    Jiang Tongyun: born in 2000. Master candidate. Her main research interests include edge intelligence and large models

  • Received Date: May 30, 2024
  • Accepted Date: January 25, 2025
  • Available Online: January 25, 2025
  • Abstract: With the rapid development of large model technology, large models have shown remarkable performance in fields such as natural language processing and computer vision, becoming essential tools for solving complex problems and attracting strong interest from both academia and industry. However, current cloud-based approaches to training and inference of large models face multiple challenges, including high cost, limited scalability, and data security risks. As model parameter counts continue to grow, the need for low-cost, efficient training and inference methods becomes ever more pressing. Collaborative training and inference of large models on edge devices can significantly reduce latency and bandwidth requirements while strengthening data privacy and operational efficiency, providing essential technical support for the low-cost deployment of large models across diverse scenarios; it has therefore become a prominent research hotspot. This article surveys research on large models for edge intelligence, with an in-depth analysis and discussion of two main aspects: edge-based training and edge-based inference of large models. Finally, it outlines the challenges facing large model technologies for edge intelligence and discusses future directions, aiming to deepen the understanding of, and attention to, these technologies in both academia and industry, and to encourage further research in this active area.

  • [1]
    OpenAI. ChatGPT: Optimizing language models for dialogue[EB/OL]. (2022-12-30)[2024-02-10]. https://‌openai.‌com/‌blog/‌chatgpt/‌#rf2
    [2]
    Achiam J, Adler S, Agarwal S, et al. GPT−4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
    [3]
    Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
    [4]
    Liu Haotian, Li Chunyuan, Wu Qingyang, et al. Visual instruction tuning[J]. arXiv preprint, arXiv: 2304.08485, 2023
    [5]
    Kirillov A, Mintun E, Ravi N, etc. Segment anything[J]. arXiv preprint, arXiv: 2304.02643, 2023
    [6]
    Touvron H, Martin L, Stone K, el al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint, arXiv: 2307.09288, 2023
    [7]
    王睿,齐建鹏,陈亮,等. 面向边缘智能的协同推理综述[J]. 计算机研究与发展,2021,60(2):398−414

    Wang Rui, Qi Jianpeng, Chen Liang, et al. Survey of collaborative inference for edge intelligence[J]. Journal of Computer Research and Development, 2021, 60(2): 398−414 (in Chinese)
    [8]
    Alizadeh K, Mirzadeh I, Belenko D, et al. LLM in a flash: Efficient large language model inference with limited memory[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 12562–12584
    [9]
    Mcmahan H B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data[C]//Proc of the 20th Int Conf on Artificial Intelligence and Statistics PMLR. New York: ACM, 2017: 1273−1282
    [10]
    Custers B, Sears A M, Dechesne F, et al. EU Personal Data Protection in Policy and Practice[M]. The Hague, The Netherlands: TMC Asser Press, 2019
    [11]
    Lambda. OpenAI’s GPT−3 language model: A technical overview[EB/OL]. (2020-06-03)[2024-01-08]. https://lambdalabs.com/blog/demystifying-gpt-3#1
    [12]
    Ananthaswamy A. In AI, is bigger always better?[J]. Nature, 2023, 615(7951): 202−205 doi: 10.1038/d41586-023-00641-w
    [13]
    Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[C]//Proc of the 33rd Int Conf on Neural Information Processing Systems. New York: ACM, 2020: 1877−1901
    [14]
    Lv Kai, Yang Yuqing, Liu Tengxiao, et al. Full parameter fine-tuning for large language models with limited resources[J]. arXiv preprint, arXiv: 2306.09782, 2023
    [15]
    Lv Kai, Yan Hang, Guo Qipeng, et al. AdaLomo: Low-memory optimization with adaptive learning rate[J]. arXiv preprint, arXiv: 2310.10195, 2023
    [16]
    Malladi S, Gao Tianyu, Nichani E, et al. Fine-tuning language models with just forward passes[J]. arXiv preprint, arXiv: 2305.17333, 2023
    [17]
    Ding Ning, Qin Yujia, Yang Guang, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models[J]. Nature Machine Intelligence, 2023, 5(3): 220−235 doi: 10.1038/s42256-023-00626-4
    [18]
    Chen Chaochao, Feng Xiaohua, Zhou Jun, et al. Federated large language model: A position paper[J]. arXiv preprint, arXiv: 2307.08925, 2023
    [19]
    Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]//Proc of the 36th Int Conf on Machine Learning PMLR. New York: ACM, 2019: 2790−2799
    [20]
    Hu Zhiqiang, Lan Yihuai, Wang Lei, et al. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models[J]. arXiv preprint, arXiv: 2304.01933, 2023
    [21]
    Karimi M, Henderson J, Ruder S. Compacter: Efficient low-rank hypercomplex adapter layers[C]//Proc of the 34th Int Conf on Neural Information Processing Systems. New York: ACM, 2021: 1022−1035
    [22]
    Li X, Liang P. Prefix-tuning: Optimizing continuous prompts for generation[J]. arXiv preprint, arXiv: 2101.00190, 2021
    [23]
    Zhang Renrui, Han Jiaming, Zhou Aojun, et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention[J]. arXiv preprint, arXiv: 2303.16199, 2023
    [24]
    Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning[J]. arXiv preprint, arXiv: 2104.08691, 2021
    [25]
    Sun Tianxiang, He Zhengfu, Zhu Qin, et al. Multitask pre-training of modular prompt for chinese few-shot learning[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 11156−11172
    [26]
    Gu Yuxian, Han Xu, Liu Zhiyuan, et al. PPT: Pre-trained prompt tuning for few-shot learning[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 8410−8423
    [27]
    Zhang Qingru, Chen Minshuo, Bukharin A, et al. Adaptive budget allocation for parameter-efficient fine-tuning[J]. arXiv preprint, arXiv: 2303.10512, 2023
    [28]
    Chen Yukang, Qian Shengju, Tang Haotian, et al. Longlora: Efficient fine-tuning of long-context large language models[J]. arXiv preprint, arXiv: 2309.12307, 2023
    [29]
    Chua T J, Yu Wenhan, Zhao Jun, et al. FedPEAT: Convergence of federated learning, parameter-efficient fine tuning, and emulator assisted tuning for artificial intelligence foundation models with mobile edge computing[J]. arXiv preprint, arXiv: 2310.17491, 2023
    [30]
    Che Tianshi, Liu Ji, Zhou Yang, et al. Federated learning of large language models with parameter-efficient prompt tuning and adaptive optimization[J]. arXiv preprint, arXiv: 2310.15080, 2023
    [31]
    Babakniya S, Elkordy A R, Ezzeldin Y H, et al. SLoRA: Federated parameter efficient fine-tuning of language models[J]. arXiv preprint, arXiv: 2308.06522, 2023
    [32]
    Zhang Zhuo, Yang Yuanhang, Dai Yong, et al. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 9963−9977
    [33]
    Kuang Weirui, Qian Bingchen, Li Zitao, et al. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning[J]. arXiv preprint, arxiv: 2309.00363, 2023
    [34]
    Fan Tao, Kang Yan, Ma Guoqiang, et al. Fate-llm: A industrial grade federated learning framework for large language models[J]. arXiv preprint, arxiv: 2310.10049, 2023
    [35]
    Chen Haokun, Zhang Yao, Krompass D, et al. FedDAT: An approach for foundation model finetuning in multi-modal heterogeneous federated Learning[J]. arXiv preprint, arXiv: 2308.12305, 2023
    [36]
    Guo Tao, Guo Song, Wang Junxiao, et al. Promptfl: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model[J]. IEEE Transactions on Mobile Computing, 2023, 23(5): 5179−5194
    [37]
    Xu Mengwei, Yin Wangsong, Cai Dongqi, et al. A survey of resource-efficient LLM and multimodal foundation models[J]. arXiv preprint, arXiv: 2401.08092, 2024
    [38]
    Wan Zhongwei, Wang Xin, Liu Che, et al. Efficient large language models: A survey[J]. arXiv preprint, arXiv: 2312.03863, 2023
    [39]
    Miao Xupeng, Oliaro G, Zhang Zhihao, et al. Towards efficient generative large language model serving: A survey from algorithms to systems[J]. arXiv preprint, arXiv: 2312.15234, 2023
    [40]
    Kachris C. A survey on hardware accelerators for large language models[J]. arXiv preprint, arXiv: 2401.09890, 2024
    [41]
    Zhong Juan, Liu Zheng, Chen Xi. Transformer-based models and hardware acceleration analysis in autonomous driving: A survey[J]. arXiv preprint, arXiv: 2304.10891, 2023
    [42]
    Emani M, Foreman S, Sastry V, et al. A comprehensive performance study of large language models on novel AI accelerators[J]. arXiv preprint, arXiv: 2310.04607, 2023
    [43]
    张晓东,张朝昆,赵继军. 边缘智能研究进展[J]. 计算机研究与发展,2023,60(12):2749−2769 doi: 10.7544/issn1000-1239.202220192

    Zhang Xiaodong, Zhang Chaokun, Zhao Jijun. State-of-the-Art survey on edge intelligence[J]. Journal of Computer Research and Development, 2023, 60(12): 2749−2769 (in Chinese) doi: 10.7544/issn1000-1239.202220192
    [44]
    Zhu Xunyu, Li Jian, Liu Yong, et al. A survey on model compression for large language models[J]. arXiv preprint, arXiv: 2308.07633, 2023
    [45]
    Ma Xinyin, Fang Gongfan, Wang Xinchao. LLM-Pruner: On the structural pruning of large language models[J]. arXiv preprint, arXiv: 2305.11627, 2023
    [46]
    Xia Mengzhou, Gao Tianyu, Zeng Zhiyuan, et al. Sheared LLaMA: Accelerating language model pre-training via structured pruning[J]. arXiv preprint, arXiv: 2310.06694, 2023
    [47]
    Wang Hanrui, Zhang Zhekai, Han Song. SpAtten: Efficient sparse attention architecture with cascade token and head pruning[C]//Proc of the 27th IEEE Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2021: 97−110
    [48]
    Zhang Mingyang, Chen Hao, Shen Chunhua, et al. LoRAPrune: Pruning meets low-rank parameter-efficient fine-tuning[J]. arXiv preprint, arXiv: 2305.18403, 2023
    [49]
    Xia Haojun, Zheng Zhen, Li Yuchao, et al. Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity[J]. arXiv preprint, arXiv: 2309.10285, 2023
    [50]
    Frantar E, Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 10323−10337
    [51]
    Sun Mingjie, Liu Zhuang, Bair A, et al. A simple and effective pruning approach for large language models[J]. arXiv preprint, arXiv: 2306.11695, 2023
    [52]
    Liang Chen, Zuo Simiao, Zhang Qingru, et al. Less is more: Task-aware layer-wise distillation for language model compression[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 20852−20867
    [53]
    Zhang Chen, Song Dawei, Ye Zheyu, et al. Towards the law of capacity gap in distilling language models[J]. arXiv preprint, arXiv: 2311.07052, 2023
    [54]
    Padmanabhan S, Onoe Y, Zhang M, et al. Propagating knowledge updates to LMs through distillation[J]. arXiv preprint, arXiv: 2306.09306, 2023
    [55]
    Agarwal R, Vieillard N, Zhou Yongchao, et al. On-policy distillation of language models: Learning from self-generated mistakes[J]. arXiv preprint, arXiv: 2306.13649, 2024
    [56]
    Gu Yuxian, Dong Li, Wei Furu, et al. Knowledge distillation of large language models[J]. arXiv preprint, arXiv: 2306.08543, 2023
    [57]
    Timiryasov I, Tastet J L. Baby llama: Knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty[J]. arXiv preprint, arXiv: 2308.02019, 2023
    [58]
    Xiong Yunyang, Varadarajan B, Wu Lemeng, et al. EfficientSAM: Leveraged masked image pretraining for efficient segment anything[J]. arXiv preprint, arXiv: 2312.00863, 2023
    [59]
    Yuan Jianlong, Phan M H, Liu Liyang, et al. FAKD: Feature augmented knowledge distillation for semantic segmentation[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 595−605
    [60]
    Nasser S A, Gupte N, Sethi A. Reverse knowledge distillation: Training a large model using a small one for retinal image matching on limited data[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 7778−7787
    [61]
    Zhu Xuekai, Qi Biqing, Zhang Kaiyan, et al. PaD: Program-aided distillation specializes large models in reasoning[J]. arXiv preprint, arXiv: 2305.13888, 2023
    [62]
    Li L H, Hessel J, Yu Youngjae, et al. Symbolic chain-of-thought distillation: Small models can also “think” step-by-step[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 2665−2679
    [63]
    Shridhar K, Stolfo A, Sachan M. Distilling reasoning capabilities into smaller language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 7059−7073
    [64]
    Ho N, Schmid L, Yun S Y. Large language models are reasoning teachers[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 14852−14882
    [65]
    Wang Peifeng, Wang Zhengyang, Li Zheng, et al. SCOTT: Self-consistent chain-of-thought distillation[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 5546−5558
    [66]
    Hsieh C Y, Li C L, Yeh C K, et al. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 8003−8017
    [67]
    Chen Zeming, Gao Qiyue, Bosselut A, et al. DISCO: Distilling counterfactuals with large language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 5514−5528
    [68]
    Jiang Yuxin, Chan C, Chen Mingyang, et al. Lion: Adversarial distillation of proprietary large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL 2023: 3134−3154
    [69]
    Fu Yao, Peng Hao, Ou Litu, et al. Specializing smaller language models towards multi-step reasoning[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 10421−10430
    [70]
    Wu Minghao, Waheed A, Zhang Chiyu, et al. LaMini-LM: A diverse herd of distilled models from large-scale instructions[J]. arXiv preprint, arXiv: 2304.14402, 2024
    [71]
    Lin Ji, Tang Jiaming, Tang Haotian, et al. AWQ: Activation-aware weight quantization for LLM compression and acceleration[J]. arXiv preprint, arXiv: 2306.00978, 2023
    [72]
    Li Qingyuan, Zhang Yifan, Li Liang, et al. FPTQ: Fine-grained post-training quantization for large language models[J]. arXiv preprint, arXiv: 2308.15987, 2023
    [73]
    Wei Xiuying, Zhang Yunchen, Zhang Xiangguo, et al. Outlier suppression: Pushing the limit of low-bit transformer language models[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 17402−17414
    [74]
    Wei Xiuying, Zhang Yunchen, Li Yuhang, et al. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 1648−1665
    [75]
    Guo Cong, Tang Jiaming, Hu Weiming, et al. OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization[C/OL]//Proc of the 50th Annual Int Symp on Computer Architecture. New York: ACM, 2023[2024-9-10]. https: /doi.org/10.1145/3579371.3589038
    [76]
    Yao Zhewei, Yazdani A R, Zhang Minjia, et al. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 27168−27183
    [77]
    Dettmers T, Lewis M, Belkada Y, et al. LLM. int8(): 8-bit matrix multiplication for transformers at scale[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 30318−30332
    [78]
    Frantar E, Ashkboos S, Hoefler T, et al. GPTQ: Accurate quantization for generative pre-trained transformers[C/OL]//Proc of the 11th Int Conf on Learning Representations. OpenReview. net, 2023[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌tcbBPnfwxS
    [79]
    Xiao Guangxuan, Lin Ji, Seznec M, et al. SmoothQuant: Accurate and efficient post-training quantization for large language models[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 38087−38099
    [80]
    Dettmers T, Svirschevski R, Egiazarian V, et al. SpQR: A sparse-quantized representation for near-lossless LLM weight compression[J]. arXiv preprint, arXiv: 2306.03078, 2023
    [81]
    Lee Changhun, Jin Jungyu, Kim T, et al. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models[C]//Proc of the 38th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024: 13355−13364
    [82]
    Wang Hongyu, Ma Shuming, Dong Li, et al. BitNet: Scaling 1-bit transformers for large language models[J]. arXiv preprint, arXiv: 2310.11453, 2023
    [83]
    Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient finetuning of quantized LLMs[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 10088−10115
    [84]
    Kim J, Lee J H, Kim S, et al. Memory-efficient fine-tuning of compressed large language models via sub−4-bit integer quantization[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 36187−36207
    [85]
    Liu Zechun, Oguz B, Zhao Changsheng, et al. LLM-QAT: Data-free quantization aware training for large language models[J]. arXiv preprint, arXiv: 2305.17888, 2023
    [86]
    Liu Xinyu, Wang Tao, Yang Jiaming, et al. MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment[J]. Neurocomputing, 2024, 574: 127210 doi: 10.1016/j.neucom.2023.127210
    [87]
    Kaushal A, Vaidhya T, Rish I. LORD: Low rank decomposition of monolingual code LLMs for one-shot compression[C/OL]//Proc of the 41st ICML 2024 Workshop on Foundation Models in the Wild. OpenReview. net, 2024[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌br49PQvuMp
    [88]
    Li Yixiao, Yu Yifan, Zhang Qingru, et al. LoSparse: Structured compression of large language models based on low-rank and sparse approximation[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 20336−20350
    [89]
    Xu Mingxue, Xu Yaolei, Mandic D P. TensorGPT: Efficient compression of the embedding layer in LLMs based on the tensor-train decomposition[J]. arXiv preprint, arXiv: 2307.00526, 2023
    [90]
    Chang C C, Sung Y Y, Yu Shixing, et al. FLORA: Fine-grained low-rank architecture search for vision transformer[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 2482−2491
    [91]
    Benedek N, Wolf L. PRILoRA: Pruned and rank-increasing low-rank adaptation[J]. arXiv preprint, arXiv: 2401.11316, 2024
    [92]
    Cheng Hongrong, Zhang Miao, Shi J Q. A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations[J]. arXiv preprint, arXiv: 2308.06767, 2023
    [93]
    Xu Xiaohan, Li Ming, Tao Chongyang, et al. A survey on knowledge distillation of large language models[J]. arXiv preprint, arXiv: 2402.13116, 2024
    [94]
    Zhu Xunyu, Li Jian, Liu Yong, et al. A survey on model compression for large language models[J]. arXiv preprint, arXiv: 2308.07633, 2023
    [95]
    Hu E, Shen Yelong, Wallis P, et al. LoRA: Low-rank adaptation of large language models[C/OL]//Proc of the 10th Int Conf on Learning Representations. OpenReview. net, 2022[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌nZeVKeeFYf9
    [96]
    Liu Jing, Gong Ruihao, Wei Xiuying, et al. QLLM: Accurate and efficient low-bitwidth quantization for large language models[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌FIplmUWdm3
    [97]
    Xiao Guangxuan, Tian Yuandong, Chen Beidi, et al. Efficient streaming language models with attention sinks[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌NG7sS51zVF
    [98]
    Liu Zichang, Desai A, Liao Fangshuo, et al. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 52342−52364
    [99]
    Zhang Zhenyu, Sheng Ying, Zhou Tianyi, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 34661−34710
    [100]
    Ge Suyu, Zhang Yunan, Liu Liyuan, et al. Model tells you what to discard: Adaptive KV cache compression for LLMs[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌uNrFpDPMyo
    [101]
    Hooper C, Kim S, Mohammadzadeh H, et al. KVQuant: Towards 10 million context length LLM Inference with KV cache quantization[J]. arXiv preprint, arXiv: 2401.18079, 2024
    [102]
    Kwon W, Li Zhuohan, Zhuang Siyuan, et al. Efficient memory management for large language model serving with pagedattention[C]//Proc of the 29th Symp on Operating Systems Principles. New York: ACM, 2023: 611−626
    [103]
    Del C L, Del G A, Agarwal S, et al. SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference[J]. arXiv preprint, arXiv: 2307.02628, 2023
    [104]
    Zeng Dewen, Du Nan, Wang Tao, et al. Learning to skip for language modeling[J]. arXiv preprint, arXiv: 2311.15436, 2023
    [105]
    Schuster T, Fisch A, Gupta J, et al. Confident adaptive language modeling[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 17456−17472
    [106]
    Sun Tianxiang, Liu Xiangyang, Zhu Wei, et al. A simple hash-based early exiting approach for language understanding and generation[J]. arXiv preprint, arXiv: 2203.01670, 2022
    [107]
    Liao Kaiyuan, Zhang Yi, Ren Xuancheng, et al. A global past-future early exit method for accelerating inference of pre-trained language models[C]//Proc of the 2021 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 2013−2023
    [108]
    Kong Jun, Wang Jin, Yu L C, et al. Accelerating inference for pretrained language models by unified multi-perspective early exiting[C]//Proc of the 29th Int Conf on Computational Linguistics. Stroudsburg, PA: ACL, 2022: 4677−4686
    [109]
    Zeng Ziqian, Hong Yihuai, Dai Hongliang, et al. ConsistentEE: A consistent and hardness-guided early exiting method for accelerating language models inference[C]//Proc of the 38th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024: 19506−19514
    [110]
    Bae S, Ko J, Song H, et al. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 5910−5924
    [111]
    Valmeekam C S K, Narayanan K K D, Kalathil D, et al. LLMZip: Lossless text compression using large language models[J]. arXiv preprint, arXiv: 2306.04050, 2023
    [112]
    Chevalier A, Wettig A, Ajith A, et al. Adapting language models to compress contexts[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 3829−3846
    [113]
    Li Yucheng, Dong Bo, Guerin F, et al. Compressing context to enhance inference efficiency of large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 6342−6353
    [114]
    Jiang Huiqiang, Wu Qianhui, Lin C Y, et al. LLMLingua: Compressing prompts for accelerated inference of large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 13358−13376
    [115]
    Jiang Huiqiang, Wu Qianhui, Luo Xufang, et al. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 1658−1677
    [116]
    Fu Yichao, Bailis P, Stoica I, et al. Break the sequential dependency of LLM inference using lookahead decoding[C]//Proc of the 41st Int Conf on Machine Learning. New York: PMLR, 2024: 14060−14079
    [117]
    Leviathan Y, Kalman M, Matias Y. Fast inference from transformers via speculative decoding[C]//Proc of the 40th int Conf on Machine Learning. New York: PMLR, 2023: 19274−19286
    [118]
    Miao Xupeng, Oliaro G, Zhang Zhihao, et al. SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification[C]//Proc of the 29th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems, Volume 3. New York: ACM, 2024: 932–949
    [119]
    Cai T, Li Yuhong, Geng Zhengyang, et al. Medusa: Simple LLM inference acceleration framework with multiple decoding heads[C]//Proc of the 41st int Conf on Machine Learning. New York: PMLR, 2024: 5209−5235
    [120]
    Li Yuhui, Wei Fangyun, Zhang Chao, et al. EAGLE: Speculative sampling requires rethinking feature uncertainty[C]//Proc of the 41st int Conf on Machine Learning. New York: PMLR, 2024: 28935−28948
    [121]
    Xu Daliang, Yin Wangsong, Jin Xin, et al. LLMCad: Fast and scalable on-device large language model inference[J]. arXiv preprint, arXiv: 2309.04255, 2023
    [122]
    Shen Haihao, Chang Hanwen, Dong Bo, et al. Efficient llm inference on cpus[J]. arXiv preprint, arXiv: 2311.00502, 2023
    [123]
    Dao T, Fu Dan, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 16344−16359
    [124]
    Dao T. FlashAttention−2: Faster attention with better parallelism and work partitioning[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://‌openreview.‌net/‌forum?‌id=‌mZn2Xyh9Ec
    [125]
    Dao T, Haziza D, Massa F, et al. Flash-Decoding for long-context inference[EB/OL]. 2023[2024-02-03]. https://‌pytorch.‌org/‌blog/‌flash-decoding/
    [126]
    Hong Ke, Dai Guohao, Xu Jiaming, et al. FlashDecoding++: Faster large language model inference with asynchronization, flat GEMM optimization, and heuristics[C/OL]//Proc of Machine Learning and Systems. 2024: 148−161[2024-09-12]. https://‌proceedings.‌mlsys.‌org/‌paper_files/‌paper/‌2024/‌hash/‌5321b1dabcd2be188d796c21b733e8c7-‌Abstract-‌Conference. ‌html
    [127]
    Lai Ruihang, Shao Junru, Feng Siyuan, et al. Relax: Composable abstractions for end-to-end dynamic machine learning[J]. arXiv preprint, arXiv: 2311.02103, 2023
    [128]
    Tillet P, Kung H T, Cox D. Triton: An intermediate language and compiler for tiled neural network computations[C]//Proc of the 3rd ACM SIGPLAN Int Workshop on Machine Learning and Programming Languages. New York: ACM, 2019: 10−19
    [129]
    Feng Siyuan, Hou Bohan, Jin Hongyi, et al. TensorIR: An abstraction for automatic tensorized program optimization[C]//Proc of the 28th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems: Volume 2. New York: ACM, 2023: 804−817
    [130]
    Liu Zichang, Wang Jue, Dao T, et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 22137−22176
    [131]
    Sheng Ying, Zheng Lianmin, Yuan Binhang, et al. FlexGen: High-throughput generative inference of large language models with a single GPU[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 31094−31116
    [132]
    Song Yixin, Mi Zeyu, Xie Haotong, et al. PowerInfer: Fast large language model serving with a consumer-grade GPU[J]. arXiv preprint, arXiv: 2312.12456, 2023
    [133]
    Yi Rongjie, Guo Liwei, Wei Shiyun, et al. EdgeMoE: Fast on-device inference of MoE-based large language models[J]. arXiv preprint, arXiv: 2308.14352, 2023
    [134]
    Awais M, Naseer M, Khan S, et al. Foundational models defining a new era in vision: A survey and outlook[J]. arXiv preprint, arXiv: 2307.13721, 2023
    [135]
    Tang Shengkun, Wang Yaqing, Kong Zhenglun, et al. You need multiple exiting: Dynamic early exiting for accelerating unified vision language model[C]//Proc of the 44th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 10781−10791
    [136]
    Li Zi, Tian Lin, Mok C W, et al. Samconvex: Fast discrete optimization for ct registration using self-supervised anatomical embedding and correlation pyramid[G]//Proc of the 26th Medical Image Computing and Computer Assisted Intervention(MICCAI 2023). Berlin: Springer, 2023: 559−569
    [137]
    Zhou Chong, Loy C C, Dai Bo. Extract free dense labels from CLIP[C]//Proc of the 17th Computer Vision(ECCV 2022). Berlin: Springer, 2022: 696−712
    [138]
    Sanghi A, Chu Hang, Lambourn J G, et al. Clip-forge: Towards zero-shot text-to-shape generation[C]//Proc of the 2022 IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 18603−18613
    [139]
    Vaswani A, Shazeer N, Parmar N, et al. Attention is All you Need[C]//Proc of the 31st Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2017: 5999−6009
    [140]
    InternLM. LMDeploy[EB/OL]. 2023[2024-02-04]. https://‌github.‌com/‌InternLM/‌lmdeploy
    [141]
    Microsoft. DeepSpeed-MII[EB/OL]. 2022[2024-02-04]. https://github.com/microsoft/DeepSpeed-MII
    [142]
    NVIDIA. TensorRT-LLM[EB/OL]. 2023[2024-02-04]. https://github.com/NVIDIA/TensorRT-LLM
    [143]
    vLLM Team. vLLM[EB/OL]. 2023[2024-02-04]. https://github.com/vllm-project/vllm
    [144]
    Lin Ji, Chen Weiming, Lin Yujun, et al. MCUNet: Tiny deep learning on IoT devices[C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2020: 11711−11722
    [145]
    Neuralmagic. DeepSparse[EB/OL]. 2021[2024-02-04]. https://github.com/neuralmagic/deepsparse
    [146]
    李双峰. TensorFlow Lite:端侧机器学习框架[J]. 计算机研究与发展,2020,57(9):1839−1853 doi: 10.7544/issn1000-1239.2020.20200291

    Li Shuangfeng. TensorFlow lite: On-device machine learning framework[J]. Journal of Computer Research and Development, 2020, 57(9): 1839−1853 (in Chinese) doi: 10.7544/issn1000-1239.2020.20200291
    [147]
    PyTorch Team. PyTorch ExecuTorch[EB/OL]. 2023[2024-05-28]. https://pytorch.org/executorch
    [148]
    Alibaba. MNN[EB/OL]. 2019[2024-06-30]. https://github.com/alibaba/MNN
    [149]
    Tencent. ncnn[EB/OL]. 2017[2024-05-30]. https://github.com/Tencent/ncnn
    [150]
    MLC Team. MLC LLM[EB/OL]. 2023[2024-02-04]. https://github.com/mlc-ai/mlc-llm
    [151]
    Gerganov G. llama.cpp[EB/OL]. 2023[2024-02-04]. https://github.com/ggerganov/llama.cpp
    [152]
    Karpathy A. llama2.c[EB/OL]. 2023[2024-02-04]. https://github.com/karpathy/llama2.c
    [153]
    Mllm Team. mllm[EB/OL]. 2023[2024-02-04]. https://github.com/UbiquitousLearning/mllm
    [154]
    Intel. Intel Extension for Transformers[EB/OL]. 2022[2024-02-04]. https://github.com/intel/intel-extension-for-transformers
    [155]
    Megvii Inc. InferLLM[EB/OL]. 2023[2024-02-04]. https://github.com/MegEngine/InferLLM
    [156]
    MIT Han Lab. TinyChatEngine[EB/OL]. 2023[2024-02-04]. https://github.com/mit-han-lab/TinyChatEngine
    [157]
    NVIDIA. NanoLLM[EB/OL]. 2024[2024-04-28]. https://github.com/dusty-nv/NanoLLM
    [158]
    Shazeer N. Fast transformer decoding: One write-head is all you need[J]. arXiv preprint, arXiv: 1911.02150, 2019
    [159]
    Ainslie J, Lee-Thorp J, de Jong M, et al. GQA: Training generalized multi-query transformer models from multi-head checkpoints[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 4895−4901
    [160]
    Choromanski K M, Likhosherstov V, Dohan D, et al. Rethinking attention with performers[C/OL]//Proc of the 9th Int Conf on Learning Representations. OpenReview.net, 2021[2024-09-10]. https://openreview.net/forum?id=Ua6zuk0WRH
    [161]
    Shazeer N. GLU variants improve transformer[J]. arXiv preprint, arXiv: 2002.05202, 2020
    [162]
    Lepikhin D, Lee H, Xu Yuanzhong, et al. GShard: Scaling giant models with conditional computation and automatic sharding[C/OL]//Proc of the 9th Int Conf on Learning Representations. OpenReview.net, 2021[2024-09-10]. https://openreview.net/forum?id=qrwe7XHTmYb
    [163]
    Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(120): 5232−5270
    [164]
    Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces[J]. arXiv preprint, arXiv: 2312.00752, 2023
    [165]
    Peng Bo, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the transformer era[C]//Proc of the Findings of the Association for Computational Linguistics (EMNLP 2023). Stroudsburg, PA: ACL, 2023: 14048−14077
    [166]
    Sun Yutao, Dong Li, Huang Shaohan, et al. Retentive network: A successor to transformer for large language models[J]. arXiv preprint, arXiv: 2307.08621, 2023
    [167]
    徐志伟,曾琛,朝鲁,等. 面向控域的体系结构:一种智能万物互联的体系结构风格[J]. 计算机研究与发展,2019,56(1):90−102 doi: 10.7544/issn1000-1239.2019.20180775

    Xu Zhiwei, Zeng Chen, Zhao Lu, et al. Domain oriented architecture: An architectural style of intelligent interconnection of all things[J]. Journal of Computer Research and Development, 2019, 56(1): 90−102 (in Chinese) doi: 10.7544/issn1000-1239.2019.20180775
    [168]
    李国杰. 对大数据的再认识[J]. 大数据,2015,1(1):8−16 doi: 10.11959/j.issn.2096-0271.2015.01.001

    Li Guojie. Further understanding of big data[J]. Big Data, 2015, 1(1): 8−16 (in Chinese) doi: 10.11959/j.issn.2096-0271.2015.01.001
    [169]
    Woisetschläger H, Isenko A, Wang Shiqiang, et al. Federated fine-tuning of LLMs on the very edge: The good, the bad, the ugly[C]//Proc of the 8th Workshop on Data Management for End-to-End Machine Learning. New York: ACM, 2024: 39−50
    [170]
    Yang Chengxu, Xu Mengwei, Wang Qipeng, et al. Flash: Heterogeneity-aware federated learning at scale[J]. IEEE Transactions on Mobile Computing, 2024, 23(1): 483−500 doi: 10.1109/TMC.2022.3214234
    [171]
    Lu Wang, Hu Xixu, Wang Jindong, et al. FedCLIP: Fast generalization and personalization for CLIP in federated learning[J]. IEEE Data Engineering Bulletin, 2023, 46(1): 52−66
    [172]
    矣晓沅,谢幸. 大模型道德价值观对齐问题剖析[J]. 计算机研究与发展,2023,60(9):1926−1945 doi: 10.7544/issn1000-1239.202330553

    Yi Xiaoyuan, Xie Xing. An analysis of the alignment of moral values in the large model[J]. Journal of Computer Research and Development, 2023, 60(9): 1926−1945 (in Chinese) doi: 10.7544/issn1000-1239.202330553