Bandwidth-Efficient Edge-Cloud Collaborative Inference for Evolving Large Language Models
Abstract
Deploying large language models (LLMs) in mobile and edge computing environments faces multiple challenges, including limited on-device resources, scarce wireless bandwidth, and the continual evolution of cloud-side models. Although speculative-decoding-based edge-cloud collaborative inference can reduce end-to-end generation latency by running a lightweight draft model at the edge and a target model in the cloud for parallel verification, existing methods typically assume a tight coupling between the two models. In practical systems, frequent updates to the cloud model therefore force repeated synchronization of the edge-side draft model, incurring substantial communication overhead, higher latency, and limited scalability. To address this issue, ECSpec is proposed as a communication-efficient edge-cloud collaborative inference framework for model-evolving scenarios. Its key idea is a shared-backbone architecture that keeps a single static edge-side draft model compatible with a family of continuously evolving cloud-side target models, thereby avoiding repeated retraining or weight downloads at the edge and significantly reducing communication and maintenance costs. In addition, a channel-aware adaptive speculation mechanism is introduced to dynamically adjust the draft length according to real-time channel conditions and energy budgets, achieving a better trade-off between inference efficiency and energy consumption. Experimental results demonstrate that ECSpec delivers more stable and efficient collaborative LLM inference in communication-constrained edge environments.
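To make the channel-aware adaptive speculation idea concrete, the following is a minimal illustrative sketch, not ECSpec's actual algorithm: it assumes an i.i.d. per-token acceptance probability (the standard speculative-decoding approximation) and picks the draft length that maximizes expected accepted tokens per second, subject to a per-round energy budget. All function names and cost parameters here are hypothetical.

```python
def expected_accepted(alpha: float, k: int) -> float:
    # Expected number of tokens accepted per verification round when the
    # edge draft model proposes k tokens and each token is accepted
    # independently with probability alpha (geometric-series closed form).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def choose_draft_length(alpha: float, bandwidth_bps: float, bits_per_token: float,
                        t_draft: float, t_verify: float,
                        energy_budget_j: float, energy_per_token_j: float,
                        k_max: int = 16) -> int:
    # Pick the draft length k that maximizes expected accepted tokens per
    # second for the current uplink bandwidth, while keeping edge-side
    # drafting energy within the per-round budget.
    best_k, best_rate = 1, 0.0
    for k in range(1, k_max + 1):
        if k * energy_per_token_j > energy_budget_j:
            break  # drafting k tokens would exceed the energy budget
        t_comm = k * bits_per_token / bandwidth_bps   # uplink time for k drafts
        t_round = k * t_draft + t_comm + t_verify     # one speculation round
        rate = expected_accepted(alpha, k) / t_round  # accepted tokens / second
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k
```

Under this model, a degraded channel inflates the per-token upload cost, so the optimizer naturally shortens the draft; a fast channel amortizes the fixed verification latency over longer drafts.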