Bandwidth-Efficient Edge-Cloud Collaborative Inference for Evolving Large Language Models
Abstract
Deploying large language models (LLMs) in mobile and edge computing environments faces multiple challenges, including limited on-device resources, scarce wireless bandwidth, and the continual evolution of cloud-side models. Although speculative-decoding-based edge-cloud collaborative inference can reduce end-to-end generation latency by running a lightweight draft model at the edge and a target model in the cloud for parallel verification, existing methods typically assume a tight coupling between the two models. In practical systems, frequent updates to the cloud model therefore force repeated synchronization of the edge-side draft model, incurring substantial communication overhead, higher latency, and limited scalability. To address this issue, ECSpec is proposed as a communication-efficient edge-cloud collaborative inference framework for model-evolving scenarios. Its key idea is a shared-backbone architecture that keeps a single static edge-side draft model compatible with a family of continuously evolving cloud-side target models, thereby avoiding repeated retraining or weight downloads at the edge and significantly reducing communication and maintenance costs. In addition, a channel-aware adaptive speculation mechanism is introduced to dynamically adjust the draft length according to real-time channel conditions and energy budgets, achieving a better trade-off between inference efficiency and energy consumption. Experimental results demonstrate that ECSpec delivers more stable and efficient collaborative LLM inference in communication-constrained edge environments.
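To make the channel-aware adaptive speculation idea concrete, the following is a minimal illustrative sketch, not ECSpec's actual algorithm: it assumes an i.i.d. per-token acceptance probability (the standard speculative-decoding approximation) and picks the draft length that maximizes expected accepted tokens per second, subject to a per-round energy budget. All function names and cost parameters here are hypothetical.

```python
def expected_accepted(alpha: float, k: int) -> float:
    # Expected number of tokens accepted per verification round when the
    # edge draft model proposes k tokens and each token is accepted
    # independently with probability alpha (geometric-series closed form).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def choose_draft_length(alpha: float, bandwidth_bps: float, bits_per_token: float,
                        t_draft: float, t_verify: float,
                        energy_budget_j: float, energy_per_token_j: float,
                        k_max: int = 16) -> int:
    # Pick the draft length k that maximizes expected accepted tokens per
    # second for the current uplink bandwidth, while keeping edge-side
    # drafting energy within the per-round budget.
    best_k, best_rate = 1, 0.0
    for k in range(1, k_max + 1):
        if k * energy_per_token_j > energy_budget_j:
            break  # drafting k tokens would exceed the energy budget
        t_comm = k * bits_per_token / bandwidth_bps   # uplink time for k drafts
        t_round = k * t_draft + t_comm + t_verify     # one speculation round
        rate = expected_accepted(alpha, k) / t_round  # accepted tokens / second
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k
```

Under this model, a degraded channel inflates the per-token upload cost, so the optimizer naturally shortens the draft; a fast channel amortizes the fixed verification latency over longer drafts.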