
    Bandwidth-Efficient Edge-Cloud Collaborative Inference for Evolving Large Language Models

    • Abstract: Deploying large language models (LLMs) in mobile and edge computing environments faces several challenges: limited on-device compute and storage, scarce wireless bandwidth, and the continual evolution of cloud-side models. Although edge-cloud collaborative inference based on speculative decoding can reduce end-to-end generation latency by pairing a lightweight draft model at the edge with a target model in the cloud for parallel verification, existing methods usually require the two models to be tightly coupled. In practice, frequent updates of the cloud model then force repeated synchronization of the edge-side draft model, leading to substantial communication overhead, higher latency, and limited scalability. To address this issue, ECSpec, a communication-efficient edge-cloud collaborative inference framework for model-evolving scenarios, is proposed. Its key idea is a shared backbone architecture that keeps a single static edge-side draft model compatible with a family of continuously evolving cloud-side target models, thereby avoiding repeated retraining and weight downloads at the edge and significantly reducing communication and maintenance costs. In addition, a channel-aware adaptive speculation mechanism dynamically adjusts the draft length according to real-time channel conditions and energy budgets, achieving a better tradeoff between inference efficiency and energy consumption. Experimental results show that ECSpec delivers more stable and efficient collaborative LLM inference in communication-constrained edge environments.
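      The channel-aware adaptive speculation described above can be illustrated with a minimal sketch. The heuristic below is an assumption for illustration only, not ECSpec's actual algorithm: the function name, the per-token energy cost model, and the bandwidth thresholds are all hypothetical. The intuition matches the abstract: when bandwidth is low, longer drafts amortize the costly edge-to-cloud round trip, while the energy budget caps how many tokens the edge device may draft per round.

```python
def choose_draft_length(bandwidth_mbps: float,
                        energy_budget_j: float,
                        energy_per_token_j: float = 0.05,
                        min_len: int = 1,
                        max_len: int = 8) -> int:
    """Hypothetical heuristic for picking a speculative draft length.

    Longer drafts amortize the uplink round trip under poor channels,
    but each drafted token spends edge-side energy, so the energy
    budget imposes a hard cap per speculation round.
    """
    # Maximum tokens the energy budget allows in this round.
    energy_cap = int(energy_budget_j / energy_per_token_j)

    # Channel preference: scarcer bandwidth favors longer drafts,
    # since fewer round trips are then needed per generated token.
    if bandwidth_mbps >= 50:
        channel_pref = 2
    elif bandwidth_mbps >= 10:
        channel_pref = 4
    else:
        channel_pref = 8

    return max(min_len, min(max_len, channel_pref, energy_cap))
```

      For example, a well-provisioned link with an ample energy budget would yield a short draft (fewer wasted edge tokens on rejection), whereas a constrained link with the same budget would yield the maximum draft length to amortize latency.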

       
