
    Collaborative MoE Inference System for Edge Computility Networks

    Collaborative MoE Inference over Edge Computility Networks

    • Abstract: The mixture-of-experts (MoE) architecture has become a widely recognized paradigm for scaling large language models and has been adopted by many frontier models. However, MoE models have massive parameter counts, and their memory footprint often exceeds the capacity of a single edge platform, making deployment on resource-constrained edge devices highly challenging. Existing local MoE serving systems adopt expert-swapping strategies that offload inactive experts to secondary storage and load them on demand. As the context windows supported by large language models keep growing and user inputs become correspondingly longer, the prefilling phase faces a critical bottleneck: long input sequences activate nearly all experts, and the resulting full-scale expert swapping generates heavy I/O traffic, significantly prolonging the time to first token (TTFT). In typical edge scenarios such as smart homes and smart factories, various trusted terminals, including tablets, laptops, and assorted smart home devices, are interconnected over a local area network and together form a pervasive edge computility network (ECN). Each ECN node carries substantial idle compute, memory capacity, and I/O bandwidth, offering an important opportunity to pool distributed resources and break through single-device bottlenecks. Sirius is a collaborative MoE inference system for edge computility networks that aggregates these resources through coordinated cross-node scheduling to accelerate the prefilling phase and reduce TTFT. Specifically, three key techniques are designed: 1) an interleaved sequence partitioning strategy that resolves workload imbalance in causal attention computation and maximizes communication-computation overlap; 2) an adaptive token scheduling mechanism that jointly considers computation overhead, swap latency, and expert buffer cache states to fully utilize collective I/O bandwidth while achieving load balance; and 3) a speculative expert prefetching scheme that exploits the high similarity of hidden states between adjacent layers to predict expert activations ahead of time, overlapping expert loading with computation and effectively hiding I/O latency. Experiments on a real edge computility network platform with Mixtral-8x7B and Qwen3-30B-A3B show that Sirius achieves a 1.2x–4.0x inference throughput improvement over existing baselines.
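The interleaved sequence partitioning idea can be illustrated with a minimal sketch (a hypothetical simplification for exposition, not the paper's actual implementation): with n nodes, the input sequence is split into 2n equal chunks, and node i receives chunk i plus the mirrored chunk 2n-1-i, so that each node's causal-attention workload (the number of key positions its query tokens must attend to) is equalized rather than skewed toward nodes holding later tokens.

```python
def interleaved_partition(seq_len, n_nodes):
    """Split positions [0, seq_len) into 2*n_nodes equal chunks and give
    node i the i-th chunk plus the mirrored (2*n_nodes-1-i)-th chunk."""
    assert seq_len % (2 * n_nodes) == 0
    chunk = seq_len // (2 * n_nodes)
    parts = []
    for i in range(n_nodes):
        front = list(range(i * chunk, (i + 1) * chunk))
        mirror = 2 * n_nodes - 1 - i
        back = list(range(mirror * chunk, (mirror + 1) * chunk))
        parts.append(front + back)
    return parts

def causal_workload(token_positions):
    # Under causal attention, the query at position t attends to t+1 keys.
    return sum(t + 1 for t in token_positions)

parts = interleaved_partition(16, 2)
print([causal_workload(p) for p in parts])       # balanced: [68, 68]
contig = [list(range(0, 8)), list(range(8, 16))]
print([causal_workload(p) for p in contig])      # skewed: [36, 100]
```

With a naive contiguous split, the node holding the tail of the sequence does nearly three times the attention work of the node holding the head; the interleaved split makes the two workloads identical, which is what enables the communication-computation overlap described above.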

       

      Abstract: Mixture-of-Experts (MoE) architectures have become a widely recognized paradigm for scaling large language models (LLMs) and have been adopted by many frontier models. However, deploying MoE-based LLMs on resource-constrained edge devices remains challenging due to their massive parameter sizes, which often exceed the memory capacity of a single edge platform. Existing in-situ MoE serving systems adopt expert-swapping strategies that offload inactive experts to secondary storage and load them on demand. As LLMs support increasingly long context windows and user inputs grow correspondingly longer, the prefilling phase faces a critical bottleneck: long input sequences activate nearly all experts, and the resulting full-scale expert swapping generates substantial I/O traffic, leading to prolonged time-to-first-token (TTFT) latency. Typical edge environments, such as smart homes and smart factories, comprise multiple trusted, interconnected devices, including tablets, laptops, and smart home appliances connected via local area networks. These devices collectively form a pervasive edge computility network (ECN), in which each node carries underutilized computing power, memory capacity, and I/O bandwidth. This motivates harnessing the aggregated resources of the ECN through coordinated cross-node scheduling. This paper presents Sirius, a collaborative MoE inference system for edge computility networks that pools the collective resources across ECN nodes to accelerate the prefilling phase and reduce TTFT. Specifically, Sirius proposes three key techniques: (1) an interleaved sequence partitioning strategy that resolves workload imbalance in causal attention computation and maximizes communication-computation overlap; (2) an adaptive token scheduling mechanism that jointly considers computation overhead, swap latency, and expert buffer cache states to fully utilize collective I/O bandwidth while achieving load balance; and (3) a speculative expert prefetching scheme that exploits the high similarity between adjacent-layer hidden states to overlap expert loading with computation, effectively hiding I/O latency. Experiments on real-world edge computility network platforms with Mixtral-8x7B and Qwen3-30B-A3B demonstrate that Sirius achieves a 1.2×–4.0× inference throughput improvement over state-of-the-art baselines.
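The adaptive token scheduling mechanism can be sketched as a greedy cost-based assignment (the function names, the linear cost model, and the `COMPUTE`/`SWAP` constants here are illustrative assumptions, not the system's actual policy): each token's required experts are compared against every node's expert buffer cache, and the token goes to the node minimizing estimated compute cost plus swap latency for cache-missing experts, with the node's accumulated load keeping assignments balanced.

```python
def schedule_tokens(token_experts, nodes):
    """Greedily assign each token to the cheapest node.

    token_experts: list of sets of expert ids required per token.
    nodes: dict node_id -> {"cache": set of cached expert ids, "load": float}.
    Assumed cost model: fixed compute cost per token, plus a swap penalty
    per required expert missing from the node's buffer cache, plus the
    node's current load (to spread work across the ECN).
    """
    COMPUTE, SWAP = 1.0, 5.0  # illustrative relative costs
    assignment = []
    for experts in token_experts:
        cost = lambda n: (nodes[n]["load"] + COMPUTE
                          + SWAP * len(experts - nodes[n]["cache"]))
        best = min(nodes, key=cost)
        nodes[best]["load"] += COMPUTE + SWAP * len(experts - nodes[best]["cache"])
        nodes[best]["cache"] |= experts  # swapped-in experts are now cached
        assignment.append(best)
    return assignment

nodes = {"A": {"cache": {0}, "load": 0.0}, "B": {"cache": {1}, "load": 0.0}}
print(schedule_tokens([{0}, {1}, {0}], nodes))  # ['A', 'B', 'A']
```

Tokens routed to experts already resident on a node avoid swap traffic entirely, so the scheduler naturally steers tokens toward nodes whose caches match their expert activations while the load term prevents any single node from becoming a hotspot.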
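The speculative expert prefetching scheme can likewise be sketched in a few lines (a hedged illustration; the routing here is a plain dot-product top-k, and all names are assumptions): because the hidden state entering layer l+1 is highly similar to the one produced at layer l, running layer l+1's router on layer l's output usually predicts the experts layer l+1 will actually activate, so those experts can be loaded from storage while layer l is still computing.

```python
import heapq

def topk_experts(hidden, router_weights, k):
    """Route a hidden vector: the logit for expert e is dot(hidden, router_weights[e])."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    return set(heapq.nlargest(k, range(len(logits)), key=logits.__getitem__))

def speculative_prefetch(hidden_l, router_next, k):
    # Run the NEXT layer's router on the CURRENT layer's output; if the
    # prediction matches, the next layer's experts are already in memory
    # by the time they are needed, hiding the I/O latency behind compute.
    return topk_experts(hidden_l, router_next, k)

router_next = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy 3-expert router
predicted = speculative_prefetch([1.0, 0.0], router_next, k=2)
actual = topk_experts([0.9, 0.1], router_next, k=2)  # similar next-layer input
print(predicted == actual)  # True: the speculation hits
```

A mispredicted expert simply falls back to an on-demand load, so speculation is a best-effort optimization: a hit hides the swap latency entirely, while a miss costs no more than the baseline.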

       
