Citation: Ye Shengyuan, Liang Han, Chen Xu. Collaborative MoE Inference over Edge Computility Networks[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202660131

    Collaborative MoE Inference over Edge Computility Networks

    Abstract: Mixture-of-Experts (MoE) architectures have become a widely recognized paradigm for scaling large language models (LLMs) and have been adopted by many frontier models deployed at the network edge. However, deploying MoE-based LLMs on resource-constrained edge devices remains challenging because their massive parameter sizes often exceed the memory capacity of an individual platform. Existing in-situ MoE serving systems adopt expert swapping strategies that offload inactive experts to secondary storage and load them on demand. As LLMs support increasingly longer context windows with correspondingly longer user inputs, the prefilling phase faces a critical bottleneck: long input sequences activate nearly all experts, and the resulting full-scale expert swapping generates substantial I/O traffic, leading to prolonged Time-to-First-Token (TTFT) latency. Typical edge environments, such as smart homes and smart factories, comprise multiple trusted interconnected devices, including tablets, laptops, and smart home appliances connected via local area networks. These devices collectively form a pervasive edge computility network (ECN) in which each node carries underutilized computing power, memory capacity, and I/O bandwidth, motivating us to harness the aggregated resources of the ECN through coordinated cross-node scheduling. This paper presents Sirius, a collaborative MoE inference system for edge computility networks that pools the collective resources across ECN nodes to accelerate the prefilling phase and reduce TTFT. Specifically, Sirius proposes three key techniques: (1) an interleaved sequence partitioning strategy that resolves workload imbalance in causal attention computation and maximizes communication-computation overlap; (2) an adaptive token scheduling mechanism that jointly considers computation overhead, swap latency, and expert buffer cache states to fully utilize collective I/O bandwidth while achieving load balance; and (3) a speculative expert prefetching scheme that exploits the high similarity between adjacent-layer hidden states to overlap expert loading with computation, effectively hiding I/O latency. Experiments on real-world edge computility network platforms with Mixtral-8x7B and Qwen3-30B-A3B demonstrate that Sirius achieves a 1.2×–4.0× inference throughput improvement over state-of-the-art baselines.
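    The intuition behind the first technique can be shown with a minimal sketch. This is an illustration of the general idea, not the paper's implementation; the function names and the 4096-token, 4-node setting are assumptions chosen for the example. In causal attention, the token at position i attends to i + 1 keys, so per-token cost grows linearly with position: a contiguous split piles the expensive tail tokens onto the last device, whereas an interleaved (strided) assignment equalizes per-device attention work.

```python
# Illustrative sketch: why interleaved sequence partitioning balances
# causal-attention work across nodes (hypothetical helpers, not Sirius code).

def contiguous_partition(seq_len: int, num_devices: int) -> list[list[int]]:
    """Assign consecutive token positions to each device."""
    chunk = (seq_len + num_devices - 1) // num_devices
    return [list(range(d * chunk, min((d + 1) * chunk, seq_len)))
            for d in range(num_devices)]

def interleaved_partition(seq_len: int, num_devices: int) -> list[list[int]]:
    """Assign token positions round-robin with stride num_devices."""
    return [list(range(d, seq_len, num_devices)) for d in range(num_devices)]

def attention_cost(positions: list[int]) -> int:
    # Causal attention: the token at position i attends to i + 1 keys.
    return sum(i + 1 for i in positions)

if __name__ == "__main__":
    seq_len, num_devices = 4096, 4  # assumed example sizes
    for name, parts in [("contiguous", contiguous_partition(seq_len, num_devices)),
                        ("interleaved", interleaved_partition(seq_len, num_devices))]:
        costs = [attention_cost(p) for p in parts]
        print(f"{name:>11}: per-device cost = {costs}, "
              f"imbalance = {max(costs) / min(costs):.2f}x")
```

    In this toy setting the contiguous split leaves the busiest device with roughly 7× the attention work of the idlest one, while the interleaved split is balanced to within a fraction of a percent, which is what allows evenly sized per-node computation to be overlapped with communication during prefill.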