A QoS Feedback Control Mechanism for Wide-Area Federated Computility Networks

周钰雯; 任棒棒; 何瑞; 郭得科

doi:10.7544/issn1000-1239.202660112

Snapshot-Guided Feedback Control for QoS-Aware Wide-area Computility Federation

Abstract: Computility federations coordinate heterogeneous computing resources across domains while preserving the administrative autonomy of each independent provider. This autonomy limits federation-level observability: domain gateways cannot see real-time intra-domain state when making routing decisions and must rely on coarse-grained, delayed performance summaries. Under such weak and lagged observability, deadline-miss risk for latency-critical tasks can become structurally concentrated in a small number of domains and time windows. Even when the federation-wide average load remains moderate, individual domains can face sharply elevated risk exposure during occasional overload windows. To address this problem, we propose Per-Domain SLO Feedback (PDSF), a low-dimensional closed-loop mechanism that requires no changes to domain-local schedulers. PDSF uses window-level statistical summaries to update bounded routing tolls for latency-critical tasks, thereby biasing snapshot-driven self-routing. Theoretical analysis shows that, under a weak monotonicity assumption on the routing response, PDSF exhibits boundedness, forgetting, and local stability under delayed feedback. Trace-driven experiments on a multi-domain emulator using Google cluster workload traces show that, compared with deployable baselines, PDSF significantly alleviates upper-tail overload, reduces worst-domain risk exposure, and protects latency-critical tasks without noticeable degradation of best-effort tasks. Sensitivity analysis further shows that snapshot staleness and routing concentration jointly determine the mechanism’s operating boundary, and clarifies how stale measurements and highly reactive routing amplify system risk and shrink the effective operating range of low-dimensional feedback.
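
The abstract describes PDSF only at a high level, so the following is a minimal illustrative sketch of the general idea and not the authors' algorithm. It assumes a leaky, clamped toll update per domain and a softmax routing response. The function names, the update rule, and the parameters `gain`, `decay`, `toll_max`, and `temperature` are all hypothetical; `temperature` stands in loosely for routing concentration.

```python
import math


def update_toll(toll, miss_rate, target, gain=0.5, decay=0.9, toll_max=5.0):
    """One window-level toll update for a single domain (hypothetical rule).

    decay < 1 lets old observations fade (a simple form of forgetting);
    the clamp keeps the toll inside [0, toll_max] (boundedness).
    """
    raw = decay * toll + gain * (miss_rate - target)
    return min(toll_max, max(0.0, raw))


def routing_weights(snapshot_costs, tolls, temperature=1.0):
    """Softmax routing shares over snapshot cost plus toll (hypothetical).

    A lower temperature concentrates traffic on the cheapest domain.
    Raising one domain's toll lowers its share, which is one simple
    routing response that satisfies a monotonicity property.
    """
    scores = [-(c + t) / temperature for c, t in zip(snapshot_costs, tolls)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two domains with identical stale snapshot costs; domain 0 keeps
    # missing its SLO target, so its toll rises and its share shrinks.
    tolls = [0.0, 0.0]
    for _ in range(5):
        tolls[0] = update_toll(tolls[0], miss_rate=0.30, target=0.05)
        tolls[1] = update_toll(tolls[1], miss_rate=0.02, target=0.05)
        print(tolls, routing_weights([1.0, 1.0], tolls))
```

In this sketch the gateway still routes from the stale snapshot; the toll only biases that choice, so the domain-local schedulers stay untouched, in line with the constraint stated in the abstract.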
