Survey of RDMA Congestion Control Techniques for Data Center Networks

张毓涛; 杨惠; 李韬; 黄曼蒂; 李天云; 孙志刚

doi:10.7544/issn1000-1239.202440623

Abstract

Congestion control is one of the key technologies for building high-performance data center networks; it affects important performance metrics such as throughput, latency, and packet loss rate. Over the past 20 years, as data centers have grown in scale and upper-layer applications have demanded ever higher network performance, the deployment of remote direct memory access (RDMA) over lossless underlying networks has attracted widespread attention in industry. However, the priority-based flow control (PFC) mechanism that keeps the network lossless also introduces problems such as head-of-line blocking, which degrade network performance and can even paralyze the network. Designing a practical RDMA congestion control mechanism, a crucial auxiliary means of achieving a lossless network, has therefore become a hot research topic. This paper divides the congestion control process into congestion awareness and congestion regulation and comprehensively reviews the research results in the field. First, it describes and summarizes representative congestion awareness algorithms from the perspectives of explicit feedback and delay. Second, it introduces representative congestion regulation algorithms along the dimensions of rate and window and summarizes their advantages and disadvantages. It then covers optimizations of some of these algorithms, as well as congestion control algorithms based on reinforcement learning. Finally, it summarizes and discusses the open challenges in the field.
