Survey on Traffic Management in Lossless Networks

张乙然; 王尚广; 任丰原

doi:10.7544/issn1000-1239.202440096

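Abstract: In recent years, lossless networks have been widely deployed in high-performance computing, data centers, and other fields. Lossless networks rely on link-layer flow control to guarantee that in-network switches never drop packets because of buffer overflow, which avoids data loss and retransmission and greatly improves application latency and throughput. However, the side effects of link-layer flow control (congestion spreading, deadlock, etc.) pose many challenges to the large-scale deployment of lossless networks. Consequently, introducing traffic management techniques to improve the scalability of lossless networks has drawn growing attention. This paper systematically surveys research progress on traffic management for InfiniBand and lossless Ethernet, the typical lossless networks used in high-performance computing and data centers. It first introduces the negative effects of link-layer flow control and the goals of traffic management, and summarizes the traditional traffic management architecture of lossless networks. It then classifies and describes the latest research on traffic management for InfiniBand and lossless Ethernet according to the technical approach (congestion control, congestion isolation, multipath load balancing, etc.) and the location that drives it (sender-driven, receiver-driven, etc.), and analyzes the corresponding advantages and limitations. Finally, it identifies problems that deserve particular attention in future research, including a unified traffic management architecture for lossless networks, joint traffic management across the host interior and the network, and traffic management for domain-specific applications.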

Abstract: Lossless networks are increasingly used in high-performance computing (HPC), data centers, and other fields. They use link-layer flow control to ensure that switches do not drop packets because of buffer overflow, which avoids packet loss and retransmission and greatly improves application latency and throughput. However, the side effects of link-layer flow control (congestion spreading, deadlock, etc.) pose challenges for the large-scale deployment of lossless networks. Therefore, traffic management techniques that improve the scalability of lossless networks have received considerable attention. We systematically review research progress on traffic management in InfiniBand and lossless Ethernet, the typical lossless networks used in HPC and data centers. First, we introduce the negative effects of link-layer flow control and the goals of traffic management, and summarize the traditional traffic management architecture of lossless networks. Then, according to the technical approach (congestion control, congestion isolation, load balancing, etc.) and the driving location (sender-driven, receiver-driven, etc.), we classify and describe the latest research on traffic management for InfiniBand and lossless Ethernet and analyze the corresponding advantages and limitations. Finally, we point out issues that need further exploration in lossless network traffic management, including a unified traffic management architecture, joint traffic management within the host and the network, and traffic management for domain-specific applications.

Survey on Traffic Management in Lossless Networks