Abstract:
Lossless networks are increasingly used in high performance computing (HPC), data centers, and other fields. They rely on link layer flow control to ensure that switches do not drop packets because of buffer overflow, which avoids retransmissions caused by packet loss and greatly improves application latency and throughput. However, the side effects of link layer flow control (congestion spreading, deadlock, etc.) pose challenges to the large-scale deployment of lossless networks. As a result, using traffic management to improve the scalability of lossless networks has received great attention. We systematically review the research progress of traffic management in the typical lossless networks used in HPC and data centers, including InfiniBand and lossless Ethernet. First, we introduce the negative impact of link layer flow control and the goals of traffic management, and summarize the traditional traffic management architecture of lossless networks. Then, according to the technical route of traffic management (congestion control, congestion isolation, load balancing, etc.) and the location that drives it (sender-driven, receiver-driven, etc.), we classify and elaborate on the latest research progress in InfiniBand and lossless Ethernet traffic management, and analyze the corresponding advantages and limitations. Finally, we point out the issues that need to be explored in further research on lossless network traffic management, including a unified architecture for traffic management, joint congestion management within the host and the network, and traffic management for domain-specific applications.