Abstract:
Traditional MPI (Message Passing Interface) collectives are implemented with point-to-point messages and perform poorly. Hardware-supported collectives have attracted growing attention because they offer high performance with low CPU utilization. The aggregate tree has a crucial impact on the performance of hardware-supported collectives. We study the factors that affect this performance, and propose a cost model for hardware-supported collectives together with an efficient method for creating aggregate trees. The method has three parts. First, we choose the aggregate tree type and breadth according to the operation type and the size of the aggregate messages, so as to balance network transmission time against data processing time. Second, we propose a method for creating a hierarchical minimum-height aggregate tree of type Ⅰ, which reduces the amount of inter-group communication. Third, we put forward a method for creating a minimum-cost aggregate tree of type Ⅱ, which minimizes the number of switches used. We evaluate the proposals on the Sunway interconnection network. In the presence of network noise, the message latency of the hierarchical minimum-height aggregate tree of type Ⅰ is 24% to 89% lower than that of the traditional method. For typical communication patterns, the minimum-cost aggregate tree of type Ⅱ uses 90% fewer aggregate entries than the traditional method.
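
To illustrate why a hierarchical aggregate tree can reduce inter-group communication, the following is a minimal sketch and not the construction method proposed in this paper. It assumes a hypothetical placement in which each node belongs to one network group, and compares a flat k-ary tree built in rank order with a tree built inside each group first and then across the group roots. All function names and the round-robin placement are assumptions made for illustration.

```python
from collections import defaultdict


def kary_tree(nodes, k):
    """Return {child: parent} for a complete k-ary tree over nodes in list order."""
    return {nodes[i]: nodes[(i - 1) // k] for i in range(1, len(nodes))}


def hierarchical_tree(group_of, k):
    """Build a k-ary tree inside each group, then a k-ary tree over the group roots.

    Simplification: a group root takes part in both trees, so its fan-out
    can exceed k. A real builder would have to account for that.
    """
    members = defaultdict(list)
    for node in sorted(group_of):
        members[group_of[node]].append(node)
    parent = {}
    roots = []
    for group in sorted(members):
        parent.update(kary_tree(members[group], k))
        roots.append(members[group][0])
    parent.update(kary_tree(roots, k))
    return parent


def inter_group_edges(parent, group_of):
    """Count tree edges whose two endpoints lie in different groups."""
    return sum(1 for child, par in parent.items() if group_of[child] != group_of[par])


def height(parent):
    """Return the number of edges on the longest leaf-to-root path."""
    def depth(node):
        d = 0
        while node in parent:
            node = parent[node]
            d += 1
        return d
    return max(depth(node) for node in set(parent) | set(parent.values()))


# Hypothetical placement: 16 nodes spread round-robin over 4 groups.
group_of = {node: node % 4 for node in range(16)}
flat = kary_tree(list(range(16)), 4)
hier = hierarchical_tree(group_of, 4)

print("flat:", inter_group_edges(flat, group_of), "inter-group edges, height", height(flat))
print("hierarchical:", inter_group_edges(hier, group_of), "inter-group edges, height", height(hier))
```

Under this assumed placement both trees have the same height, but the hierarchical tree crosses group boundaries only once per non-root group, whereas the flat tree crosses them on most edges.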