Citation: Chen Shuping, Wei Hongmei, Wang Fei, Li Yi, He Wangquan, Qi Fengbin. Method to Create Aggregate Tree for Hardware Supported Collectives[J]. Journal of Computer Research and Development, 2024, 61(2): 503-517. DOI: 10.7544/issn1000-1239.202220684
Traditional MPI (Message Passing Interface) collectives are implemented with point-to-point messages and perform poorly. Hardware-supported collectives have attracted growing attention because of their high performance and low CPU utilization. The aggregate tree has a crucial impact on the performance of hardware-supported collectives. We study the factors that affect this performance, and propose a cost model for hardware-supported collectives together with an efficient method for creating aggregate trees. The method has three parts. First, we choose the aggregate tree type and breadth according to the operation type and the size of the aggregate messages, trading off network transmission time against data processing time. Second, we propose a method for creating a hierarchical minimum-height aggregate tree of type Ⅰ, which reduces the number of inter-group communications. Third, we propose a method for creating a minimum-cost aggregate tree of type Ⅱ, which minimizes the number of switches used. We evaluate the proposals on the Sunway interconnection network. In the presence of network noise, the hierarchical minimum-height aggregate tree of type Ⅰ reduces message latency by 24%–89% compared with the traditional method. For typical communication patterns, the minimum-cost aggregate tree of type Ⅱ uses 90% fewer aggregate entries than the traditional method.
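The breadth tradeoff described in the first part can be illustrated with a toy calculation. The sketch below is not the cost model proposed in the paper, which is not given in this abstract. It assumes a complete tree in which every level costs one network hop plus a reduction over `breadth` child messages. The function names and all constants (`hop_latency`, `per_byte_transfer`, `per_byte_reduce`) are hypothetical values chosen only for illustration.

```python
def tree_height(n_leaves: int, breadth: int) -> int:
    """Smallest height h such that breadth**h >= n_leaves."""
    height, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= breadth
        height += 1
    return height


def aggregate_cost(n_leaves: int, breadth: int, msg_bytes: int,
                   hop_latency: float = 1.0,
                   per_byte_transfer: float = 0.001,
                   per_byte_reduce: float = 0.002) -> float:
    """Toy cost (hypothetical units): height * (transfer + processing).

    Transfer per level: one hop latency plus the time to move one message.
    Processing per level: each node reduces `breadth` child messages.
    """
    level_transfer = hop_latency + msg_bytes * per_byte_transfer
    level_process = breadth * msg_bytes * per_byte_reduce
    return tree_height(n_leaves, breadth) * (level_transfer + level_process)


def best_breadth(n_leaves: int, msg_bytes: int, max_breadth: int = 16) -> int:
    """Breadth in [2, max_breadth] that minimizes the toy cost."""
    return min(range(2, max_breadth + 1),
               key=lambda k: aggregate_cost(n_leaves, k, msg_bytes))


if __name__ == "__main__":
    for size in (8, 65536):
        print(size, best_breadth(1024, size))
```

Under these assumed constants, small messages favor a wide, shallow tree because hop latency dominates, while large messages favor a narrower tree because per-node processing grows with the breadth. This is the kind of tradeoff the tree type and breadth selection is meant to balance.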