Abstract:
By applying the overlapped computation and communication mode, a multi-round scheduling divisible loads model is proposed on the heterogeneous cluster that each node has multiple multi-core processors and shared L3 cache, in which master node executes multiple processes to send concurrently data to all the slave nodes. The presented scheduling model can adapt to different computation speed, communication rate and memory capacity of each node, compute dynamically the number of scheduling rounds and the size of the data block to be assigned to each node in each round scheduling to balance the loads among nodes, and reduce the communication overhead and shorten the scheduling length. According to the usable capacity of L1 cache, L2 cache and L3 cache in each node, this paper presents a multi-level cache partitioning model for the received load block in main memory of each node to balance the loads among multiple multi-core processors and the loads among processing cores. By applying the presented multi-round loads scheduling model and multi-level data allocation method, the communication and cache-efficient parallel algorithms for solving k-selection problem are designed on the heterogeneous cluster that each node has multiple multi-core processors. The execution performance of the presented parallel algorithm is evaluated on Sugon TC5000A cluster system by different methods that master node sends data to salve nodes, different use rate for L3 cache, L2 cache and L1 cache, and different number of running threads. The experimental results show that when master node sends concurrently the data to salve nodes, the execution performance of the parallel k-selection algorithm is superior to the performance of the algorithm when master node sends serially the data to salve nodes in each round scheduling; the use rate of L3 cache and L2 cache will impact remarkably the performance of the algorithm; when each core runs 3 threads using the optimized combination value of utilization rate of L3 cache, L2 cache and L1 cache, the required execution time of the algorithm is the shortest.