
    Divisible Load Scheduling Using Concurrent Data Sending from the Master Node on a Multi-Core Cluster

    • Abstract: For cluster systems whose nodes differ in computation, communication, and memory capability, and in which each node comprises multiple multi-core processors (chip multiprocessors) sharing an L3 cache, this paper adopts an overlapped computation-and-communication mode and proposes a divisible load scheduling model in which the master node uses multiple processes to send data to all slave nodes concurrently. The model adapts to each node's computation speed, communication rate, and memory capacity, dynamically computing the number of scheduling rounds and the size of the load block assigned to each slave node in each round, so as to balance the computational load across nodes, reduce inter-node communication overhead, and shorten the schedule length. According to the usable capacities of the L3, L2, and L1 caches in each node, a multi-level cache partitioning method is presented for the load block received into a node's main memory, ensuring load balance among the node's multi-core processors and among their processing cores. Based on the proposed inter-node scheduling model and intra-node multi-level data allocation method, communication- and cache-efficient parallel k-selection algorithms are designed and implemented on a heterogeneous cluster whose nodes each contain multiple multi-core processors. On a Sugon TC5000A multi-core cluster system, the influence on performance of parallel versus serial data sending from the master node, of the utilization rate of each cache level, and of the number of threads run per core was measured. The experimental results show that the parallel k-selection algorithm built on the concurrent-sending scheduling model outperforms the one built on serial sending in each scheduling round; that the utilization rates of the L3 and L2 caches affect performance considerably; and that the algorithm's execution time is shortest when the L3, L2, and L1 cache utilization rates take their optimized combination and each core runs 3 threads.
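The concurrent-sending idea in the scheduling model can be illustrated with a small sketch: the master spawns one sender process per slave node so that each round's load blocks leave the master at the same time instead of one slave after another. This is an illustration under stated assumptions, not the paper's implementation: the names `send_block` and `concurrent_send` are hypothetical, and a real system would replace the queue write with an actual transfer (e.g. an MPI send to the slave node).

```python
# Hypothetical sketch of the master node's concurrent send: one process per
# slave node transmits that slave's load block for the current round.
from multiprocessing import Process, Queue


def send_block(slave_id, block, done):
    # Stand-in for the real transfer to the slave node (e.g. an MPI send);
    # here we only report how many bytes "arrived".
    done.put((slave_id, len(block)))


def concurrent_send(blocks):
    """blocks: {slave_id: data} as computed by the scheduler for one round.
    Returns {slave_id: bytes_received} once every sender process finishes."""
    done = Queue()
    procs = [Process(target=send_block, args=(sid, blk, done))
             for sid, blk in blocks.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return dict(done.get() for _ in procs)


if __name__ == "__main__":
    # Three slaves receive blocks of different sizes, all in the same round.
    print(concurrent_send({0: b"x" * 400, 1: b"y" * 250, 2: b"z" * 150}))
```

With serial sending, slave 2 would wait for the transfers to slaves 0 and 1 to complete; with concurrent sending all three transfers overlap, which is the effect the experiments measure.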
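The intra-node multi-level data allocation can be sketched in the same spirit: the load block received into a node's main memory is first split among the node's multi-core processors in proportion to each processor's usable L3 capacity, and each processor's share is then split among its cores. A minimal sketch, assuming proportional splitting (the function names and the even per-core split are our simplification, not the paper's exact method, which also accounts for L2 and L1 capacities):

```python
# Illustrative two-level partitioning of a node's load block:
# processor level (weighted by usable L3 capacity), then core level.

def partition(total, weights):
    """Split `total` items proportionally to `weights`, handing out the
    integer-division remainder one item at a time."""
    shares = [total * w // sum(weights) for w in weights]
    for i in range(total - sum(shares)):
        shares[i % len(shares)] += 1
    return shares


def node_partition(block_size, l3_capacities, cores_per_proc):
    """Return, per processor, the per-core shares of the node's load block."""
    per_proc = partition(block_size, l3_capacities)          # L3-level split
    return [partition(s, [1] * cores_per_proc) for s in per_proc]  # core-level


# Node with three multi-core processors (L3 capacities 8, 8, 4 units),
# 4 cores each, receiving a 1000-element load block.
print(node_partition(1000, [8, 8, 4], 4))
```

Sizing each core's share to what fits in the caches is what lets the utilization rates of L3, L2, and L1 be tuned, the parameter the experiments vary.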
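For the k-selection problem itself (finding the k-th smallest of the distributed elements), a quickselect-style loop whose counting pass is shared among worker threads gives the flavor of the per-core multithreading studied in the experiments. This is a generic illustration, not the paper's algorithm; the default of 3 threads echoes the reported best-performing configuration, though CPython threads only overlap I/O, not this pure-Python counting.

```python
# Threaded k-selection sketch: each worker thread counts, in its own chunk,
# the elements below and equal to the pivot; the pivot loop then narrows
# the candidate set until the k-th smallest element is isolated.
from concurrent.futures import ThreadPoolExecutor


def kselect(data, k, threads=3):
    """Return the k-th smallest element (1-based) of `data`."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            pivot = data[len(data) // 2]
            chunks = [data[i::threads] for i in range(threads)]
            counts = list(pool.map(
                lambda c: (sum(x < pivot for x in c),
                           sum(x == pivot for x in c)),
                chunks))
            below = sum(b for b, _ in counts)
            equal = sum(e for _, e in counts)
            if k <= below:                     # answer lies below the pivot
                data = [x for x in data if x < pivot]
            elif k <= below + equal:           # pivot is the k-th smallest
                return pivot
            else:                              # answer lies above the pivot
                data = [x for x in data if x > pivot]
                k -= below + equal


print(kselect([5, 3, 1, 4, 2], 2))  # → 2
```

In the cluster setting each slave node would run such a counting pass over its own load block and the master would combine the per-node counts, which is where the scheduling and data-allocation methods above come into play.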

       
