Multi-Slave Core Assisted Parallel Composition Algorithm for Sequential Task Flows on the SW39000 Processor

傅游; 贾淑慧; 陈莉; 花嵘; 杜云龙; 高希然

doi:10.7544/issn1000-1239.202550166

Abstract

Sequential task flow (STF) expresses accesses to shared data as dependencies between tasks. An STF runtime system achieves asynchronous parallelism through three stages: task construction, dependency analysis with task dependence graph (TDG) generation, and task scheduling. The overhead of these three stages directly affects the performance of parallel programs. AceMesh, a runtime system built around STF, currently uses only a single master core to build the graph on the SW39000 processor, while multiple slave cores execute the tasks. However, the SW39000 processor has weak discrete memory access performance, and building graphs of fine-grained tasks increases the number of discrete memory accesses, so graph construction more readily becomes the bottleneck. To address this, we propose an algorithm in which multiple slave cores assist the master core in graph construction. First, we analyze the parallelism available in dependency analysis and TDG generation, and we implement and optimize on the SW39000 processor a multi-core-assisted parallel graph construction algorithm based on the fat task dependence graph (fatTDG), called PFBH (parallelized fatTDG building algorithm with helpers). Second, to address contention among threads for main-memory resources, we propose a method for adjusting slave-core resources while graph construction and task execution run in parallel, together with the selection of its parameters. Finally, we evaluate the approach on five typical classes of applications. Compared with the single-core serial graph construction system, the maximum speedup in fine-grained task scenarios is 1.75 times; compared with the OpenACC model on the SW39000 processor, AceMesh achieves a speedup of up to 2 times.
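To make the dependency-analysis stage concrete, the following is a minimal serial sketch of how an STF runtime can derive a TDG from the data each task reads and writes. It is a generic illustration only: it is not AceMesh's API, and it does not implement fatTDG or PFBH. The function name `build_tdg` and the `(name, reads, writes)` task tuple are hypothetical choices made for this example.

```python
from collections import defaultdict

def build_tdg(tasks):
    """Derive predecessor sets for tasks given in submission order.

    tasks: list of (name, reads, writes) tuples (a hypothetical format).
    Returns a dict mapping each task name to the set of task names it
    must wait for.
    """
    last_writer = {}              # data item -> most recent task that wrote it
    readers = defaultdict(list)   # data item -> tasks that read it since that write
    preds = {}
    for name, reads, writes in tasks:
        deps = set()
        for item in reads:
            if item in last_writer:
                deps.add(last_writer[item])      # read-after-write
        for item in writes:
            if item in last_writer:
                deps.add(last_writer[item])      # write-after-write
            deps.update(readers[item])           # write-after-read
        preds[name] = deps
        # Record this task's accesses only after its own dependencies are known.
        for item in reads:
            readers[item].append(name)
        for item in writes:
            last_writer[item] = name
            readers[item] = []                   # a new write starts a new reader set
    return preds

# Four tasks submitted sequentially; t2 and t3 can run in parallel after t1.
example = [
    ("t1", [], ["a"]),
    ("t2", ["a"], ["b"]),
    ("t3", ["a"], ["c"]),
    ("t4", ["b", "c"], ["a"]),
]
graph = build_tdg(example)
```

Every task submission in this sketch touches per-data-item lookup tables, which is the kind of scattered memory access that becomes more frequent as tasks get finer-grained.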
