Abstract:
Sequential task flow (STF) expresses accesses to shared data as dependencies between tasks. An STF runtime system achieves asynchronous parallelism in three steps: task construction, dependency analysis with task dependence graph (TDG) generation, and task scheduling. The overhead of these three steps directly affects the performance of parallel programs. The current AceMesh runtime system, which has STF at its core, uses a single master core, with multiple slave cores for execution, on the SW39000 processor. However, the SW39000 processor performs poorly on discrete memory accesses, and constructing graphs of fine-grained tasks increases the number of such accesses, so graph construction is more likely to become a bottleneck. To address this, we propose an algorithm that uses multiple auxiliary cores to assist the master core in graph construction. First, we analyze the parallelism in the dependency analysis and TDG generation process, and then implement and optimize a multi-core-assisted parallel graph construction algorithm on the SW39000 processor: the parallelized fatTDG building algorithm with helpers (PFBH), which is based on the fat task dependency graph (fatTDG). Second, to address contention for main memory resources among threads, we propose a method for adjusting slave-core resources and selecting parameters during parallel graph construction and execution. Finally, we conduct experiments on five typical applications: compared with a single-core serial graph construction system, the speedup reaches up to 1.75 times in fine-grained task scenarios; compared with the OpenACC model on the SW39000 processor, AceMesh achieves a speedup of up to 2 times.
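To make the dependency analysis and TDG generation step concrete, the following is a minimal serial sketch of how an STF runtime can derive task dependencies from declared data accesses. It is an illustration only and is not the AceMesh implementation, the fatTDG structure, or PFBH; the function name `build_tdg`, the task tuples, and the data names are all hypothetical.

```python
from collections import defaultdict

def build_tdg(tasks):
    """Derive predecessors for tasks given in sequential submission order.

    tasks: list of (name, reads, writes), where reads/writes are lists of
    data identifiers. Returns a dict: task name -> set of predecessor names.
    (Hypothetical interface, chosen for illustration.)
    """
    last_writer = {}              # datum -> most recent task that wrote it
    readers = defaultdict(list)   # datum -> tasks that read it since that write
    preds = {name: set() for name, _, _ in tasks}

    for name, reads, writes in tasks:
        # Analyze against the state left by earlier tasks.
        for d in reads:
            if d in last_writer:
                preds[name].add(last_writer[d])   # read-after-write
        for d in writes:
            if d in last_writer:
                preds[name].add(last_writer[d])   # write-after-write
            preds[name].update(readers[d])        # write-after-read

        # Record this task's accesses for later tasks.
        for d in reads:
            readers[d].append(name)
        for d in writes:
            last_writer[d] = name
            readers[d] = []
    return preds

# Four tasks submitted in program order (hypothetical example).
tasks = [
    ("t1", [], ["a"]),
    ("t2", ["a"], ["b"]),
    ("t3", ["a"], ["c"]),
    ("t4", ["b", "c"], ["a"]),
]
tdg = build_tdg(tasks)
```

In this example `t2` and `t3` depend only on `t1` and may therefore run concurrently, while `t4` must wait for all three earlier tasks. The loop over tasks is the serial graph construction whose cost grows as tasks become finer-grained.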