Parallel Optimization Technology of Tensor Transposition for Domestic Multi-Core DSPs

刘根程; 王庆林; 洪楚河; 彭兴; 夏睿; 梁亚玲; 张庆阳; 车永刚; 刘杰

doi:10.7544/issn1000-1239.202550130


Abstract: Tensor transposition is a fundamental tensor primitive widely used in signal processing, scientific computing, and deep learning, and it plays an important role in tensor-data-intensive applications and high-performance computing. As energy efficiency becomes increasingly important in high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, existing tensor transposition libraries designed for multi-core CPUs and GPUs do not map well onto DSPs because of architectural differences. On the one hand, the vectorization potential of DSP architectures has not been fully exploited; on the other hand, their complex on-chip memory system and multi-level shared memory hierarchy make parallel tensor programming difficult. Targeting the architectural characteristics of domestic multi-core DSPs, we propose the ftmTT algorithm and design and implement a general-purpose tensor transposition library for multi-core DSP architectures. ftmTT exploits the parallelism and vectorization potential of the DSP through memory access patterns adapted to its architecture. Its key contributions are: 1) a tiling strategy that converts high-dimensional tensor transposition into the matrix transposition kernel operations provided by the multi-core DSP platform; 2) a scheme based on DMA point-to-point transfers that coalesces memory accesses to tensor data blocks, reducing data movement overhead; 3) a double-buffering design that asynchronously overlaps transposition computation with DMA transfers to hide communication behind computation, yielding high-performance parallel tensor transposition on multi-core DSPs. Experiments on the domestic multi-core DSP platform FT-M7032 show that ftmTT achieves up to 75.96% of the theoretical bandwidth and 99.23% of the platform's STREAM bandwidth.
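
The tiling and double-buffering ideas named in the abstract can be illustrated with a minimal sketch. The code below is not the ftmTT implementation and does not use any FT-M7032 SDK call: every function name, the tile size, and the list-based "DMA" helpers are hypothetical stand-ins chosen for illustration. It permutes a row-major d0 x d1 x d2 tensor to d0 x d2 x d1 by treating it as a batch of matrices, transposing one tile at a time, and alternating between two staging buffers. In plain Python the load of the next tile runs sequentially, so only the ping-pong structure is shown; on real hardware that load would be an asynchronous DMA transfer overlapping the kernel.

```python
# Illustrative sketch only: all names here are hypothetical, not the ftmTT API.

def dma_load_tile(src, d1, d2, batch, r0, c0, th, tw):
    """Emulated DMA gather: copy a th x tw tile of matrix `batch` to a local buffer."""
    base = batch * d1 * d2
    return [src[base + (r0 + r) * d2 + c0 + c]
            for r in range(th) for c in range(tw)]

def transpose_tile(tile, th, tw):
    """Stand-in for a platform matrix transposition kernel (th x tw -> tw x th)."""
    return [tile[r * tw + c] for c in range(tw) for r in range(th)]

def dma_store_tile(dst, d1, d2, batch, r0, c0, tile_t, th, tw):
    """Emulated DMA scatter: write the tw x th transposed tile to the output."""
    base = batch * d2 * d1  # each output matrix has shape d2 x d1
    for c in range(tw):
        for r in range(th):
            dst[base + (c0 + c) * d1 + r0 + r] = tile_t[c * th + r]

def transpose_021(src, d0, d1, d2, tile=4):
    """Permute a row-major d0 x d1 x d2 tensor to d0 x d2 x d1, tile by tile."""
    dst = [None] * (d0 * d1 * d2)
    tiles = [(b, r0, c0)
             for b in range(d0)
             for r0 in range(0, d1, tile)
             for c0 in range(0, d2, tile)]

    def load(i):
        b, r0, c0 = tiles[i]
        th, tw = min(tile, d1 - r0), min(tile, d2 - c0)  # handle edge tiles
        return dma_load_tile(src, d1, d2, b, r0, c0, th, tw), th, tw

    buffers = [None, None]  # two staging buffers used in ping-pong fashion
    if tiles:
        buffers[0] = load(0)  # prefetch the first tile
    for i, (b, r0, c0) in enumerate(tiles):
        cur = i % 2
        if i + 1 < len(tiles):
            # On hardware this would be an asynchronous DMA request that
            # proceeds while the current tile is being transposed.
            buffers[1 - cur] = load(i + 1)
        data, th, tw = buffers[cur]
        tile_t = transpose_tile(data, th, tw)
        dma_store_tile(dst, d1, d2, b, r0, c0, tile_t, th, tw)
    return dst
```

For example, `transpose_021(list(range(6)), 1, 2, 3)` returns `[0, 3, 1, 4, 2, 5]`, the row-major layout of the transposed 2 x 3 matrix. A general permutation of more dimensions would need additional index bookkeeping that this sketch omits.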
