Parallel Optimization Technology of Tensor Transposition for Domestic Multi-Core DSPs
Graphical Abstract
Abstract
Tensor transposition is a fundamental computational primitive extensively employed in signal processing, scientific computing, and deep learning applications. With the growing emphasis on energy efficiency in high-performance computing, digital signal processors (DSPs) have emerged as promising accelerators. However, existing tensor transposition libraries designed for conventional CPU/GPU architectures fail to fully exploit the architectural advantages of modern DSP platforms, particularly their specialized memory hierarchies and vector processing capabilities. To address the architectural characteristics of domestic DSP platforms, we propose ftmTT, a high-performance tensor transposition algorithm specifically optimized for multi-core DSP architectures. The key technical contributions include: 1) An intelligent tiling strategy that systematically decomposes high-dimensional tensor operations into efficient matrix transposition kernels native to DSP hardware; 2) A novel DMA-based data access optimization scheme that significantly reduces memory transfer overhead through smart data block consolidation; 3) A double-buffering design that overlaps transposition computation with asynchronous DMA transfers, hiding communication latency behind computation and ultimately achieving high-performance parallel tensor transposition on DSP platforms. Experiments on the domestic multi-core DSP platform FT-M7032 demonstrate that the ftmTT tensor transposition algorithm achieves up to 75.96% of the theoretical bandwidth and 99.23% of the experimental platform's STREAM bandwidth.