Abstract:
Tensor transposition is a fundamental tensor operation primitive that plays a vital role in signal processing, scientific computing, deep learning, and other high-performance computing (HPC) applications. With energy efficiency becoming increasingly critical in HPC systems, digital signal processor (DSP)-based accelerators have been integrated into general-purpose computing architectures. However, conventional tensor transposition libraries designed for multi-core CPUs and GPUs fail to fully adapt to DSP platforms because of architectural disparities: on the one hand, the vectorization potential of DSP architectures remains underutilized; on the other, their abundant on-chip memory and hierarchical shared memory pose significant challenges for parallel tensor programming. We propose the ftmTT algorithm and implement a generic tensor transposition library tailored for multi-core DSP architectures. The core innovation of ftmTT lies in designing efficient memory access patterns while fully exploiting vectorization capabilities. Specifically, it introduces a tensor block memory access coalescing scheme optimized for the DMA engine and develops a matrix transposition kernel adapted to the DSP vectorization units. By decomposing tensor transposition into matrix transposition operations through a blocking strategy, and by implementing double buffering that overlaps computation kernels with asynchronous DMA transfers, the algorithm achieves high-speed parallel tensor transposition on DSP platforms. Experimental evaluations on the FT-M7032 platform demonstrate that ftmTT attains up to 75.96% of the theoretical bandwidth for transposition operations, reaching 99.23% of the platform's STREAM benchmark bandwidth.
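The blocking and double-buffering structure summarized above can be sketched roughly as follows. This is a minimal illustration, not the ftmTT implementation: the names dma_get, dma_put, and the tile/matrix sizes B and N are hypothetical, the asynchronous DMA transfers of the FT-M7032 are emulated here with synchronous memcpy, and the scalar in-place tile transpose stands in for the vectorized matrix transposition kernel.

```c
/* Minimal sketch of blocked matrix transposition with double buffering.
 * NOT the ftmTT implementation: dma_get/dma_put are hypothetical
 * stand-ins for the platform's asynchronous DMA primitives, emulated
 * here with synchronous memcpy. */
#include <stdio.h>
#include <string.h>

#define B 4   /* tile (block) edge length; hypothetical value      */
#define N 8   /* matrix edge length, assumed to be a multiple of B */

static float A[N][N], T[N][N];   /* "off-chip" source and destination */
static float buf[2][B][B];       /* ping-pong "on-chip" tile buffers  */

/* Hypothetical DMA read: copy one BxB tile of A into an on-chip buffer. */
static void dma_get(float dst[B][B], int bi, int bj) {
    for (int i = 0; i < B; i++)
        memcpy(dst[i], &A[bi * B + i][bj * B], B * sizeof(float));
}

/* Hypothetical DMA write: store a transposed tile to T, with the tile
 * coordinates swapped (tile (bi,bj) of A lands at tile (bj,bi) of T). */
static void dma_put(const float src[B][B], int bi, int bj) {
    for (int i = 0; i < B; i++)
        memcpy(&T[bj * B + i][bi * B], src[i], B * sizeof(float));
}

/* In-place tile transpose; a real DSP kernel would use the vector unit. */
static void transpose_tile(float t[B][B]) {
    for (int i = 0; i < B; i++)
        for (int j = i + 1; j < B; j++) {
            float tmp = t[i][j]; t[i][j] = t[j][i]; t[j][i] = tmp;
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (float)(i * N + j);

    int nb = N / B, total = nb * nb;
    dma_get(buf[0], 0, 0);                 /* prefetch the first tile   */
    for (int k = 0; k < total; k++) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < total)                 /* overlap: fetch next tile  */
            dma_get(buf[nxt], (k + 1) / nb, (k + 1) % nb);
        transpose_tile(buf[cur]);          /* compute on current tile   */
        dma_put(buf[cur], k / nb, k % nb); /* store transposed tile     */
    }
    /* Spot-check: T should equal the transpose of A. */
    printf("T[3][5]=%.0f A[5][3]=%.0f\n", T[3][5], A[5][3]);
    return 0;
}
```

With genuinely asynchronous DMA, the dma_get issued for tile k+1 would proceed in the background while transpose_tile runs on tile k, which is the overlap that makes the double buffering effective; higher-order tensor transpositions reduce to sequences of such matrix-tile transposes under the blocking decomposition.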