Parallel Optimization Technology of Tensor Transposition for Domestic Multi-Core DSPs
Graphical Abstract
Abstract
Tensor transposition is a fundamental computational primitive extensively employed in signal processing, scientific computing, and deep learning applications. With the growing emphasis on energy efficiency in high-performance computing, digital signal processors (DSPs) have emerged as promising accelerators. However, existing tensor transposition libraries designed for conventional CPU/GPU architectures fail to fully exploit the architectural advantages of modern DSP platforms, particularly their specialized memory hierarchies and vector processing capabilities. To address the architectural characteristics of domestic DSP platforms, we propose ftmTT, a high-performance tensor transposition algorithm specifically optimized for multi-core DSP architectures. The key technical contributions include: 1) An intelligent tiling strategy that systematically decomposes high-dimensional tensor operations into efficient matrix transposition kernels native to DSP hardware; 2) A novel DMA-based data access optimization scheme that significantly reduces memory transfer overhead through smart data block consolidation; 3) A double-buffering design that overlaps transposition computation with asynchronous DMA transfers, hiding communication latency behind computation and ultimately achieving high-performance parallel tensor transposition on DSP platforms. Experiments on the domestic multi-core DSP platform FT-M7032 demonstrate that the ftmTT tensor transposition algorithm achieves up to 75.96% of the theoretical bandwidth and 99.23% of the experimental platform's STREAM bandwidth.