
    Research on Parallel Optimization Techniques for Tensor Transposition on Domestic Multi-core DSPs

    Parallel Optimization of Tensor Transposition for Multi-core DSPs

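    • Abstract: Tensor transposition is a basic tensor operation primitive that is widely used in signal processing, scientific computing, deep learning, and other fields, and it plays an important role in tensor data-intensive applications and high-performance computing. As energy efficiency becomes increasingly important in high-performance computing systems, accelerators based on digital signal processors (DSPs) have been integrated into general-purpose computing systems. However, traditional tensor transposition libraries written for multi-core CPUs and GPUs cannot fully adapt to DSP platforms because of architectural differences. On the one hand, the vectorization potential of DSP platforms has not been fully exploited; on the other hand, their complex on-chip memory system and multi-level shared memory structure pose significant challenges for parallel tensor programming. Targeting the architectural characteristics of domestic DSP platforms, this paper proposes the ftmTT algorithm and designs and implements a general-purpose tensor transposition library for multi-core DSP architectures. ftmTT exploits the parallelization and vectorization potential of the DSP by designing efficient memory access patterns adapted to its architecture. Its core innovations are: (1) a blocking strategy that converts high-dimensional tensor transposition into the matrix transposition kernel operations provided by the DSP platform; (2) a memory access coalescing scheme for tensor data blocks based on DMA point-to-point transfers, which reduces data movement overhead; and (3) a double-buffering design that asynchronously overlaps transposition computation with DMA transfers, hiding communication behind computation. Together, these achieve high-performance parallel tensor transposition on the DSP platform. Experiments on the domestic multi-core DSP platform FT-M7032 show that ftmTT achieves up to 75.96% of the theoretical bandwidth, reaching 99.23% of the STREAM bandwidth of the FT-M7032 platform.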

       

      Abstract: Tensor transposition, a basic tensor operation primitive, is widely used in signal processing, scientific computing, and deep learning, and plays a vital role in tensor data-intensive applications and high-performance computing (HPC). With energy efficiency becoming increasingly critical in HPC systems, digital signal processor (DSP)-based accelerators have been integrated into general-purpose computing systems. However, conventional tensor transposition libraries designed for multi-core CPUs and GPUs cannot fully adapt to DSP platforms because of architectural differences. On one hand, the vectorization potential of DSP architectures remains underutilized; on the other hand, their complex on-chip memory system and multi-level shared memory hierarchy pose significant challenges for parallel tensor programming. We propose the ftmTT algorithm and implement a general-purpose tensor transposition library tailored to multi-core DSP architectures. The core of ftmTT is a set of efficient memory access patterns that exploit the parallelization and vectorization capabilities of the DSP. Specifically, it (1) uses a blocking strategy to decompose high-dimensional tensor transposition into the matrix transposition kernel operations provided by the DSP platform, (2) introduces a tensor-block memory access coalescing scheme based on point-to-point DMA transfers to reduce data movement overhead, and (3) uses double buffering to overlap transposition kernels with asynchronous DMA transfers, hiding communication behind computation. Experimental evaluations on the FT-M7032 multi-core DSP platform show that ftmTT attains up to 75.96% of the theoretical bandwidth, which corresponds to 99.23% of the platform's STREAM benchmark bandwidth.
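The blocking and double-buffering ideas in the abstract can be illustrated with a small host-side sketch. This is not the ftmTT implementation: it assumes one fixed permutation (a row-major tensor of shape (A, B, C) reversed to (C, B, A)), and the names `dma_load`, `dma_store`, `transpose_kernel`, and `transpose_abc_to_cba` are hypothetical stand-ins for the platform's DMA engine and matrix transposition kernel. The "asynchronous" prefetch is executed sequentially here, so the sketch shows only the ping-pong buffer schedule, not real overlap.

```python
def dma_load(src, base, row_stride, rows, cols):
    """Stand-in for a DMA read: gather a rows x cols tile from strided
    memory into a contiguous local buffer (hypothetical helper)."""
    return [src[base + r * row_stride + c] for r in range(rows) for c in range(cols)]


def transpose_kernel(tile, rows, cols):
    """Stand-in for a matrix transposition kernel: rows x cols -> cols x rows."""
    return [tile[r * cols + c] for c in range(cols) for r in range(rows)]


def dma_store(dst, base, row_stride, tile, rows, cols):
    """Stand-in for a DMA write: scatter a contiguous rows x cols tile
    back to strided memory, one contiguous row segment at a time."""
    for r in range(rows):
        start = base + r * row_stride
        dst[start:start + cols] = tile[r * cols:(r + 1) * cols]


def transpose_abc_to_cba(src, shape, tile=4):
    """Permute a row-major (A, B, C) tensor to (C, B, A) using tiled
    2D transposes and two alternating tile buffers (illustrative only)."""
    A, B, C = shape
    dst = [None] * (A * B * C)

    # For each fixed b, the slice [:, b, :] is an A x C matrix with row
    # stride B*C; it is cut into tiles of at most tile x tile elements.
    tasks = [(b, a0, c0)
             for b in range(B)
             for a0 in range(0, A, tile)
             for c0 in range(0, C, tile)]

    def load(task):
        b, a0, c0 = task
        rows, cols = min(tile, A - a0), min(tile, C - c0)
        data = dma_load(src, a0 * B * C + b * C + c0, B * C, rows, cols)
        return data, rows, cols

    buffers = [None, None]  # ping-pong buffers
    if tasks:
        buffers[0] = load(tasks[0])
    for i, (b, a0, c0) in enumerate(tasks):
        # Prefetch the next tile into the other buffer. On real hardware
        # this would be an asynchronous DMA running during the kernel.
        if i + 1 < len(tasks):
            buffers[(i + 1) % 2] = load(tasks[i + 1])
        data, rows, cols = buffers[i % 2]
        out = transpose_kernel(data, rows, cols)  # now cols x rows
        # In the output, slice [:, b, :] is a C x A matrix with row stride B*A.
        dma_store(dst, c0 * B * A + b * A + a0, B * A, out, cols, rows)
    return dst
```

Calling `transpose_abc_to_cba(list(range(30)), (3, 2, 5), tile=2)` returns a flat list in which element (c, b, a) of the output equals element (a, b, c) of the input. A general library would have to handle arbitrary permutations and choose tile sizes to fit on-chip memory, which this sketch does not attempt.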

       
