    Efficient and Scalable 3D-FFT Heterogeneous Computing Architecture

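    • Abstract (translated from the Chinese): The three-dimensional fast Fourier transform (3D-FFT) is a core algorithm in high-performance computing, and accelerating it is valuable for improving the performance of scientific computing applications. To address the bottlenecks of existing 3D-FFT implementations in memory access efficiency, synchronization handling, and parallel computing capability, an optimization method based on a CPU + FPGA heterogeneous computing architecture is designed and implemented. By establishing a blocked transfer model for 3D data, building a pipelined synchronous data transposition unit, and developing a hierarchical multi-core parallel computing architecture, the method systematically improves the execution efficiency of the 3D-FFT algorithm. Experiments verify that the proposed heterogeneous architecture reaches a memory data transfer bandwidth of 202 GBps and runs 128 FFT cores in parallel, with a scalable computation size. The results show that, for 3D-FFT operations on the 64³ and 128³ 3D matrices typical of scientific computing, the architecture achieves performance improvements of 62.7% and 56.6%, respectively, over a conventional CPU implementation; compared with existing FPGA acceleration schemes, computational efficiency improves by 32.6% and 35.3%, respectively; and for large-scale matrices its performance is comparable to that of GPU acceleration schemes.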

       

      Abstract: The three-dimensional fast Fourier transform (3D-FFT) algorithm serves as a fundamental computing kernel for numerous high-performance computing (HPC) applications, where its efficient implementation critically determines overall system performance. This paper proposes a heterogeneous computing architecture leveraging CPU-FPGA co-processing to address the performance bottlenecks in conventional 3D-FFT implementations, including inefficient memory access patterns, procedural redundancy, and limited parallelization capabilities. Three key architectural innovations are introduced: 1) a hierarchical data management strategy enabling multi-channel data transmission with optimized bandwidth utilization, 2) a pipelined synchronization mechanism that overlaps matrix transpose operations with FFT computation, and 3) a scalable parallel computation architecture supporting up to 128 concurrent FFT processing units. Experimental evaluations demonstrate significant 3D-FFT performance improvements for common scientific computing scenarios involving 64³ and 128³ matrices. Compared with CPU-only solutions, the proposed architecture achieves computation time reductions of 62.7% and 56.6% for 64³ and 128³ matrices, respectively. Compared with existing FPGA-accelerated 3D-FFT implementations, the proposed design exhibits performance enhancements of 32.6% and 35.3% for the corresponding matrix sizes. These results validate the architecture's effectiveness in optimizing memory utilization, improving task synchronization efficiency, and enhancing computational parallelism. The solution provides a scalable and energy-efficient acceleration architecture for HPC systems requiring large-scale 3D-FFT computations, particularly in scenarios demanding high throughput and low latency.
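
The role of the transpose step and of the parallel FFT units can be illustrated with a small software sketch. It is not the paper's FPGA design: it only shows the standard decomposition of a 3D-FFT into three passes of 1D FFTs, with an axis rotation (transpose) between passes so that each pass reads contiguous rows. The names `fft1d`, `fft_last_axis`, `rotate_axes`, `fft3d`, and `rounds_needed` are hypothetical, and the round count assumes an idealized schedule in which each of the 128 units performs one 1D FFT per round.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_last_axis(a):
    """Apply an independent 1D FFT to every row along the last axis."""
    return [[fft1d(row) for row in plane] for plane in a]


def rotate_axes(a):
    """Transpose an n*n*n nested list so that b[k][i][j] == a[i][j][k]."""
    n = len(a)
    return [[[a[i][j][k] for j in range(n)] for i in range(n)]
            for k in range(n)]


def fft3d(a):
    """3D FFT as three (1D-FFT pass, axis rotation) stages.

    After three cyclic rotations the data is back in its original
    axis order, so result[u][v][w] matches input axes (i, j, k).
    """
    for _ in range(3):
        a = rotate_axes(fft_last_axis(a))
    return a


def rounds_needed(n, cores=128):
    """Idealized rounds: 3 * n * n independent 1D FFTs shared by `cores` units.

    Assumption for illustration only; the real design overlaps
    transposition with computation and is not modeled here.
    """
    total_1d_ffts = 3 * n * n
    return -(-total_1d_ffts // cores)  # integer ceiling division
```

Under these assumptions, a 64³ matrix needs 3 × 64² = 12288 one-dimensional FFTs, so `rounds_needed(64)` is 96, and `rounds_needed(128)` is 384. Because the 1D FFTs within a pass are independent, the axis rotation between passes is the only point where data must be reorganized, which is why overlapping it with computation matters.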

       
