
    Efficient and scalable 3D-FFT heterogeneous computing architecture (高效可扩展的三维快速傅里叶变换异构计算架构)

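    • Abstract (translated from the Chinese original): The three-dimensional fast Fourier transform (3D-FFT) is a core algorithm in high-performance computing, and accelerating it is of significant research value for improving the performance of scientific computing applications. To address the bottlenecks of existing 3D-FFT implementations in memory-access efficiency, synchronization handling, and parallel computing capability, this work designs and implements an optimization approach based on a CPU + FPGA heterogeneous computing architecture. By establishing a blocked transfer model for three-dimensional data, constructing a pipelined synchronous transposition unit, and developing a hierarchical multi-core parallel computing architecture, it systematically improves the execution efficiency of the 3D-FFT algorithm. Experiments verify that the proposed heterogeneous architecture reaches a memory data-transfer bandwidth of 202 GB/s and runs 128 FFT cores in parallel, with a scalable computation size. The results show that, for 3D-FFT operations on the 64×64×64 and 128×128×128 three-dimensional matrices typical of scientific computing, the architecture achieves performance improvements of 69.8% and 57.5%, respectively, over conventional CPU implementations; compared with existing FPGA acceleration schemes, computational efficiency improves by 32.6% and 35.3%, respectively; and for large-scale matrices its performance is comparable to that of GPU acceleration schemes.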

       

      Abstract: The three-dimensional fast Fourier transform (3D-FFT) is a fundamental computational kernel in many high-performance computing (HPC) applications, and its efficiency strongly affects overall system performance. This paper proposes a heterogeneous computing architecture based on CPU-FPGA co-processing to address the performance bottlenecks of conventional 3D-FFT implementations: inefficient memory access patterns, procedural redundancy, and limited parallelism. The design introduces three architectural innovations: (1) a hierarchical data management strategy enabling multi-channel data transmission with optimized bandwidth utilization, (2) a pipelined synchronization mechanism that overlaps matrix transposition with the FFT computation phases, and (3) a scalable parallel computation architecture supporting up to 128 concurrent FFT processing units. Experimental evaluation demonstrates significant 3D-FFT performance improvements for common scientific computing scenarios involving 64×64×64 and 128×128×128 matrices. Compared with CPU-only solutions, the architecture reduces computation time by 69.8% and 57.5% for 64×64×64 and 128×128×128 matrices, respectively. Compared with existing FPGA-accelerated 3D-FFT implementations, the proposed design improves performance by 32.6% and 35.3% for the same matrix sizes. These results validate the architecture's effectiveness in optimizing memory hierarchy utilization, improving task synchronization efficiency, and enhancing computational parallelism. The solution provides a scalable and energy-efficient acceleration architecture for HPC systems that require large-scale 3D-FFT computations, particularly in scenarios demanding high throughput and low latency.
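
      For readers unfamiliar with why transposition matters in a 3D-FFT, the following is a minimal sequential sketch of the standard decomposition of a 3D-FFT into passes of 1D FFTs separated by axis rotations (transposes). It is not the authors' FPGA design: the function names `fft1d`, `rotate_axes`, `fft3d`, and `dft3d` are hypothetical, the code runs the transposition and FFT phases one after another rather than overlapping them, and it assumes every dimension is a power of two.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def rotate_axes(a):
    """Return b with b[k][i][j] == a[i][j][k] (cyclic axis rotation)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]


def fft3d(a):
    """3D-FFT as three rounds of (1D FFTs on the innermost axis, rotate)."""
    for _ in range(3):
        # 1D FFT over every contiguous innermost row.
        a = [[fft1d(row) for row in plane] for plane in a]
        # Bring the next axis into the innermost (contiguous) position.
        # After three rotations the original axis order is restored.
        a = rotate_axes(a)
    return a


def dft3d(a):
    """Direct O(N^2) 3D DFT, used only as a reference for checking."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * n2 for _ in range(n1)] for _ in range(n0)]
    for u in range(n0):
        for v in range(n1):
            for w in range(n2):
                s = 0j
                for i in range(n0):
                    for j in range(n1):
                        for k in range(n2):
                            phase = u * i / n0 + v * j / n1 + w * k / n2
                            s += a[i][j][k] * cmath.exp(-2j * cmath.pi * phase)
                out[u][v][w] = s
    return out


if __name__ == "__main__":
    n = 4
    data = [[[float(i * n * n + j * n + k) for k in range(n)]
             for j in range(n)] for i in range(n)]
    fast, ref = fft3d(data), dft3d(data)
    err = max(abs(fast[i][j][k] - ref[i][j][k])
              for i in range(n) for j in range(n) for k in range(n))
    print("max abs error vs direct DFT:", err)
```

      In this sketch each rotation is a full pass over the data between FFT rounds, which is the kind of memory traffic that the abstract's pipelined transposition unit is described as overlapping with computation.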

       
