Efficient and Scalable 3D-FFT Heterogeneous Computing Architecture

邓子为; 郭巍; 徐亚明; 王青; 张德闪; 卢圣才; 刘弢

doi:10.7544/issn1000-1239.202550393

Abstract

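The three-dimensional fast Fourier transform (3D-FFT) is a core algorithm in high-performance computing, and accelerating it is of considerable value for improving the performance of scientific computing applications. To address the performance bottlenecks of existing 3D-FFT implementations in memory access efficiency, synchronization handling, and parallel computing capability, an optimization method based on a CPU + FPGA heterogeneous computing architecture is designed and implemented. By establishing a blocked transfer model for 3D data, constructing a pipelined synchronous data transpose unit, and developing a hierarchical multi-core parallel computing architecture, the execution efficiency of the 3D-FFT algorithm is improved systematically. Experiments verify that the proposed heterogeneous computing architecture reaches a memory data transfer bandwidth of 202 GBps and runs 128 FFT cores in parallel, with a scalable computation size. The experimental results show that, for 3D-FFT on 64³ and 128³ matrices, two sizes typical of scientific computing scenarios, the architecture achieves performance improvements of 62.7% and 56.6%, respectively, over a conventional CPU implementation; compared with existing FPGA acceleration schemes, computational efficiency improves by 32.6% and 35.3%, respectively; and for large-scale matrices, performance is comparable to that of GPU acceleration schemes.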

Abstract: The three-dimensional fast Fourier transform (3D-FFT) algorithm serves as a fundamental computing kernel for numerous high-performance computing (HPC) applications, where the efficiency of its implementation critically determines overall system performance. This paper proposes a heterogeneous computing architecture leveraging CPU-FPGA co-processing to address the performance bottlenecks in conventional 3D-FFT implementations, including inefficient memory access patterns, procedural redundancy, and limited parallelization capabilities. Three key architectural innovations are introduced: 1) a hierarchical data management strategy enabling multi-channel data transmission with optimized bandwidth utilization, 2) a pipelined synchronization mechanism that overlaps matrix transpose operations with FFT computation, and 3) a scalable parallel computation architecture supporting up to 128 concurrent FFT processing units. Experimental evaluations demonstrate significant performance improvements for 3D-FFT in common scientific computing scenarios involving 64³ and 128³ matrices. Compared with CPU-only solutions, the proposed architecture achieves computation time reductions of 62.7% and 56.6% for 64³ and 128³ matrices, respectively. Compared with existing FPGA-accelerated 3D-FFT implementations, the proposed design exhibits performance enhancements of 32.6% and 35.3% for the corresponding matrix sizes. These results validate the architecture's effectiveness in optimizing memory utilization, improving task synchronization efficiency, and enhancing computational parallelism. The solution provides a scalable and energy-efficient acceleration architecture for HPC systems requiring large-scale 3D-FFT computations, particularly in scenarios demanding high throughput and low latency.
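
The abstract does not give implementation details, so the following is only a software illustration of why matrix transposition and 1D FFT cores appear together in a 3D-FFT design. It is a minimal Python sketch of the widely used row-column decomposition: run 1D FFTs along the contiguous axis, rotate the axes (a transpose), and repeat three times. All function names (`fft1d`, `fft_last_axis`, `rotate_axes`, `fft3d`, `dft3d_naive`) are hypothetical and are not taken from the paper; the sketch assumes power-of-two axis lengths and does not model the FPGA block transfers, pipelining, or multi-core parallelism the paper describes.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_last_axis(a):
    """Apply fft1d to every innermost (contiguous) row a[i][j][:]."""
    return [[fft1d(row) for row in plane] for plane in a]


def rotate_axes(a):
    """Transpose so that b[j][k][i] == a[i][j][k] (cyclic axis rotation)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for i in range(n0)] for k in range(n2)]
            for j in range(n1)]


def fft3d(a):
    """3D FFT as three rounds of (1D FFTs on the last axis, then rotate).

    Three rotations return the data to its original axis order.
    """
    for _ in range(3):
        a = rotate_axes(fft_last_axis(a))
    return a


def dft3d_naive(a):
    """Direct evaluation of the 3D DFT definition, used only as a reference."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])

    def term(u, v, w):
        return sum(
            a[i][j][k]
            * cmath.exp(-2j * cmath.pi * (u * i / n0 + v * j / n1 + w * k / n2))
            for i in range(n0) for j in range(n1) for k in range(n2)
        )

    return [[[term(u, v, w) for w in range(n2)] for v in range(n1)]
            for u in range(n0)]
```

In this sketch every 1D transform inside `fft_last_axis` is independent of the others, which is the property that lets a hardware design assign rows to many FFT units at once. The `rotate_axes` step stands in for the data transposition that such a design has to overlap with computation.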
