Efficient and Scalable 3D-FFT Heterogeneous Computing Architecture
-
Abstract
The three-dimensional fast Fourier transform (3D-FFT) is a fundamental computing kernel in numerous high-performance computing (HPC) applications, and the efficiency of its implementation largely determines overall system performance. This paper proposes a heterogeneous CPU-FPGA co-processing architecture that addresses the performance bottlenecks of conventional 3D-FFT implementations: inefficient memory access patterns, procedural redundancy, and limited parallelism. Three key architectural innovations are introduced: 1) a hierarchical data management strategy enabling multi-channel data transmission with optimized bandwidth utilization, 2) a pipelined synchronization mechanism that overlaps matrix transpose operations with FFT computation, and 3) a scalable parallel computation architecture supporting up to 128 concurrent FFT processing units. Experimental evaluations demonstrate significant 3D-FFT performance improvements for the 64³ and 128³ matrices common in scientific computing. Compared with CPU-only solutions, the proposed architecture reduces computation time by 62.7% and 56.6% for 64³ and 128³ matrices, respectively. Compared with existing FPGA-accelerated 3D-FFT implementations, the proposed design improves performance by 32.6% and 35.3% for the same matrix sizes. These results validate the architecture's effectiveness in optimizing memory utilization, improving task synchronization efficiency, and enhancing computational parallelism. The solution provides a scalable and energy-efficient acceleration architecture for HPC systems requiring large-scale 3D-FFT computations, particularly in scenarios demanding high throughput and low latency.
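To make the role of the transpose step concrete, the following is a minimal software sketch of a 3D-FFT computed as three passes of 1D FFTs with an axis rotation (transpose) after each pass. It assumes the standard row-column decomposition and power-of-two dimensions; the abstract does not specify the decomposition used by the proposed hardware, and the function names (`fft1d`, `fft_last_axis`, `rotate_axes`, `fft3d`) are hypothetical and chosen only for illustration.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_last_axis(a):
    """Apply a 1D FFT to every contiguous row a[i][j][:]."""
    return [[fft1d(row) for row in plane] for plane in a]


def rotate_axes(a):
    """Transpose so that b[k][i][j] == a[i][j][k] (last axis moves to the front)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]


def fft3d(a):
    """3D FFT as three (1D-FFT pass, transpose) stages.

    After three rotations the data is back in its original axis order,
    and each of the three axes has been transformed exactly once.
    """
    for _ in range(3):
        a = rotate_axes(fft_last_axis(a))
    return a
```

In this sketch each transpose runs only after the preceding FFT pass has finished. The pipelined synchronization mechanism described above targets exactly this dependency by overlapping the transpose with FFT computation, and the independent rows processed in `fft_last_axis` are the kind of work that could be distributed across concurrent FFT processing units.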
-
-