高级检索
    雷元武, 陈小文, 彭元喜. DSP芯片中的高能效FFT加速器[J]. 计算机研究与发展, 2016, 53(7): 1438-1446. DOI: 10.7544/issn1000-1239.2016.20160123
    引用本文: 雷元武, 陈小文, 彭元喜. DSP芯片中的高能效FFT加速器[J]. 计算机研究与发展, 2016, 53(7): 1438-1446. DOI: 10.7544/issn1000-1239.2016.20160123
    Lei Yuanwu, Chen Xiaowen, Peng Yuanxi. A High Energy Efficiency FFT Accelerator on DSP Chip[J]. Journal of Computer Research and Development, 2016, 53(7): 1438-1446. DOI: 10.7544/issn1000-1239.2016.20160123
    Citation: Lei Yuanwu, Chen Xiaowen, Peng Yuanxi. A High Energy Efficiency FFT Accelerator on DSP Chip[J]. Journal of Computer Research and Development, 2016, 53(7): 1438-1446. DOI: 10.7544/issn1000-1239.2016.20160123

    DSP芯片中的高能效FFT加速器

    A High Energy Efficiency FFT Accelerator on DSP Chip

    • 摘要: 快速傅里叶变换(fast Fourier transform, FFT)是数字信号处理(digital signal processing, DSP)领域中最耗时的核心算法,该算法的计算性能和计算效率将影响整个应用的执行效率.因此,在DSP芯片上设计实现了一个基于矩阵转置操作的高能效可变长度FFT加速器,采用多种并行策略开发批量小规模FFT算法与大规模Cooley-Tukey FFT算法中指令级和任务级并行.设计“乒乓”多体数据存储器,重叠数据搬移和FFT计算之间的开销,提高FFT加速器计算效率.并基于此存储器,提出基于基本块的快速矩阵转置算法,从而避免对数据矩阵的列访问;提出混合旋转因子产生策略,结合查表和基于CORDIC算法在线计算方式,最大限度降低旋转因子产生的硬件开销.实验结果表明:FFT加速器原型的峰值能效为146 GFLOPs/W,相比Intel Xeon CPU上的多线程FFTW实现,取得2个数量级的能效提升.

       

      Abstract: Fast Fourier transform (FFT) is a most time-consuming algorithm in the domain of digital signal processing (DSP). The performance and energy efficiency of FFT will make significant effect on different DSP applications. Thus, this paper presents a high energy efficiency variable-size FFT accelerator based on matrix transposition on DSP chip. Several parallel schemes are employed to exploit instruction level parallel and task level parallel of batch of small-size FFTs or big-size Cooley-Tukey FFT. A “Ping-Pong” structure of multi-bank data memory (MBDM) is presented to overlap the overhead of data move and FFT calculation. Moreover, based on MBDM, fast matrix transposition algorithm with basic block transposition is presented to avoid the matrix access with column-wise and improve the utilization of DDR bandwidth. Hybrid twiddle factor generating scheme, combining lookup table and on-line calculation with CORDIC, is presented to reduce the hardware for twiddle factor. Experimental results show that our FFT accelerator prototype with power efficiency of 146 GFLOPs/W, achieves energy efficiency improvement by about two orders of magnitude with multi-thread FFTW on Intel Xeon CPU.

       

    /

    返回文章
    返回