
    BE-HB: A Hybrid Bit-Width Convolution Processing Unit Based on Block Floating Point


      Abstract: Hybrid bit-width block floating point (BFP) offers a flexible solution for low bit-width convolution computation, optimizing both storage efficiency and computational precision. By assigning higher bit-widths to numerically sensitive layers while using lower bit-widths for redundant or stable regions, this approach preserves near-floating-point accuracy at substantially reduced computational and storage cost. Recent studies have deployed hardware solutions such as field-programmable gate arrays (FPGAs) for hybrid bit-width BFP-based convolution acceleration, but they tend to underutilize FPGA resources by overlooking the full potential of digital signal processors (DSPs). Specifically, the underutilization of DSPs often leads to unnecessary resource waste and limited computational throughput, which restricts the overall performance of FPGA-based BFP convolution accelerators. This work develops a novel FPGA-based BFP convolution processing unit, termed “BE-HB”, capable of coupling two sets of BFP convolution calculations in dual bit-width modes (i.e., 8 b or 16 b) using a single DSP for high performance. We then introduce a mapping method that reuses the shared exponents and private mantissas of BFP representations to perform two sets of BFP convolution computations within 8 b or 16 b DSP data paths. By leveraging exponent sharing, data packing, and data reuse, the proposed approach significantly reduces hardware resource overhead. Compared with representative baseline designs, the proposed design achieves an average reduction of 61.4% in look-up table (LUT) utilization while maintaining model accuracy, thereby delivering superior performance and resource efficiency, making it well suited to resource-constrained FPGA-based edge computing platforms.
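The abstract rests on two ideas: the BFP representation (one shared exponent per block plus private integer mantissas) and the coupling of two multiplications inside a single wide DSP multiplier via bit-field packing. The following minimal Python sketch illustrates both. All function names are illustrative assumptions, and the dual-multiply demonstration assumes nonnegative operands; the paper's hardware handles signed dual-mode (8 b or 16 b) data with correction logic not modeled here, and its exact rounding scheme may differ.

```python
import math

def bfp_pack(block, mant_bits=8):
    """Quantize a block of floats into BFP form: one shared exponent
    plus a private signed-integer mantissa per element.
    (Illustrative sketch; the paper's rounding scheme may differ.)"""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return 0, [0] * len(block)
    # Choose the shared exponent so the largest value fills the signed range.
    exp = math.floor(math.log2(max_abs)) - (mant_bits - 2)
    lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
    mants = [max(lo, min(hi, round(v / 2.0 ** exp))) for v in block]
    return exp, mants

def bfp_unpack(exp, mants):
    """Reconstruct approximate float values from a BFP block."""
    return [m * 2.0 ** exp for m in mants]

SHIFT = 18  # field separation, sized so an 8 b x 8 b product cannot overflow it

def packed_dual_mul(w, a1, a2):
    """Compute w*a1 and w*a2 with a single wide multiplication by placing
    a1 and a2 in disjoint bit fields of one operand, in the spirit of
    coupling two computations per DSP. Assumes nonnegative 8 b operands;
    signed packing needs an extra correction term omitted here."""
    packed = (a1 << SHIFT) | a2
    p = w * packed                        # one DSP-style wide multiply
    return p >> SHIFT, p & ((1 << SHIFT) - 1)
```

Because both factors of a BFP product carry shared exponents, only the integer mantissas flow through the multiplier; the result's exponent is simply the sum of the two blocks' shared exponents, which is what makes this packing cheap in LUTs.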

