Abstract:
Hybrid bit-width block floating point (BFP) offers a flexible solution for low bit-width convolution computations, balancing storage efficiency against computational precision. By assigning higher bit-widths to numerically sensitive layers while using lower bit-widths for redundant or stable regions, this approach preserves near–floating-point accuracy at substantially reduced computational and storage cost. Recent research has deployed hardware platforms such as field programmable gate arrays (FPGAs) for hybrid bit-width BFP-based convolution acceleration, but these designs tend to underutilize FPGA resources by overlooking the full potential of digital signal processing (DSP) blocks. This work develops a novel FPGA-based BFP convolution processing unit, termed "BE-HB", capable of coupling two sets of BFP convolution calculations in dual-mode bit-width (i.e., 8-bit or 16-bit) on a single DSP for high performance. We then introduce a novel mapping method that reuses the shared exponents and private mantissas of BFP representations to perform two sets of BFP convolution computations within 8-bit or 16-bit DSP data paths. By leveraging exponent sharing, data packing, and data reuse, the proposed approach significantly reduces hardware resource overhead. Compared with representative baseline designs, the proposed design achieves an average reduction of 61.4% in LUT utilization while maintaining model accuracy, thereby delivering superior performance and resource efficiency.
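The two ingredients named above, BFP representation (one shared exponent per block with private integer mantissas) and packing two narrow multiplications into one wide multiplier, can be illustrated in software. The sketch below is not the paper's BE-HB design; the function names (`to_bfp`, `packed_dual_multiply`) and the unsigned 8-bit operand assumption are purely illustrative of the general techniques.

```python
# Illustrative sketch (NOT the BE-HB implementation): BFP quantization with a
# shared block exponent, and the generic trick of extracting two products from
# a single wide multiplication, as one DSP multiplier might.
import numpy as np

def to_bfp(block, mant_bits=8):
    """Quantize a block of floats to BFP: one shared exponent, signed mantissas."""
    max_abs = np.max(np.abs(block))
    shared_exp = int(np.floor(np.log2(max_abs))) if max_abs > 0 else 0
    # Scale so the largest magnitude uses most of the signed mantissa range.
    scale = 2.0 ** (shared_exp - (mant_bits - 2))
    lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
    mants = np.clip(np.round(block / scale), lo, hi).astype(np.int32)
    return shared_exp, mants, scale

def from_bfp(mants, scale):
    """Dequantize: all mantissas in the block share one scale (exponent)."""
    return mants.astype(np.float64) * scale

def packed_dual_multiply(a, b, w, shift=18):
    """Compute a*w and b*w with ONE multiplication (unsigned 8-bit operands).

    Placing `a` high enough above `b` keeps the partial products disjoint:
    ((a << shift) + b) * w = (a*w << shift) + b*w, with b*w < 2**shift.
    """
    p = ((a << shift) + b) * w        # the single wide multiply
    return p >> shift, p & ((1 << shift) - 1)
```

Because every mantissa in a block shares one exponent, a BFP dot product reduces to integer mantissa arithmetic plus a single exponent addition, which is what makes packing two such computations into one DSP data path attractive.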