
    NTT Butterfly Arithmetic Acceleration Based on Dataflow Architecture

    • Abstract: Fully homomorphic encryption (FHE) keeps data encrypted throughout computation, which makes it an important building block for privacy protection in cloud computing and other distributed environments. In practice, however, FHE performance is severely limited by high computational complexity, poor data locality, and restricted parallelism. The number theoretic transform (NTT) is a key primitive operator in FHE, and its performance largely determines the efficiency of the whole system. This paper targets the butterfly computation, the core computational pattern of the NTT, and proposes an NTT acceleration architecture based on a dataflow computing model. First, we design an RVFHE instruction set extension for NTT butterfly computation, with custom modular multiplication and modular addition/subtraction units that improve the efficiency of modular arithmetic. Second, we propose an NTT data reordering method, combined with a structured butterfly address generation strategy, to reduce the control complexity and access conflicts of cross-row and cross-column data exchange. Finally, we design an NTT acceleration architecture with a dataflow-driven mechanism, in which data-dependency-triggered execution provides efficient on-chip scheduling and data reuse and thereby exposes operation-level parallelism. Experimental results show that, compared with an NVIDIA GPU, the proposed architecture achieves up to 8.96× higher performance and 8.53× higher energy efficiency; compared with existing NTT accelerators, it achieves a 1.37× performance improvement.
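
For readers unfamiliar with the butterfly pattern the abstract refers to, the following is a minimal software sketch of a textbook radix-2 Cooley-Tukey NTT. It is not the paper's RVFHE instruction set, data reordering method, or dataflow architecture; the function names (`butterfly`, `bit_reverse_permute`, `ntt`, `intt`) are hypothetical and chosen only for illustration. It assumes the input length `n` is a power of two and that `root` is a primitive `n`-th root of unity modulo a prime `q`. Each butterfly uses exactly one modular multiplication, one modular addition, and one modular subtraction, which are the operations the abstract says the custom units target.

```python
def butterfly(a, b, w, q):
    """One Cooley-Tukey butterfly: modular multiply, then modular add and subtract."""
    t = (b * w) % q
    return (a + t) % q, (a - t) % q


def bit_reverse_permute(values):
    """Reorder the input so the in-place stages below emit natural-order output."""
    n = len(values)
    bits = n.bit_length() - 1  # log2(n), assuming n is a power of two
    out = [0] * n
    for i, v in enumerate(values):
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[j] = v
    return out


def ntt(values, root, q):
    """Iterative radix-2 NTT of `values` modulo q (assumption: root has order n)."""
    n = len(values)
    a = bit_reverse_permute(values)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(root, n // length, q)  # twiddle step for this stage
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                i, j = start + k, start + k + half
                a[i], a[j] = butterfly(a[i], a[j], w, q)
                w = (w * w_len) % q
        length *= 2
    return a


def intt(values, root, q):
    """Inverse NTT: forward transform with root^-1, then scale by n^-1 mod q."""
    n = len(values)
    inv_root = pow(root, -1, q)  # modular inverse, Python 3.8+
    inv_n = pow(n, -1, q)
    return [(x * inv_n) % q for x in ntt(values, inv_root, q)]
```

As a small illustrative parameter choice (not taken from the paper), `q = 17` with `root = 2` works for `n = 8`, since 2 has multiplicative order 8 modulo 17; `intt(ntt(x, 2, 17), 2, 17)` then returns `x` for any list of eight residues. The stage loop also shows why access patterns matter in hardware: the distance between the two operands of a butterfly doubles at every stage.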

       
