NTT Butterfly Arithmetic Acceleration Based on Dataflow Architecture

石泓博; 范志华; 李文明; 张志远; 穆宇栋; 叶笑春; 安学军

doi:10.7544/issn1000-1239.202550160

Abstract

Fully homomorphic encryption (FHE) keeps data encrypted throughout the entire computation, which makes it an important building block for privacy protection in cloud computing and other distributed environments. In practice, however, FHE suffers from high computational complexity, poor data locality, and limited parallelism, and these severely constrain its performance. The number theoretic transform (NTT) is a key basic operator in FHE, and its performance largely determines the efficiency of the whole system. Targeting butterfly computation, the core computation pattern of the NTT, we propose an NTT acceleration architecture based on the dataflow computing model. First, we design the RVFHE extension instruction set for NTT butterfly computation, with custom modular multiplication and modular addition/subtraction units that improve the efficiency of modular arithmetic. Second, we propose an NTT data reordering method and combine it with a structured butterfly address generation strategy to reduce the control complexity and access conflicts of cross-row and cross-column data exchange. Finally, we design an NTT acceleration architecture with a dataflow-driven mechanism, in which execution is triggered by data dependencies; this enables efficient on-chip scheduling and data reuse and thereby exploits operation-level parallelism. Experimental results show that, compared with an NVIDIA GPU, the proposed architecture achieves up to 8.96 times higher performance and 8.53 times higher energy efficiency; compared with existing NTT accelerators, it achieves a 1.37 times performance improvement.
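For readers unfamiliar with the computation pattern the abstract refers to, the following is a minimal software sketch of a textbook radix-2 Cooley-Tukey NTT. It is not the proposed architecture or the RVFHE instruction set; the function names, the toy modulus q = 17, and the root of unity 2 are illustrative assumptions. It only shows the three ingredients named above: modular multiplication, modular addition/subtraction inside one butterfly, and a data reordering step (here, the standard bit-reversal permutation).

```python
def butterfly(a, b, w, q):
    """One Cooley-Tukey butterfly: (a, b) -> (a + w*b, a - w*b) mod q."""
    t = (w * b) % q                      # modular multiplication
    return (a + t) % q, (a - t) % q      # modular addition / subtraction


def bit_reverse_permute(x):
    """Move the element at index i to the bit-reversed index (len(x) is a power of two)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        r = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[r] = x[i]
    return out


def ntt(x, omega, q):
    """Iterative radix-2 NTT; omega must be a primitive len(x)-th root of unity mod q."""
    n = len(x)
    a = bit_reverse_permute(x)
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)   # primitive length-th root of unity
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for j in range(half):
                u, v = start + j, start + j + half
                a[u], a[v] = butterfly(a[u], a[v], w, q)
                w = (w * w_len) % q
        length *= 2
    return a


def naive_ntt(x, omega, q):
    """O(n^2) reference: X[k] = sum_j x[j] * omega^(j*k) mod q."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


if __name__ == "__main__":
    q, omega = 17, 2                     # toy parameters: 2 has order 8 modulo 17
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(ntt(data, omega, q) == naive_ntt(data, omega, q))
```

Within each stage the butterflies are independent of one another and depend only on results of the previous stage, which is the kind of data dependency a dataflow-triggered design can exploit.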
