高级检索

    基于NTRU的格基密钥封装机制高效优化实现

    Efficient Software Implementations of NTRU Lattice-Based Key Encapsulation Mechanisms

    • 摘要: NTRU格是构建实用后量子格基密钥封装机制的重要选择. 格密码的软件优化工程实现对于后量子密码后续的应用部署具有重要意义. CTRU是中国学者提出的基于NTRU格的格密码密钥封装机制. 目前CTRU方案只有CTRU-768完成了C和AVX2实现,且实现有进一步的优化空间,并且CTRU-768的实现无法直接扩展到CTRU-512和CTRU-1024方案实现上. 完成了CTRU512和CTRU1 024及其变体CNTR-512和CNTR-1024首个的优化参考C实现和对应AVX2并行优化实现,并对已有的CTRU-768方案的参考实现和AVX2实现进行优化. 采用混合基数论变换(NTT)加速多项式环乘法,并使用Karatsuba算法加速分解后的小度数多项式环乘法. 此外,结合中心Barrett约减,提出在逆向NTT中进行基于索引的延迟约减. 对于CTRU-1024下较为耗时的多项式求逆,引入了Bernstein快速求逆算法. 进一步地,提供了更加高效的AVX2优化实现方案,利用Intel提出的单指令多数据(SIMD)指令集AVX2,加速了CTRU中的性能瓶颈. 采用层融合和系数置乱减少实现过程中的存取指令. 此外,对Bernstein快速多项式求逆算法进行了向量化优化实现. 对耗时SHA-3哈希模块进行AVX2汇编实现. 相较于最新的CTRU-768 AVX2实现,AVX2优化实现提升了8%~11%. 对于CTRU方案,与参考实现相比,AVX2优化实现在3个方案上的性能提升均非常显著. 对于CTRU方案,与参考实现相比,提出的AVX2优化实现在CTRU-512,CTRU-768,CTRU-1024这3个方案上的性能提升均十分显著,密钥生成、密钥封装、密钥解封装的提升幅度分别为56%~91%,74%~90%,70%~83%.

       

      Abstract: NTRU lattice is an important choice for building a practical post-quantum lattice-based key encapsulation mechanism. The software optimization engineering implementation of lattice cryptography is of great significance for the subsequent application deployment of post-quantum cryptography. CTRU is a lattice-based key encapsulation mechanism based on NTRU lattice proposed by Chinese scholars. At present, there only exists CTRU-768 scheme C and AVX2 implementation, and there is room for further optimization. In addition, the implementation of CTRU-768 cannot be directly extended to the CTRU-512 and CTRU-1024 schemes. This paper completes the first optimized reference C implementation of CTRU-512 and CTRU-1024 schemes and its variant CNTR-512 and CNTR-1024 with and the corresponding AVX2 parallel optimization implementation, and optimizes the existing CTRU-768 reference implementation and AVX2 implementation. It employs mixed radix number theoretic transformation (NTT) to accelerate polynomial multiplication and uses the Karatsuba algorithm to speed up the decomposition of low-degree polynomial ring multiplication. In addition, combined with the central Barrett reduction, this paper proposes index-based delay reduction in reverse NTT. For the time-consuming polynomial inversion under CTRU-1024 scheme, we employ the Bernstein fast inversion algorithm. Furthermore, this paper provides a more efficient AVX2 optimization implementation scheme, which uses the single instruction multiple data (SIMD) instruction set AVX2 proposed by Intel to accelerate the performance bottleneck in CTRU. This paper uses layer merging and coefficient permutation to reduce the load/store instructions. In addition, the Bernstein fast polynomial inversion algorithm is vectorized and optimized using AVX2. We also implement the time-consuming SHA-3 hash module in AVX2 assembly. Compared with the latest CTRU-768 scheme AVX2 implementation, the AVX2 optimized implementation in this paper improves by 8%-11%. For the CTRU scheme, compared with the reference implementation, the performance improvement of the AVX2 optimized implementation in this paper on three sets of parameters is significant. The key generation, key encapsulation, and key decapsulation improvements are 56%-91%, 74%-90%, and 70%-83% respectively.

       

    /

    返回文章
    返回