Abstract:
NTRU lattices are an important choice for building practical post-quantum lattice-based key encapsulation mechanisms (KEMs). Optimized software implementations of lattice-based cryptography are of great significance for the subsequent deployment of post-quantum cryptography. CTRU is a lattice-based KEM built on NTRU lattices and proposed by Chinese scholars. At present, only C and AVX2 implementations of the CTRU-768 scheme exist, and they leave room for further optimization. In addition, the CTRU-768 implementation cannot be directly extended to the CTRU-512 and CTRU-1024 schemes. This paper presents the first optimized reference C implementations of CTRU-512 and CTRU-1024 and of their variants CNTR-512 and CNTR-1024, together with the corresponding AVX2 parallel implementations, and it also optimizes the existing CTRU-768 reference and AVX2 implementations. It employs a mixed-radix number theoretic transform (NTT) to accelerate polynomial multiplication and uses the Karatsuba algorithm to speed up the low-degree polynomial ring multiplications that result from the decomposition. In addition, combined with the central Barrett reduction, this paper proposes index-based delayed reduction in the inverse NTT. For the time-consuming polynomial inversion in the CTRU-1024 scheme, we employ Bernstein's fast inversion algorithm. Furthermore, this paper provides a more efficient AVX2 implementation, which uses Intel's single instruction multiple data (SIMD) instruction set AVX2 to accelerate the performance bottlenecks in CTRU. It uses layer merging and coefficient permutation to reduce the number of load/store instructions. In addition, Bernstein's fast polynomial inversion algorithm is vectorized and optimized with AVX2, and the time-consuming SHA-3 hash module is implemented in AVX2 assembly. Compared with the latest AVX2 implementation of CTRU-768, the AVX2 implementation in this paper improves performance by 8%-11%. For the CTRU scheme, the AVX2 implementation in this paper is significantly faster than the reference implementation on all three parameter sets: key generation, encapsulation, and decapsulation improve by 56%-91%, 74%-90%, and 70%-83%, respectively.
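
To make the central Barrett reduction mentioned above concrete, the following is a minimal C sketch of one common signed variant. It is an illustration under stated assumptions, not the paper's actual routine: the modulus value `Q_ASSUMED = 3457`, the 26-bit shift, and the function name `barrett_reduce_centered` are all hypothetical choices made for this example.

```c
#include <stdint.h>

/* Hypothetical modulus chosen for illustration; the abstract does not state it. */
#define Q_ASSUMED 3457

/* Precomputed approximation of 2^26 / Q, rounded to nearest. */
#define BARRETT_V (((1 << 26) + Q_ASSUMED / 2) / Q_ASSUMED)

/*
 * Signed ("central") Barrett reduction sketch.
 * Returns r with r == a (mod Q_ASSUMED) and r within about Q/2 of zero.
 * Assumes that >> on a negative int32_t is an arithmetic shift, which holds
 * on mainstream compilers but is implementation-defined in ISO C.
 */
static int16_t barrett_reduce_centered(int16_t a)
{
    /* t approximates round(a / Q) using one multiply and one shift. */
    int32_t t = ((int32_t)BARRETT_V * a + (1 << 25)) >> 26;

    /* Subtract the estimated multiple of Q; the remainder fits in 16 bits. */
    return (int16_t)(a - t * Q_ASSUMED);
}
```

Under these assumptions, `barrett_reduce_centered(3456)` yields `-1` rather than `3456`, which is the point of the centered form: keeping coefficients small in magnitude is what allows some reductions inside a transform to be postponed.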