Bitsliced Optimization of SM4 Algorithm with the SIMD Instruction Set

Wang Chuang; Ding Yan; Huang Chenlin; Song Liantao

doi:10.7544/issn1000-1239.202220531

Abstract

SM4 is a commercial block cipher designed in China, and its encryption and decryption performance is one of the critical factors in protecting data confidentiality in information systems. Existing SM4 optimizations focus mainly on hardware designs and software look-up tables; the former depend on specific hardware environments, while the latter are inefficient and vulnerable to side-channel attacks. Bit slicing reorganizes the input data so that block ciphers can be processed efficiently in parallel, and it resists cache-based side-channel attacks. However, existing work on bitsliced block ciphers is tightly coupled to the hardware platform and supports only a single processor architecture, and its parallel processing pipeline is slow to start, so encryption and decryption of small amounts of data cannot take full advantage of advanced instruction sets such as SIMD (single instruction multiple data). To address these problems, we first propose a cross-platform general bitsliced block cipher model that provides a consistent data slicing method across processors with different instruction word lengths. On this basis, we propose a fine-grained bitsliced parallel SM4 optimization for SIMD instruction sets, which shortens the algorithm's startup time through fine-grained plaintext slicing and reorganization together with linear transformation optimization. Experiments show that, compared with the general look-up table-based SM4 algorithm, the optimized bitsliced SM4 algorithm reaches an encryption rate of up to 438.0 MBps and requires as few as 7.0 CPB (cycle/B) to encrypt a byte, improving encryption performance by 80.4% to 430.3% on average.
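To make the idea of bit slicing concrete, the following is a minimal Python sketch, not the paper's implementation. It transposes several 32-bit words so that slice i holds bit i of every word, and then evaluates the standard SM4 linear transform L(B) = B ⊕ (B <<< 2) ⊕ (B <<< 10) ⊕ (B <<< 18) ⊕ (B <<< 24) on all words at once. The function names (`bitslice`, `unbitslice`, `sliced_rotl`, `sm4_l_sliced`) are hypothetical and chosen only for illustration; Python integers stand in for SIMD registers, and the paper's fine-grained slicing and its linear-layer optimization are not reproduced here.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, r):
    """Rotate a 32-bit word left by r bits (reference, one word at a time)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def sm4_l(x):
    """Standard SM4 linear transform L applied to a single 32-bit word."""
    return x ^ rotl32(x, 2) ^ rotl32(x, 10) ^ rotl32(x, 18) ^ rotl32(x, 24)


def bitslice(blocks, width=32):
    """Transpose the words: bit j of slice i is bit i of blocks[j]."""
    slices = [0] * width
    for j, block in enumerate(blocks):
        for i in range(width):
            slices[i] |= ((block >> i) & 1) << j
    return slices


def unbitslice(slices, count):
    """Inverse of bitslice: rebuild `count` words from the slices."""
    blocks = [0] * count
    for i, s in enumerate(slices):
        for j in range(count):
            blocks[j] |= ((s >> j) & 1) << i
    return blocks


def sliced_rotl(slices, r):
    """Rotate every word left by r: in sliced form this only re-indexes slices."""
    width = len(slices)
    return [slices[(i - r) % width] for i in range(width)]


def sm4_l_sliced(slices):
    """SM4 transform L on all sliced words at once, using only XOR."""
    out = list(slices)
    for r in (2, 10, 18, 24):
        rotated = sliced_rotl(slices, r)
        out = [a ^ b for a, b in zip(out, rotated)]
    return out


if __name__ == "__main__":
    words = [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210]
    sliced = bitslice(words)
    result = unbitslice(sm4_l_sliced(sliced), len(words))
    print(result == [sm4_l(w) for w in words])
```

In this representation a rotation costs no bit operations at all, and each XOR acts on one bit position of every word simultaneously, which is why wider registers process more blocks per instruction. The sketch also shows the cost the abstract alludes to: the input must be transposed before any work starts, which is what makes startup expensive for small inputs.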
