Kernel Code Automatic Generation Framework on FT-Matrix

赵宵磊; 陈照云; 时洋; 文梅; 张春元

doi:10.7544/issn1000-1239.202330058

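Abstract (translated from Chinese): Digital signal processors (DSPs) typically adopt very long instruction word (VLIW) and single instruction multiple data (SIMD) architectures to raise overall computing performance, which makes them suitable for high-performance computing, image processing, embedded systems, and other fields. The FT-Matrix DSP, a high-performance general-purpose DSP developed independently by the National University of Defense Technology, reaches its peak computing performance only when the characteristics of its VLIW and SIMD architecture are fully exploited. On the FT-Matrix series, as on the vast majority of processors, highly optimized kernel code and core library functions depend on low-level assembly tools or manual development. However, hand-writing kernel operators requires substantial time and labor to fully release the hardware's performance potential, and expert-level assembly development is especially difficult on VLIW+SIMD processors. To address these problems, we propose a high-performance automatic kernel code-generation framework on FT-Matrix, which introduces the architectural characteristics of the FT-Matrix processor into multi-level kernel code optimization. The framework comprises three layers of optimization components: adaptive loop tiling, automatic vectorization with scalar-vector cooperation, and fine-grained instruction-level optimization. It automatically searches for the optimal loop tiling parameters according to the hardware memory hierarchy and the kernel's data layout, and further introduces instruction selection and data arrangement for automatic vectorization with cooperating scalar and vector units, which improves data reuse and parallelism during kernel execution. In addition, the framework provides an assembly-like intermediate representation on which various instruction-level optimizations are applied to explore more of the instruction-level parallelism (ILP) optimization space. It also offers modules for fast back-end integration and adaptive code generation for other hardware platforms, enabling agile design of efficient kernel code. Experiments show that the kernel benchmark code generated by the framework achieves, on average, 3.25 times the performance of the target DSP's hand-written function library and 20.62 times that of kernel code written in ordinary vector C.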
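The tile-parameter search mentioned in the abstract can be illustrated with a minimal sketch. This is not the framework's actual search procedure, which the abstract does not specify. It assumes a matrix-multiply kernel, an exhaustive search over a small candidate set, and a simple score (floating-point operations per byte held on chip). The names `footprint_bytes` and `search_tile`, the candidate tile sizes, and the `lane_width` and capacity parameters are all hypothetical.

```python
from itertools import product

def footprint_bytes(tm, tn, tk, elem_bytes=4):
    """Bytes needed to hold one A tile (tm x tk), one B tile (tk x tn) and one C tile (tm x tn)."""
    return (tm * tk + tk * tn + tm * tn) * elem_bytes

def search_tile(M, N, K, capacity_bytes, elem_bytes=4, lane_width=16):
    """Return the (tm, tn, tk) tile with the best flops-per-byte score that fits
    in capacity_bytes, or None if no candidate fits.

    Assumptions (hypothetical, for illustration only):
      - tn must be a multiple of lane_width so each row maps onto whole SIMD vectors
      - the score ignores latency, bank conflicts and double buffering
    """
    cand_m = [t for t in (1, 2, 4, 8, 16, 32, 64, 128) if t <= M]
    cand_n = list(range(lane_width, N + 1, lane_width))
    cand_k = [t for t in (8, 16, 32, 64, 128, 256) if t <= K]

    best, best_score = None, -1.0
    for tm, tn, tk in product(cand_m, cand_n, cand_k):
        size = footprint_bytes(tm, tn, tk, elem_bytes)
        if size > capacity_bytes:
            continue  # tile does not fit in the on-chip buffer
        score = (2 * tm * tn * tk) / size  # multiply-add counted as 2 flops
        if score > best_score:
            best, best_score = (tm, tn, tk), score
    return best

if __name__ == "__main__":
    # 64 KiB is a placeholder capacity, not a documented FT-Matrix figure.
    print(search_tile(256, 256, 256, capacity_bytes=64 * 1024))
```

A production search would replace the score with measured or modelled cycle counts and would cover each level of the memory hierarchy. The sketch only shows the shape of the constraint (tile footprint versus capacity) and the objective (data reuse).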

Abstract: Digital signal processors (DSPs) commonly adopt a VLIW-SIMD architecture and facilitate cooperation between scalar and vector units. For FT-Matrix, a typical VLIW-SIMD DSP, peak performance relies on highly optimized kernels. However, hand-crafting kernel operators incurs heavy time and labor overhead to unleash the potential of the DSP hardware. General-purpose compilers suffer from poor portability or performance and struggle to explore the optimization space aggressively, and they overlook optimization opportunities in the scalar-vector cooperation that is specific to each vendor's architecture. We propose Pitaya, a high-performance automatic kernel code-generation framework that introduces the characteristics of FT-Matrix into hierarchical kernel optimizations. The framework has three layers of optimization components: loop tiling, vectorization, and instruction-level optimization. It automatically searches for the optimal tile size according to the memory hierarchy and data layout, and further introduces vectorization with scalar-vector unit cooperation to improve data reuse and parallelism. Because the performance of a VLIW architecture is determined to a great extent by instruction-level parallelism (ILP), Pitaya also provides an assembly intrinsic representation on FT-Matrix DSP, on which it applies diverse instruction-level optimizations to expose more ILP. Experiments show that kernels generated by Pitaya outperform those from the target DSP libraries by 3.25 times and C vector intrinsic kernels by 20.62 times on average.
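To make the link between VLIW performance and ILP concrete, the following is a minimal sketch of greedy bundle packing: independent instructions are grouped into one issue bundle, subject to the number of functional-unit slots. It is a generic textbook-style illustration, not the scheduler used by the framework. The function name `pack_bundles`, the unit names (`mem`, `vec`, `scalar`), the slot counts, and the single-cycle latency assumption are all hypothetical.

```python
def pack_bundles(instrs, deps, slots):
    """Greedily pack instructions into VLIW bundles.

    instrs: list of (name, unit) pairs in program order
    deps:   dict mapping an instruction name to the set of names that must
            have issued in an earlier bundle
    slots:  dict mapping a functional unit to its issue slots per bundle

    Assumption: every instruction completes in one cycle, so a consumer may
    issue in the bundle right after its producers.
    """
    done = set()
    remaining = list(instrs)
    bundles = []
    while remaining:
        used = {}
        bundle = []
        for name, unit in remaining:
            ready = deps.get(name, set()) <= done
            free = used.get(unit, 0) < slots.get(unit, 0)
            if ready and free:
                bundle.append(name)
                used[unit] = used.get(unit, 0) + 1
        if not bundle:
            raise ValueError("no instruction can issue: cyclic dependency or missing unit")
        done.update(bundle)  # results become visible only to later bundles
        remaining = [(n, u) for n, u in remaining if n not in done]
        bundles.append(bundle)
    return bundles

if __name__ == "__main__":
    instrs = [("ld_a", "mem"), ("ld_b", "mem"), ("mul", "vec"),
              ("add", "vec"), ("inc", "scalar")]
    deps = {"mul": {"ld_a", "ld_b"}, "add": {"mul"}}
    slots = {"mem": 2, "vec": 1, "scalar": 1}
    for cycle, bundle in enumerate(pack_bundles(instrs, deps, slots)):
        print(cycle, bundle)
```

In this toy input the two loads and the scalar increment share the first bundle, while the dependent multiply and add each need a bundle of their own. The fewer bundles a kernel needs, the more ILP the schedule has exposed.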
