Abstract:
Digital signal processors (DSPs) commonly adopt a VLIW-SIMD architecture in which scalar and vector units cooperate. FT-Matrix is a typical VLIW-SIMD DSP architecture, and its peak performance relies on highly optimized kernels. However, hand-crafting kernel operators demands heavy time and labor to unleash the potential of the DSP hardware, while general-purpose compilers suffer from poor portability or poor performance and struggle to explore the optimization space aggressively. Moreover, some of the optimization space for scalar-vector unit collaboration, which is specific to each vendor's architectural design, remains overlooked. We propose Pitaya, a high-performance automatic kernel code-generation framework that introduces the characteristics of FT-Matrix into hierarchical kernel optimizations. The framework has three optimization layers: loop tiling, vectorization, and instruction-level optimization. It automatically searches for the optimal tile size according to the memory hierarchy and data layout, and it applies vectorization with scalar-vector unit cooperation to improve data reuse and parallelism. Because the performance of a VLIW architecture is determined to a great extent by instruction-level parallelism (ILP), Pitaya also provides an assembly intrinsic representation for the FT-Matrix DSP, on which it applies diverse instruction-level optimizations to expose more ILP. Experiments show that kernels generated by Pitaya outperform those from the target DSP libraries by 3.25 times and C vector intrinsic kernels by 20.62 times on average.