基于超块的统一分簇与模调度

胡定磊  陈书明  刘春林

Hyperblock-Based Unified Cluster Assignment and Modulo Scheduling

Hu Dinglei, Chen Shuming, and Liu Chunlin

Abstract: To exploit instruction-level parallelism (ILP), very long instruction word (VLIW) processors often use multiple functional units backed by a multi-ported register file. As the number of functional units rises, the number of register file ports grows accordingly and the register file becomes more complex. At some point the multiplexing logic on the register ports can come to dominate the processor's cycle time, making the register file a system bottleneck. A reasonable solution is to partition the register file into independent clusters. Although clustered architectures reduce the number of register file ports per cluster without reducing the processor's ILP, they present new challenges to the compiler, which must assign every operation and operand to a specific cluster and coordinate data movement between clusters to achieve good ILP. This paper proposes a scheduling algorithm for clustered VLIW architectures: hyperblock-based unified cluster assignment and modulo scheduling (HBUCAMS). Compared with a basic block, a hyperblock provides a larger scheduling region for exploiting ILP. Furthermore, because loop bodies containing control flow can be converted into hyperblocks, there are more opportunities to apply modulo scheduling. Instead of performing cluster assignment and modulo scheduling sequentially, HBUCAMS performs them in a single phase. This unified approach is more effective than a phase-ordered approach because it optimizes the code generation problem as a whole rather than searching for a locally optimal solution to each individual step. Experiments in the YHFT-DSP/700 compiler show that the proposed algorithm obtains better results than the ITSS algorithm.
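As a rough illustration of what deciding cluster and schedule slot in a single phase can look like, the following Python sketch places each operation by examining every (cluster, cycle) pair at once and charging an extra latency when a value crosses clusters. It is a toy model under stated assumptions, not the HBUCAMS algorithm from the paper: the function names, the `copy_latency` parameter, the greedy earliest-slot rule, and the restriction to acyclic dependences with no inter-cluster bus limit are all hypothetical simplifications.

```python
def try_schedule(ops, deps, n_clusters, units_per_cluster, copy_latency, ii):
    """Try to place every op for a given initiation interval `ii`.

    ops  : op names in topological order (assumption: no loop-carried edges)
    deps : dict mapping op -> list of (predecessor, latency)
    Returns {op: (cluster, cycle)} or None if some op cannot be placed.
    """
    # Modulo reservation table: units in use per cluster and per slot (cycle mod ii).
    mrt = [[0] * ii for _ in range(n_clusters)]
    placement = {}
    for op in ops:
        best = None  # (cycle, cluster) of the best candidate seen so far
        for c in range(n_clusters):
            # Earliest start on cluster c; operands produced on another
            # cluster pay an extra inter-cluster copy latency.
            earliest = 0
            for pred, lat in deps.get(op, []):
                pc, pt = placement[pred]
                ready = pt + lat + (copy_latency if pc != c else 0)
                earliest = max(earliest, ready)
            # Scanning ii consecutive cycles visits every modulo slot once.
            for t in range(earliest, earliest + ii):
                if mrt[c][t % ii] < units_per_cluster:
                    if best is None or (t, c) < best:
                        best = (t, c)
                    break
        if best is None:
            return None
        t, c = best
        mrt[c][t % ii] += 1
        placement[op] = (c, t)
    return placement


def unified_modulo_schedule(ops, deps, n_clusters, units_per_cluster,
                            copy_latency=1, max_ii=64):
    """Search for the smallest II at which the greedy placement succeeds."""
    total_units = n_clusters * units_per_cluster
    min_ii = -(-len(ops) // total_units)  # resource-bound lower limit (ceiling division)
    for ii in range(max(1, min_ii), max_ii + 1):
        placement = try_schedule(ops, deps, n_clusters, units_per_cluster,
                                 copy_latency, ii)
        if placement is not None:
            return ii, placement
    raise ValueError("no schedule found up to max_ii")


# Hypothetical loop body: c = f(a, b); d = g(c), on two single-unit clusters.
ops = ["a", "b", "c", "d"]
deps = {"c": [("a", 1), ("b", 1)], "d": [("c", 1)]}
ii, placement = unified_modulo_schedule(ops, deps, n_clusters=2, units_per_cluster=1)
print(ii, placement)
```

Because the cluster and the cycle are chosen together, an operation can land on a less loaded cluster when the copy latency is cheaper than waiting for a free slot. A phase-ordered scheme would fix the cluster first and could not make that trade. A realistic implementation would also need to handle loop-carried dependences, predicated operations inside the hyperblock, and limited inter-cluster bandwidth, none of which this sketch models.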
