基于迭代序的流程序局部性分析和优化
Locality Analysis and Optimization for Stream Programs Based on Iteration Sequence
-
摘要: 流编程模型是一种近年来被广泛研究的并行编程模型,它在基于软件管理的流式存储器,如流寄存器文件的流体系结构上得到了良好的应用.但同时也有研究指出流编程模型同样适合于基于硬件管理的一致性cache的体系结构.流编程模型目前最重要的应用背景GPGPU在发展中也逐渐引入通用的数据cache,因此发掘流程序的cache局部性就成为在这类体系结构上提高流程序性能的关键.由于流程序特殊的执行模型,其重用向局部性转化的过程与传统的串行程序不一致,无法直接使用传统的局部性分析方法直接对流程序进行分析.在深入分析了重用向局部性转化过程的基础上,提出了“迭代序”的概念用于描述流和串行程序重用向局部性转化时的不同,同时结合流程序的执行特点面向并行扩展了传统的局部性分析理论,给出了基于迭代序的局部性分析方法.此外,结合局部性分析模型还提出了两种流程序的cache局部性优化方法.在GPGPUSim模拟平台上进行的验证结果表明对流程序局部性的定量分析是有效的,并且提出的优化方法也可以有效改善流程序的cache局部性,提高流程序的性能.Abstract: The stream programming model has been widely studied in recent years, and it is proved to be a good candidate for the stream architecture based on software-managed streaming memory such as stream register file. However, some researches point out that the stream programming model is also beneficial to the hardware-managed coherent cache-based architectures. Besides, general data cache has been integrated into the GPGPU recently, which is the most important scenario where the stream programming model is used. Thus, exploiting the cache locality becomes very critical to improve the performance of the stream programs on these architectures. Due to the particular execution model, the way that the reuse carried by the stream programs is transformed into locality differs from that of serial programs. The traditional locality analysis method cannot be used to analyze the stream programs directly. Based on the detailed analysis of transformation from reuse to locality, we propose the concept of “iteration sequence” to capture the difference between the execution model of stream and serial programs. Then we extend the traditional locality analysis method and propose a general locality analysis method based on the iteration sequence. Further, we propose two locality optimization techniques derived from the locality analysis model. The experimental results obtained on the GPGPUSim simulator illustrate that our quantitative analysis about the locality of stream program is efficient and the locality optimization techniques can evidently improve the locality and the performance.