ISSN 1000-1239 CN 11-1777/TP


A Data-Flow-Like Driven Tiled Stream Processor Architecture and Its Programming Model

徐光 安虹 许牧 刘谷 姚平 任永青 汪芳   

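  1. (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027) (Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190) (xuguang5@mail.ustc.edu.cn)
  • Published: 2010-09-15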

The Architecture and the Programming Model of a Data-Flow-Like Driven Tiled Stream Processor

Xu Guang, An Hong, Xu Mu, Liu Gu, Yao Ping, Ren Yongqing, and Wang Fang   

  1. (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027) (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
  • Online: 2010-09-15

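Abstract: Given the wire delay problem brought about by advances in semiconductor technology, distributed, tiled processor architectures have become very attractive. In a conventional stream processor, the control signals issued by the stream controller suffer long wire delays as they propagate. The clusters of a conventional stream processor consist of a large number of functional units, and because communication among clusters is centrally controlled, the inter-cluster communication network scales poorly with wire delay. A tiled stream processor architecture (TPA-PD) is proposed, which connects tiled components with a distributed network and thereby avoids long wire delays in the propagation of control signals. At the kernel level, TPA-PD uses a data-flow-like execution model, namely explicit data graph execution: the dependences between instructions are statically encoded in the instructions, and the centralized inter-cluster communication of conventional stream processors is turned into dynamically issued, distributed communication, which favors architectural scaling. The new execution model, the instruction set, and the mapping of the stream programming model onto the new architecture are explained. On a cycle-accurate simulator, the software and hardware factors that affect kernel-level execution time are analyzed experimentally; TPA-PD achieves an average speedup of 20% over a conventional stream processor across 8 benchmarks.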

Keywords: wire delay, stream processor, tiled, data-flow-like driven, processor architecture

Abstract: As advances in semiconductor technology increase wire delay, distributed, tiled processor architectures are becoming increasingly attractive. In a conventional stream processor, the control signals issued by the stream controller suffer from long wire delays. Each cluster in a conventional stream processor consists of a large number of functional units, and because communication among clusters is centrally controlled, the inter-cluster communication network scales poorly with wire delay. This paper introduces a tiled stream processor architecture (TPA-PD), in which a distributed network connects the tiled components so that control signals avoid long wire delays. At the kernel level, TPA-PD employs a data-flow-like execution model, namely explicit data graph execution: the dependences between instructions are statically encoded in the instructions themselves, and the centralized inter-cluster communication model is replaced by a dynamically issued, distributed communication model that scales with wire delay. The execution model, the instruction set, the microarchitecture, and the mapping of the stream programming model onto TPA-PD are described. Finally, the software and hardware factors that affect kernel-level execution time are analyzed on a cycle-accurate simulator; TPA-PD achieves an average speedup of 20% over a conventional stream processor across eight benchmarks.

Key words: wire delay, stream processor, tiled, data-flow-like driven, architecture
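
The data-flow-like execution model summarized in the abstract, in which each instruction statically encodes which instructions consume its result and fires once its operands have arrived, can be illustrated with a small sketch. This is a toy model written for illustration only, not the TPA-PD instruction set or simulator: the `Instr` class, the `run_block` function, the two-operand restriction, and the `(consumer_index, operand_slot)` target encoding are all assumptions introduced here.

```python
from collections import deque
from dataclasses import dataclass, field
import operator

# Hypothetical two-operand opcodes; the real instruction set is not given in the abstract.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass
class Instr:
    op: str
    # Statically encoded dependences: each entry is (consumer_index, operand_slot).
    targets: list
    # Operand values that have arrived so far, keyed by slot (0 or 1).
    operands: dict = field(default_factory=dict)


def run_block(instrs, inputs):
    """Execute a block of instructions in data-flow order.

    inputs is a list of (instr_index, operand_slot, value) triples that
    inject the block's input values. Returns (outputs, fired), where
    outputs maps the index of each instruction without targets to its
    result and fired lists instruction indices in firing order.
    """
    ready = deque()

    def deliver(idx, slot, value):
        ins = instrs[idx]
        ins.operands[slot] = value
        if len(ins.operands) == 2:  # both operands present: instruction may fire
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    outputs, fired = {}, []
    while ready:
        idx = ready.popleft()
        ins = instrs[idx]
        result = OPS[ins.op](ins.operands[0], ins.operands[1])
        fired.append(idx)
        if not ins.targets:
            outputs[idx] = result
        # Forward the result directly to the consumers named in the instruction.
        for t_idx, t_slot in ins.targets:
            deliver(t_idx, t_slot, result)
    return outputs, fired


# Usage: compute (a + b) * (a - b) for a = 5, b = 3.
block = [
    Instr("add", [(2, 0)]),  # result goes to slot 0 of instruction 2
    Instr("sub", [(2, 1)]),  # result goes to slot 1 of instruction 2
    Instr("mul", []),        # no targets: its result is the block output
]
outputs, fired = run_block(block, [(0, 0, 5), (0, 1, 3), (1, 0, 5), (1, 1, 3)])
print(outputs, fired)
```

In this sketch no central controller decides the order of execution; an instruction becomes ready purely because its producers delivered operands to it, which is the property the abstract credits for replacing centralized inter-cluster communication with distributed communication.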