ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (6): 1221-1237. doi: 10.7544/issn1000-1239.2016.20150317

• System Architecture •

  • Supported by: 
    National Basic Research Program of China (973 Program) (2011CB302501); National High-Tech Research and Development Program of China (863 Program) (2015AA011204, 2012AA010901); National Science and Technology Major Project of China (2013ZX0102-8001-001-001); Key Program of the National Natural Science Foundation of China (61332009, 61173007)

A Dataflow Cache Processor Frontend Design

Liu Bingtao1,2, Wang Da1, Ye Xiaochun1, Zhang Hao1, Fan Dongrui1, Zhang Zhimin1   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049 (liubingtao@ict.ac.cn)
  • Online: 2016-06-01



Abstract: To exploit both the thread-level parallelism (TLP) and the instruction-level parallelism (ILP) of programs, dynamic multi-core techniques reconfigure several small cores into one more powerful virtual core to match the varying demands of programs. Such a virtual core is usually weaker than a native core occupying the same chip resources. One important reason is that the frontend pipeline stages (fetch, decode, and rename) process instructions in a serialized fashion and are therefore hard to make cooperate after reconfiguration. To solve this problem, we propose a new frontend structure, the dataflow cache, together with a matching vector renaming (VR) mechanism. The dataflow cache exploits the dataflow locality of programs by caching and reusing the data dependencies and related information of instruction basic blocks. With the dataflow cache, a processor core can exploit more instruction-level parallelism and reduce the branch misprediction penalty; furthermore, the virtual core of a dynamic multi-core design can use the dataflow cache to bypass the traditional frontend stages, which solves its frontend cooperation problem. Experiments on the SPEC CPU2006 programs show that the dataflow cache can cover more than 90% of the dynamic instructions of most programs at a limited cost. We then analyze the performance impact of adding the dataflow cache to the pipeline. Finally, experiments show that with a 4-instruction-wide frontend and a 512-entry instruction window, the dataflow cache improves the performance of the virtual core by 9.4% on average, and by up to 28% for some programs.
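The core idea of the abstract can be illustrated with a small sketch: intra-block data dependencies are computed once and cached keyed by the block's start address, so repeated executions of the same block (dataflow locality) skip the serialized dependency analysis. This is a hedged illustration only; the class and function names (`Instr`, `analyze_block`, `DataflowCache`) are hypothetical and not the paper's actual structures.

```python
from dataclasses import dataclass

# Hypothetical instruction: one destination register and its source
# registers. The encoding is illustrative, not the paper's format.
@dataclass(frozen=True)
class Instr:
    dest: str
    srcs: tuple

def analyze_block(block):
    """Compute intra-block data dependencies once: for every source
    operand, record the index of the in-block producing instruction,
    or -1 if the value is live-in (produced before the block)."""
    last_writer = {}   # register name -> index of last in-block writer
    deps = []
    for i, ins in enumerate(block):
        deps.append(tuple(last_writer.get(r, -1) for r in ins.srcs))
        last_writer[ins.dest] = i
    return deps

class DataflowCache:
    """Dependency info cached per basic block, keyed by block start
    address. On a hit the frontend can reuse the stored dependencies
    instead of re-deriving them instruction by instruction."""
    def __init__(self):
        self.entries = {}
        self.hits = self.misses = 0

    def lookup(self, addr, block):
        if addr in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[addr] = analyze_block(block)
        return self.entries[addr]
```

A hot loop body re-executes the same block many times, so after the first miss every iteration hits the cache, which is the locality the paper's coverage measurements (over 90% of dynamic instructions) rely on.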

Key words: processor microarchitecture, instruction cache (ICache), dataflow, instruction renaming, dataflow locality
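The vector renaming idea paired with the dataflow cache can likewise be sketched: because instruction i of a block always gets the physical register base+i, all destinations and sources of a block rename in one parallel step, using the cached producer indices for in-block values and the existing map only for live-in values. All names below are illustrative assumptions, not the paper's actual mechanism.

```python
def vector_rename(dests, srcs_with_deps, reg_map, next_phys):
    """Rename a whole basic block in one step ("vector" renaming sketch).
    dests:          architectural destination of each instruction
    srcs_with_deps: per instruction, tuple of (arch_src, producer_idx)
                    pairs, producer_idx = -1 for live-in values (e.g.
                    as recovered from a dataflow-cache entry)
    reg_map:        architectural -> physical map at block entry
    next_phys:      first free physical register number
    """
    renamed = []
    for i, srcs in enumerate(srcs_with_deps):
        # In-block sources resolve to next_phys + producer index with no
        # serialized dependence on earlier rename steps of this block.
        phys_srcs = tuple(next_phys + p if p >= 0 else reg_map[a]
                          for (a, p) in srcs)
        renamed.append((next_phys + i, phys_srcs))
    # Commit the architectural map once, after the block: the last
    # in-block writer of each architectural register wins.
    new_map = dict(reg_map)
    for i, d in enumerate(dests):
        new_map[d] = next_phys + i
    return renamed, new_map
```

The design point this sketch shows is that the per-instruction read-modify-write of the rename map, normally a serial bottleneck across reconfigured cores, collapses into one bulk allocation plus one bulk map update per block.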
