ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (4): 813-820. doi: 10.7544/issn1000-1239.2017.20160116

• System Architecture •

  1. (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125) (zhang.kun@meac-skl.cn)
  • Published: 2017-04-01
  • Supported by: National High Technology Research and Development Program of China (863 Program) (2015AA01A301); National Natural Science Foundation of China (91430214)

Design of a Pipeline-Coupled Instruction Loop Cache for Many-Core Processors

Zhang Kun, Guo Feng, Zheng Fang, Xie Xianghui   

  1. (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125)
  • Online: 2017-04-01



Abstract: Energy efficiency is a major challenge in the design of future high-performance computers. Since many-core processors are a key building block of such systems, optimizing their micro-architecture is critical to improving energy efficiency. This paper proposes a pipeline-coupled instruction loop cache for many-core processors: a small L0 instruction cache that provides more energy-efficient instruction fetch. As an attempt to tie architecture research closely to hardware implementability, the design treats hardware cost as a key constraint from the beginning. To limit the loop cache's impact on pipeline performance, it adopts a loop-exit prefetching technique: when a loop is detected, the loop cache prefetches the loop's exit path into the cache. This prefetching mechanism ensures that the low-power instruction fetch provided by the loop cache translates into an energy-efficiency gain for the pipeline as a whole. The instruction loop cache is implemented in the gem5 simulator. Experiments on SPEC2006 benchmarks show that, without degrading pipeline performance, a typical configuration reduces instruction-fetch power by 27% and the dynamic power of the pipeline front-end components by 31.5% on average.
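The mechanism the abstract describes, detecting a small loop, serving subsequent fetches from a tiny L0 buffer, and prefetching the loop's exit path so that leaving the loop does not stall the front-end, can be illustrated with a simplified model. Everything below (the class name, the capacity, the four-entry exit prefetch, and triggering detection on a taken short backward branch) is an illustrative assumption for exposition, not the paper's actual design:

```python
# Simplified sketch of an L0 instruction loop cache with loop-exit
# prefetching. All names, sizes, and policies are assumptions.

class LoopCache:
    def __init__(self, capacity=32):
        self.capacity = capacity   # instruction slots in the L0 buffer
        self.lines = {}            # addr -> instruction word
        self.active = False        # currently serving a detected loop

    def on_backward_branch(self, branch_pc, target_pc, l1_fetch):
        """Detect a candidate loop when a short backward branch is taken.

        `l1_fetch(addr)` models a (higher-power) fetch from the L1 I-cache.
        """
        body_len = branch_pc - target_pc + 1
        if 0 < body_len <= self.capacity:
            # Capture the loop body in the L0 buffer.
            self.lines = {a: l1_fetch(a)
                          for a in range(target_pc, branch_pc + 1)}
            # Prefetch the fall-through (exit) path, so that leaving the
            # loop still hits in the buffer instead of stalling fetch.
            for a in range(branch_pc + 1, branch_pc + 5):
                self.lines[a] = l1_fetch(a)
            self.active = True

    def fetch(self, pc, l1_fetch):
        """Serve from the low-power L0 buffer when possible."""
        if self.active and pc in self.lines:
            return self.lines[pc]  # low-energy L0 hit
        self.active = False        # fetch left the cached region
        return l1_fetch(pc)        # fall back to the regular I-cache
```

In this sketch the energy saving comes from the hit path: while `active` is set, fetches are served from the small buffer, and the exit-path prefetch is what lets the pipeline leave the loop without a performance penalty, mirroring the guarantee the paper's prefetching mechanism is designed to provide.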

Key words: loop cache, many-core processor, energy-efficiency, instruction cache, architecture optimization
