Liu Bingtao, Wang Da, Ye Xiaochun, Zhang Hao, Fan Dongrui, Zhang Zhimin. A Dataflow Cache Processor Frontend Design[J]. Journal of Computer Research and Development, 2016, 53(6): 1221-1237. DOI: 10.7544/issn1000-1239.2016.20150317

A Dataflow Cache Processor Frontend Design

  • Published Date: May 31, 2016
  • To exploit both the thread-level parallelism (TLP) and the instruction-level parallelism (ILP) of programs, dynamic multi-core techniques can reconfigure multiple small cores into a single, more powerful virtual core. A virtual core is usually weaker than a native core built from equivalent chip resources. One important reason is that the fetch, decode, and rename frontend stages are hard to make cooperate after reconfiguration because of their inherently serialized processing. To solve this problem, we propose a new frontend design called the dataflow cache, together with a corresponding vector renaming (VR) mechanism. By caching and reusing the data dependencies and other information of instruction basic blocks, the dataflow cache exploits the dataflow locality of programs. First, a processor core with a dataflow cache can exploit more instruction-level parallelism and suffer a lower branch misprediction penalty; second, the virtual core in a dynamic multi-core can solve its frontend problem by using the dataflow cache to bypass the traditional frontend stages. Experiments on the SPEC CPU2006 programs show that a dataflow cache of limited cost can cover 90% of the dynamic instructions. We then analyze the performance effect of adding the dataflow cache to the pipeline. Finally, experiments show that with a 4-instruction-wide frontend and a 512-entry instruction window, the dataflow cache improves the performance of the virtual core by 9.4% on average, and by up to 28% for some programs. (A minimal sketch of the lookup-and-bypass idea is given below.)
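The abstract describes the mechanism only at a high level. The following is a minimal sketch, assuming a hypothetical DataflowCache indexed by basic-block start PC: on a hit, a pre-decoded block with its intra-block dependency information is reinjected into the backend, bypassing fetch, decode, and rename; on a miss, the conventional frontend processes the block and fills the cache for later reuse. All type and member names here (CachedUop, DataflowBlock, src_dep, lookup, fill) are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// One pre-analyzed instruction inside a cached basic block.
// Intra-block source dependencies are stored as offsets to the
// producing instruction within the block, so the block can be
// reinjected into the backend without re-running decode/rename.
struct CachedUop {
    uint8_t opcode;      // decoded operation (illustrative)
    int8_t  src_dep[2];  // producer offset within block, -1 = live-in
    uint8_t live_in[2];  // architectural register if the source is a live-in
    uint8_t dest_arch;   // architectural destination register
};

struct DataflowBlock {
    std::vector<CachedUop> uops;  // pre-decoded, dependency-annotated body
};

// Hypothetical dataflow cache: maps a basic-block start PC to its
// cached dependency information. A hit bypasses the serialized
// frontend stages; a miss falls back to the conventional frontend,
// which then fills the cache.
class DataflowCache {
public:
    std::optional<DataflowBlock> lookup(uint64_t block_pc) const {
        auto it = table_.find(block_pc);
        if (it == table_.end()) return std::nullopt;  // miss
        return it->second;                            // hit: reuse
    }
    void fill(uint64_t block_pc, DataflowBlock block) {
        table_[block_pc] = std::move(block);
    }
private:
    std::unordered_map<uint64_t, DataflowBlock> table_;
};

int main() {
    DataflowCache dfc;

    // First encounter: the block passes through the normal frontend,
    // which extracts its intra-block dependencies once and caches them.
    DataflowBlock blk;
    blk.uops.push_back({/*opcode=*/1, {-1, -1}, {3, 4}, /*dest=*/5});
    blk.uops.push_back({/*opcode=*/2, {0, -1}, {0, 6}, /*dest=*/7});
    dfc.fill(0x400120, blk);

    // Later encounters: a hit lets the (virtual) core inject the block
    // directly into the backend, skipping fetch, decode, and rename.
    if (auto hit = dfc.lookup(0x400120)) {
        std::cout << "dataflow cache hit: " << hit->uops.size()
                  << " uops reused without decode/rename\n";
    }
    return 0;
}
```

Recording producers as offsets relative to the block start, rather than as renamed physical registers, is one plausible way to keep the cached dependency information position-independent, so that a whole block could be renamed in bulk on each reuse, in the spirit of the paper's vector renaming (VR) mechanism.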
