ISSN 1000-1239 CN 11-1777/TP

• System Architecture •

• Published: 2020-06-01
• Contact: zhangjun_whu@whu.edu.cn

Performance Optimization of Cache Subsystem in General Purpose Graphics Processing Units: A Survey

Zhang Jun1,2, Xie Jingcheng2, Shen Fanfan5, Tan Hai3, Wang Lümeng4, He Yanxiang4

1 (Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology, Eastern China University of Technology, Nanchang 330013); 2 (College of Information Engineering, Eastern China University of Technology, Nanchang 330013); 3 (School of Innovation and Entrepreneurship, Eastern China University of Technology, Nanchang 330013); 4 (School of Computer Science, Wuhan University, Wuhan 430072); 5 (Nanjing Audit University, Nanjing 211815)
• Online: 2020-06-01
• Supported by:
This work was supported by the National Natural Science Foundation of China (61662002, 61972293, 61902189), the Project of Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology (JELRGBDT201905), and the Natural Science Foundation of Jiangsu Province (BK20180821).

Abstract: With advances in process technology and architecture, the parallel computing performance of GPGPUs (general-purpose graphics processing units) has improved substantially, and GPGPUs are now widely used in high-performance and high-throughput computing. A GPGPU achieves high parallel performance because it supports thousands of concurrent threads, which hides the long latency of memory accesses. However, irregular computation and memory access patterns in some applications degrade the performance of the memory subsystem. In particular, contention for the on-chip cache can become severe and prevent the GPGPU from reaching its peak performance. Alleviating this contention and optimizing on-chip cache performance has therefore become one of the main approaches to GPGPU optimization. Current research on on-chip cache performance optimization focuses on five aspects: TLP (thread-level parallelism) throttling, memory access reordering, data flux enhancement, LLC (last-level cache) optimization, and new architecture designs based on NVM (non-volatile memory). This paper surveys on-chip cache optimization methods from these five aspects and concludes by discussing promising directions for future research on on-chip cache optimization. The survey is intended as a reference for research on the GPGPU cache subsystem.