Citation: | Chen Bingzhang, Liu Wei, Yu Xiaoyu. C-AMAT Measurement Method Based on Cache Access Mode and Its Application in Graph Computing[J]. Journal of Computer Research and Development, 2024, 61(4): 824-839. DOI: 10.7544/issn1000-1239.202220818 |
Graph application is an important branch in the field of big data. Although graph analysis has more significant performance advantages than traditional relational databases in displaying the relationship between entities, the irregular memory access pattern caused by a large number of random accesses in graph processing destroys the time and space locality of memory access, thus causing great performance pressure on the off-chip memory system. Therefore, how to correctly measure the performance of graph application in memory system is crucial for efficient graph application architecture optimization. As an extension of average memory access time (AMAT), concurrent average memory access time (C-AMAT) takes into account the locality and concurrency of memory access, and can more accurately evaluate and analyze the performance of modern processors in the storage system. However, the C-AMAT model ignores the fact that the lower-level cache layer of the processor accesses serially, which will lead to the inaccuracy of the calculation. At the same time, it is difficult to obtain the parameters required for the calculation due to the “pure miss cycle” and other reasons, which also makes it difficult for C-AMAT to be applied in practice. In order to match the computing model of C-AMAT with the memory access mode in modern computers, we propose parallel C-AMAT (PC-AMAT) and serial C-AMAT (SC-AMAT) based on C-AMAT. PC-AMAT and SC-AMAT respectively extend and characterize the computing model of C-AMAT from the parallel and serial access modes of cache. On this basis, we design and implement a “pure miss cycle” extraction algorithm to avoid the huge hardware overhead caused by direct measurement. The experimental results show that the correlation between PC-AMAT and SC-AMAT, and IPC is stronger than that of C-AMAT in single-core and multi-core mode. Finally, PC-AMAT and SC-AMAT are used to measure and analyze the memory performance of graph application, based on which the optimization strategy of graph application access is proposed.
[1] |
Basak A, Li Shuangchen, Hu Xing, et al. Analysis and optimization of the memory hierarchy for graph processing workloads[C]// Proc of the 25th IEEE Int Symp on High Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2019: 373−386
|
[2] |
Sun Xianhe, Wang Dawei. Concurrent average memory access time[J]. Computer, 2013, 47(5): 74−80
|
[3] |
Binkert N, Beckmann B, Black G, et al. The gem5 simulator[J]. SIGARCH Computer Architecture News, 2011, 39(2): 1−7 doi: 10.1145/2024716.2024718
|
[4] |
Beutel A, Akoglu L, Faloutsos C. Graph-based user behavior modeling: From prediction to fraud detection[C]// Proc of the 21st ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2015: 2309−2310
|
[5] |
Barabási A L, Albert R. Emergence of scaling in random networks[J]. Science, 1999, 286(5439): 509−512 doi: 10.1126/science.286.5439.509
|
[6] |
孙相征,张云泉,王婷,等. 对角线稀疏矩阵的 SpMV 自适应性能优化[J]. 计算机研究与发展,2013,50(3):648−656
Sun Xiangzheng, Zhang Yunquan, Wang Ting, et al. Auto- tuning of SpMV for diagonal sparse matrices[J]. Journal of Computer Research and Development, 2013, 50(3): 648−656 (in Chinese)
|
[7] |
Balaji V, Crago N, Jaleel A, et al. P-opt: Practical optimal cache replacement for graph analytics[C]// Proc of the 27th IEEE Int Symp on High Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2021: 668−681
|
[8] |
汤嘉武,郑龙,廖小飞,等. 面向高性能图计算的高效高层次综合方法[J]. 计算机研究与发展,2021,58(3):467−478 doi: 10.7544/issn1000-1239.2021.20190679
Tang Jiawu, Zheng Long, Liao Xiaofei, et al. Effective high-level synthesis for high-performance graph processing[J]. Journal of Computer Research and Development, 2021, 58(3): 467−478 (in Chinese) doi: 10.7544/issn1000-1239.2021.20190679
|
[9] |
Faldu P, Diamond J, Grot B. Domain-specialized cache management for graph analytics[C]// Proc of the 26th IEEE Int Symp on High Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2020: 234−248
|
[10] |
Cooksey R, Jourdan S, Grunwald D. A stateless, content-directed data prefetching mechanism[J]. ACM SIGPLAN Notices, 2002, 37(10): 279−290 doi: 10.1145/605432.605427
|
[11] |
Agarwal A, Roy K, Vijaykumar T N. Exploring high bandwidth pipelined cache architecture for scaled technology[C]// Proc of the 6th Design, Automation and Test in Europe Conf and Exhibition. Piscataway, NJ: IEEE, 2003: 778−783
|
[12] |
Kroft D. Lockup-free instruction fetch/prefetch cache organization[C]// Proc of the 25th Int Symp on Computer Architecture. New York: ACM, 1998: 195−201
|
[13] |
Beamer S, Asanović K, Patterson D. The GAP benchmark suite[J]. arXiv preprint, arXiv: 1508. 03619, 2015
|
[14] |
Wang Dawei, Sun Xianhe. APC: A novel memory metric and measurement methodology for modern memory systems[J]. IEEE Transactions on Computers, 2013, 63(7): 1626−1639
|
[15] |
Soltis D, Gibson M. Itanium 2 processor microarchitecture overview[C]// Proc of the 14th Hot Chips. Los Alamitos: IEEE Computer Society, 2002: 44−54
|
[16] |
姚永斌. 超标量处理器设计[M]. 北京:清华大学出版社,2014
Yao Yongbin. Superscalar Processor Design[M]. Beijing: Tsinghua University Press, 2014 (in Chinese)
|
[17] |
Gold X. 6130-Intel-WikiChip[EB/OL]. [2022-07-16].https://en.wikichip.org/ wiki/WikiChip
|
[18] |
孙贤和. C-AMAT:大数据时代的数据存取模型[J]. 中国计算机学会通讯,2014,10(6):19−22
Sun Xianhe. C-AMAT: Data access model in the age of big data[J]. Communications of the CCF, 2014, 10(6): 19−22 (in Chinese)
|
[19] |
Fog A. The Microarchitecture of Intel, AMD and VIA CPUs : An optimization guide for assembly programmers and compiler makers[R]. Denmark: Copenhagen University College of Engineering, 2022
|
[20] |
Beamer S, Asanovic K, Patterson D. Direction-optimizing breadth-first search[C/OL]// Proc of the 12th Int Conf on High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2012 [2022-07-16]. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=646845 8
|
[21] |
Brin S, Page L. The anatomy of a large-scale hypertextual web search engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107−117 doi: 10.1016/S0169-7552(98)00110-X
|
[22] |
Madduri K, Ediger D, Jiang K, et al. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets[C]// Proc of the 23rd IEEE Int Symp on Parallel & Distributed Processing. Piscataway, NJ: IEEE, 2009: 1−8
|
[23] |
Sutton M, Ben-Nun T, Barak A. Optimizing parallel graph connectivity computation via subgraph sampling[C]// Proc of the 32nd IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2018: 12−21
|
[24] |
da Rosa Righi R, Lehmann M, Gomes M M, et al. A survey on global management view: Toward combining system monitoring, resource management, and load prediction[J]. Journal of Grid Computing, 2019, 17(9): 473−502
|
[25] |
VanVoorhis C R W, Morgan B L. Understanding power and rules of thumb for determining sample sizes[J]. Tutorials in Quantitative Methods for Psychology, 2007, 3(2): 43−50 doi: 10.20982/tqmp.03.2.p043
|
[26] |
叶楠,郝子宇,郑方,等. BFS算法与众核处理器的适应性研究[J]. 计算机研究与发展,2015,52(5):1187−1197 doi: 10.7544/issn1000-1239.2015.20140004
Ye Nan, Hao Ziyu, Zheng Fang, et al. Adaptability of BFS algorithm and many-core processor[J]. Journal of Computer Research and Development, 2015, 52(5): 1187−1197 (in Chinese) doi: 10.7544/issn1000-1239.2015.20140004
|
[27] |
Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach[M]. Amsterdam: Elsevier, 2011
|
[1] | Su Wen, Zhang Longbing, Gao Xiang, Su Menghao. A Cache Locking and Direct Cache Access Based Network Processing Optimization Method[J]. Journal of Computer Research and Development, 2014, 51(3): 681-690. |
[2] | Zhao Xinjie, Wang Tao, Guo Shize, Liu Huiying. Cache Attacks on Block Ciphers[J]. Journal of Computer Research and Development, 2012, 49(3): 453-468. |
[3] | Zhao Xinjie, Wang Tao, Guo Shize, Liu Huiying. Cache Attacks on Block Ciphers[J]. Journal of Computer Research and Development, 2012, 49(3): 453-468. |
[4] | Wu Junjie, Yang Xuejun, Zeng Kun, Zhang Baida, Feng Quanyou, Liu Guanghui, and Tang Yuhua. DOOC: A Software/Hardware Co-managed Cache Architecture for Reducing Cache Thrashing[J]. Journal of Computer Research and Development, 2008, 45(12): 2020-2032. |
[5] | Zhou Qian, Feng Xiaobing, and Zhang Zhaoqing. Software Pipelining with Cache Profiling Information[J]. Journal of Computer Research and Development, 2008, 45(5): 834-840. |
[6] | Zhou Hongwei, Zhang Chengyi, and Zhang Minxuan. A Method of Statistics-Based Cache Leakage Power Estimation[J]. Journal of Computer Research and Development, 2008, 45(2): 367-374. |
[7] | Chen Gang, Zhang Weiwen, and Wu Guoxin. Replacement Solutions for Streaming Cache on P2P Network[J]. Journal of Computer Research and Development, 2007, 44(11): 1857-1865. |
[8] | Zhou Xuehai, Yu Jie, Li Xi, and Wand Zhigang. Research on Reliability Evaluation of Cache Based on Instruction Behavior[J]. Journal of Computer Research and Development, 2007, 44(4): 553-559. |
[9] | Huan Dandan, Li Zusong, Hu Weiwu, Liu Zhiyong. A Cache Adaptive Write Allocate Policy[J]. Journal of Computer Research and Development, 2007, 44(2): 348-354. |
[10] | Ma Zhiqiang, Ji Zhenzhou, and Hu Mingzeng. A Low-Power Instruction Cache Design Based on Record Buffer[J]. Journal of Computer Research and Development, 2006, 43(4): 744-751. |