A Cross-Platform Fine-Grained Performance Analysis Technique for Redundant Zeros

You Xin; Yang Hailong; Lei Kelun; Kong Xianghao; Xu Jun; Luan Zhongzhi; Qian Depei

doi:10.7544/issn1000-1239.202111189


Citation: You Xin, Yang Hailong, Lei Kelun, Kong Xianghao, Xu Jun, Luan Zhongzhi, Qian Depei. A Cross-Platform Fine-Grained Performance Analysis Technique for Redundant Zeros[J]. Journal of Computer Research and Development, 2023, 60(5): 1164-1176. DOI: 10.7544/issn1000-1239.202111189


1. School of Computer Science and Engineering, Beihang University, Beijing 100191
2. Science and Technology on Space System Simulation Laboratory, Beijing Simulation Center, Beijing 100854

Funds: This work was supported by the National Key Research and Development Program of China (2022ZD0117805), the National Natural Science Foundation of China (62072018, U22A2028), and the Fundamental Research Funds for the Central Universities.


Author Bio:
You Xin: born in 1997. PhD candidate. His main research interests include high performance computing, performance analysis tools, and compiler optimization.

Yang Hailong: born in 1985. PhD, associate professor. Member of CCF. His main research interests include high performance computing, distributed and parallel computing, computer architecture, and deep learning compilation.

Lei Kelun: born in 2000. Undergraduate. His main research interest is performance analysis tools.

Kong Xianghao: born in 1999. Undergraduate. His main research interest is high performance computing.

Xu Jun: born in 1984. Senior engineer. Her main research interest is modeling and simulation of weapon equipment systems.

Luan Zhongzhi: born in 1971. PhD, associate professor. His main research interests include distributed computing, high performance computing, parallel computing, computer architecture, cloud computing, and big data.

Qian Depei: born in 1952. PhD, professor. Academician of the Chinese Academy of Sciences. His main research interests include distributed computing, high performance computing, and computer architecture.
Received Date: November 29, 2021
Revised Date: June 06, 2022
Available Online: February 26, 2023

Abstract

Redundant zeros cause software inefficiencies: large numbers of zero values are loaded or used in trivial computation, which significantly wastes memory and compute resources. However, compiler toolchains still cannot effectively identify redundant operations on zeros, and hardware optimizations for redundant zeros have not yet been adopted in commercial hardware. ZeroSpy can detect redundant zeros buried in software and report sufficient information for performance optimization, but it is limited to Intel platforms and incurs large overhead. To overcome these limitations, we propose DrZero, a cross-platform tool. DrZero detects redundant zeros on both x86 and ARM platforms and implements a novel online analysis based on buffered tracing to lower the overhead. For the ARM platform, we propose floating-point estimation via dataflow analysis, which estimates the data type of a memory operand for subsequent detection. The evaluation results demonstrate that DrZero detects redundant zeros with code-centric and data-centric analysis at 45.31× and 54.20× performance overhead on the x86 platform and at 14.12× and 13.40× on the ARM platform, respectively. In addition, DrZero incurs 37.2% and 55.8% lower time overhead than ZeroSpy with code-centric and data-centric analysis on the x86 platform, respectively. Guided by the optimization opportunities that DrZero reveals, eliminating redundant zeros in the evaluated applications yields speedups of up to 1.76× on the x86 platform and up to 2.12× on the ARM platform. DrZero is open source at https://github.com/buaa-hipo/zerospy-drcctprof.
Keywords:
- binary instrumentation
- redundant zero
- software inefficiency
- performance analysis and optimization
- cross-platform
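
The idea of buffered online analysis mentioned in the abstract can be illustrated with a toy model. The sketch below is not DrZero's implementation: the class `BufferedZeroProfiler`, its methods, the buffer capacity, the site labels, and the zero-byte-fraction metric are all hypothetical choices made for illustration. It assumes that each load is recorded cheaply into a buffer and that the more expensive zero counting runs only when the buffer fills.

```python
from collections import defaultdict


class BufferedZeroProfiler:
    """Toy model (hypothetical): buffer load records, analyze them in batches."""

    def __init__(self, capacity=4):
        self.capacity = capacity              # records held before a flush
        self.buffer = []                      # pending (site, value, size) records
        self.total_bytes = defaultdict(int)   # bytes loaded per code site
        self.zero_bytes = defaultdict(int)    # zero bytes loaded per code site

    def record_load(self, site, value, size):
        # Cheap fast path: append only; analysis is deferred to flush().
        self.buffer.append((site, value, size))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        # Batch analysis: count zero bytes in each buffered unsigned value.
        for site, value, size in self.buffer:
            raw = value.to_bytes(size, "little")
            self.total_bytes[site] += size
            self.zero_bytes[site] += raw.count(0)
        self.buffer.clear()

    def report(self):
        # Drain remaining records, then return the zero-byte fraction per site.
        self.flush()
        return {s: self.zero_bytes[s] / self.total_bytes[s]
                for s in self.total_bytes}


profiler = BufferedZeroProfiler(capacity=4)
for v in [0, 0, 7, 0, 0]:
    profiler.record_load("loop.c:12", v, 4)   # mostly-zero 4-byte loads
profiler.record_load("init.c:3", 0x01020304, 4)  # no zero bytes
report = profiler.report()
```

In this model a site with a high zero-byte fraction, such as the hypothetical `loop.c:12`, would be a candidate for closer inspection, while `init.c:3` would not.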

