Shen Jie, Long Biao, Jiang Hao, Huang Chun. Implementation and Optimization of Vector Trigonometric Functions on Phytium Processors[J]. Journal of Computer Research and Development, 2020, 57(12): 2610-2620. DOI: 10.7544/issn1000-1239.2020.20190721
(College of Computer, National University of Defense Technology, Changsha 410073)
Funds: This work was supported by the National Science and Technology Major Project (HeGaoJi) of China (2018ZX01029-103), the National Natural Science Foundation of China (61902407), and the Hunan Provincial Natural Science Foundation of China (2018JJ3616).
Benefiting from SIMD (single instruction multiple data) vectorization, the floating-point compute capability of processors has increased substantially. However, current SIMD units and SIMD instruction sets support only basic operations such as arithmetic operations (addition, subtraction, multiplication, and division) and logical operations, and provide no direct support for floating-point transcendental functions. Since transcendental functions are the most time-consuming functions in floating-point computing, improving their performance has become a key point in math library optimization. In this paper, we design and propose a new method that utilizes SIMD units to vectorize and optimize trigonometric functions, one class of transcendental functions. Whereas most vector implementations use a single unified algorithm to process all floating-point inputs, we select and port several optimizable branches from the scalar implementations to handle different ranges of floating-point numbers, and then apply a series of optimization techniques to accelerate the vectorized scalar code. By combining the piecewise computation of the scalar implementations with the vectorization advantage of the vector implementations, our method optimizes branch processing in vector trigonometric functions, reduces redundant computation, and increases the utilization of SIMD units. Experimental results show that our method meets the accuracy requirements and effectively improves the performance of trigonometric functions: compared with the original vector trigonometric functions, the optimized functions achieve an average speedup of 2.04x.
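To make the branch-selection idea in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual implementation. It assumes an ARMv8 target with 128-bit NEON SIMD (as found on Phytium FT-series cores); the function names (vsin_sketch, vsin_array_sketch), the tiny-argument threshold, and the truncated polynomial coefficients are all hypothetical choices for illustration only and do not reflect the library's real coefficients or accuracy bound. The sketch shows how a fast branch lifted from a scalar implementation (returning x directly when every lane is tiny) can skip redundant polynomial work inside a vectorized routine.

```c
/*
 * Illustrative sketch only: branch selection inside a vectorized sin,
 * assuming ARMv8 NEON intrinsics. Coefficients are a truncated Taylor
 * series chosen for readability, not for library-grade accuracy.
 */
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: evaluate sin on two doubles at a time. */
static float64x2_t vsin_sketch(float64x2_t x)
{
    const float64x2_t small = vdupq_n_f64(0x1p-26);  /* |x| tiny => sin(x) ~= x */
    float64x2_t ax = vabsq_f64(x);

    /* Fast branch ported from the scalar code: if every lane is tiny,
     * return x directly and skip range reduction and the polynomial. */
    uint64x2_t tiny = vcltq_f64(ax, small);
    if (vgetq_lane_u64(tiny, 0) && vgetq_lane_u64(tiny, 1))
        return x;

    /* Generic branch: short odd polynomial sin(x) ~= x + c3*x^3 + c5*x^5.
     * Valid only for modest |x| in this sketch (no full range reduction). */
    const float64x2_t c3 = vdupq_n_f64(-1.0 / 6.0);
    const float64x2_t c5 = vdupq_n_f64( 1.0 / 120.0);
    float64x2_t x2 = vmulq_f64(x, x);
    float64x2_t p  = vfmaq_f64(c3, c5, x2);              /* c3 + c5*x^2      */
    return vfmaq_f64(x, vmulq_f64(p, x2), x);             /* x + p*x^2*x      */
}

/* Vectorized loop over an array; leftover tail elements use scalar libm. */
void vsin_array_sketch(const double *in, double *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        vst1q_f64(out + i, vsin_sketch(vld1q_f64(in + i)));
    for (; i < n; ++i)
        out[i] = sin(in[i]);
}
```

In this sketch the fast branch is taken only when both SIMD lanes qualify, which keeps the vector pipeline full while still avoiding the redundant polynomial evaluation that a single unified algorithm would always perform; the paper's method applies the same piecewise idea to more branches and argument ranges.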