Implementation and Optimization of Vector Trigonometric Functions on Phytium Processors

Shen Jie; Long Biao; Jiang Hao; Huang Chun

doi:10.7544/issn1000-1239.2020.20190721


(College of Computer, National University of Defense Technology, Changsha 410073) (j.shen@nudt.edu.cn)

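Funding: National Science and Technology Major Project "Hegaoji" (2018ZX01029-103); National Natural Science Foundation of China (61902407); Hunan Provincial Natural Science Foundation (2018JJ3616)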


CLC number: TP311
Publication history
- Published: 2020-11-30

Implementation and Optimization of Vector Trigonometric Functions on Phytium Processors

(College of Computer, National University of Defense Technology, Changsha 410073)

Funds: This work was supported by the National Science and Technology Major Projects of Hegaoji (2018ZX01029-103), the National Natural Science Foundation of China (61902407), and Hunan Provincial Natural Science Foundation of China (2018JJ3616).

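Abstract

Abstract: Thanks to single instruction multiple data (SIMD) vectorization, the floating-point computing capability of processors has multiplied. However, current SIMD vector units and instruction sets support only basic operations such as addition, subtraction, multiplication, division, and logical operations, and provide no direct support for floating-point transcendental functions. Because these are the most time-consuming class of functions in floating-point computation, improving their performance has become a focus of low-level math library optimization. Targeting the trigonometric functions among the transcendental functions, this paper proposes a method for designing, implementing, and optimizing vector trigonometric functions using SIMD vector units. The method combines the piecewise computation of scalar math libraries with the vectorized implementation of vector math libraries, and it adds and optimizes branch handling in vector trigonometric functions. This both reduces redundant computation in the function implementations and improves vector unit utilization when branches occur. Experiments on Phytium processors show that the proposed optimization preserves the accuracy of the vector trigonometric functions while effectively improving their performance, with an average speedup of 2.04x over the original vector trigonometric functions.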
Keywords:
- vector trigonometric functions
- segmented computing
- SIMD vectorization
- performance optimization
- Phytium processors
Abstract: Thanks to SIMD (single instruction multiple data) vectorization, the floating-point compute capability of processors has increased substantially. However, current SIMD units and instruction sets support only basic operations, such as arithmetic (addition, subtraction, multiplication, and division) and logical operations, and provide no direct support for floating-point transcendental functions. Because transcendental functions are the most time-consuming functions in floating-point computing, improving their performance has become a key goal of math library optimization. In this paper, we propose a method that uses SIMD units to vectorize and optimize trigonometric functions, one class of transcendental functions. Whereas most vector implementations process all floating-point inputs with a single unified algorithm, we select several optimizable branches from the scalar implementations and import them to handle different ranges of floating-point inputs. We then apply a series of optimization techniques to accelerate the vectorized scalar code. By combining the piecewise computing of the scalar implementations with the vectorization of the vector implementations, our method optimizes branch processing in vector trigonometric functions, reduces redundant computation, and increases the utilization of SIMD units. Experimental results on Phytium processors show that our method meets the accuracy requirements and effectively improves the performance of trigonometric functions. Compared with the original vector trigonometric functions, the optimized functions achieve an average speedup of 2.04x.
Keywords:
- vector trigonometric functions
- segmented computing
- SIMD vectorization
- performance optimization
- Phytium processors
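
The idea of importing a range-specific branch into a vector routine can be illustrated with a minimal sketch. It is not the authors' implementation: the lane count, the pi/4 threshold, the Taylor coefficients, and the names `vsin`, `_sin_poly`, and `_cos_poly` are all assumptions chosen for illustration, and SIMD lanes are emulated with a Python list. When every lane holds a small argument, the sketch skips range reduction and the cosine polynomial; otherwise it falls back to a unified path that reduces each lane and selects among four quadrant results.

```python
import math

LANES = 4  # hypothetical vector width; real SIMD registers are emulated by lists


def _sin_poly(r):
    # Degree-13 Taylor polynomial for sin(r); adequate for |r| <= pi/4.
    r2 = r * r
    return r * (1 + r2 * (-1 / 6 + r2 * (1 / 120 + r2 * (-1 / 5040
           + r2 * (1 / 362880 + r2 * (-1 / 39916800 + r2 / 6227020800))))))


def _cos_poly(r):
    # Degree-12 Taylor polynomial for cos(r); adequate for |r| <= pi/4.
    r2 = r * r
    return 1 + r2 * (-1 / 2 + r2 * (1 / 24 + r2 * (-1 / 720
           + r2 * (1 / 40320 + r2 * (-1 / 3628800 + r2 / 479001600)))))


def vsin(xs):
    """Sine of each lane in xs, with a fast branch for all-small inputs."""
    # Fast branch (assumed threshold): all lanes already lie in [-pi/4, pi/4],
    # so no range reduction and no cosine polynomial are needed.
    if all(abs(x) <= math.pi / 4 for x in xs):
        return [_sin_poly(x) for x in xs]

    # Unified branch: reduce x = k*(pi/2) + r, evaluate both polynomials,
    # then pick the result by quadrant, as a SIMD select would.
    # The plain subtraction below is only accurate for moderate |x|.
    out = []
    for x in xs:
        k = round(x * (2 / math.pi))
        r = x - k * (math.pi / 2)
        s, c = _sin_poly(r), _cos_poly(r)
        out.append((s, c, -s, -c)[k % 4])
    return out


if __name__ == "__main__":
    print(vsin([0.1, -0.5, 0.7, 0.0]))   # takes the fast branch
    print(vsin([0.1, 2.0, -3.5, 10.0]))  # takes the unified branch
```

The fast branch is taken only when all lanes qualify, which keeps every lane doing useful work; a production routine would use a more careful range reduction for large arguments and tuned minimax coefficients instead of Taylor terms.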