    Shen Jie, Long Biao, Jiang Hao, Huang Chun. Implementation and Optimization of Vector Trigonometric Functions on Phytium Processors[J]. Journal of Computer Research and Development, 2020, 57(12): 2610-2620. DOI: 10.7544/issn1000-1239.2020.20190721

    Implementation and Optimization of Vector Trigonometric Functions on Phytium Processors


      Abstract: Benefiting from SIMD (single instruction multiple data) vectorization, the floating-point compute capability of processors has increased substantially. However, current SIMD units and SIMD instruction sets support only basic operations such as arithmetic operations (addition, subtraction, multiplication, and division) and logical operations, and provide no direct support for floating-point transcendental functions. Since transcendental functions are the most time-consuming functions in floating-point computing, improving their performance has become a key goal of math library optimization. In this paper, we propose a method that uses SIMD units to vectorize and optimize trigonometric functions, a class of transcendental functions. Whereas most vector implementations use a single unified algorithm to process all floating-point inputs, we select and port several optimizable branches from the scalar implementations to handle different ranges of floating-point numbers, and we apply a series of optimization techniques to accelerate the vectorized code. By combining the piecewise computation of the scalar implementations with the vectorization advantages of the vector implementations, our method optimizes branch processing in vector trigonometric functions, reduces redundant computation, and increases the utilization of the SIMD units. Experimental results on Phytium processors show that our method meets the accuracy requirements and effectively improves the performance of the trigonometric functions; compared with the original vector trigonometric functions, the average speedup of the optimized functions is 2.04x.
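
      The sketch below is not the paper's implementation. It is a minimal illustration, in C with AArch64 NEON intrinsics (Phytium processors implement the ARMv8 architecture), of the branch handling described in the abstract: when every lane of a vector falls into an easy range, a cheap polynomial branch is taken and the expensive argument reduction is skipped; otherwise the function falls back to a slower per-element path. The function name vsin_f64x2, the pi/4 threshold, and the Taylor coefficients are illustrative assumptions, not details from the paper; a production kernel would use a minimax polynomial and a vectorized reduction path.

#include <arm_neon.h>
#include <math.h>

/* Illustrative Taylor coefficients for sin(x) ~ x + x^3*(S1 + S2*x^2 + S3*x^4);
 * a real math library would use a higher-degree minimax fit instead. */
#define S1 (-1.0 / 6.0)
#define S2 ( 1.0 / 120.0)
#define S3 (-1.0 / 5040.0)

/* Hypothetical vector sine for two doubles: fast branch when both lanes
 * satisfy |x| <= pi/4 (no argument reduction needed), scalar fallback
 * otherwise. */
void vsin_f64x2(const double *in, double *out)
{
    float64x2_t x    = vld1q_f64(in);
    float64x2_t ax   = vabsq_f64(x);
    uint64x2_t  mask = vcleq_f64(ax, vdupq_n_f64(0.7853981633974483)); /* pi/4 */

    if (vgetq_lane_u64(mask, 0) & vgetq_lane_u64(mask, 1)) {
        /* Fast branch: short odd polynomial evaluated with FMA (Horner). */
        float64x2_t x2 = vmulq_f64(x, x);
        float64x2_t p  = vdupq_n_f64(S3);
        p = vfmaq_f64(vdupq_n_f64(S2), p, x2);             /* S2 + S3*x^2    */
        p = vfmaq_f64(vdupq_n_f64(S1), p, x2);             /* S1 + (...)*x^2 */
        float64x2_t r = vfmaq_f64(x, vmulq_f64(x, x2), p); /* x + x^3*p      */
        vst1q_f64(out, r);
    } else {
        /* Slow branch: per-element libm call handles argument reduction
         * and special values (NaN, Inf, huge arguments). */
        out[0] = sin(in[0]);
        out[1] = sin(in[1]);
    }
}

      Restricting the fast branch to the case where all lanes pass the range test keeps the SIMD unit fully utilized on the common inputs while the rare, expensive cases are still handled correctly, which is the trade-off the branch optimization described in the abstract aims at.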

       

