高级检索
    石嵩, 李宏亮, 朱巍. 阵列众核处理器上的高效归并排序算法[J]. 计算机研究与发展, 2016, 53(2): 362-373. DOI: 10.7544/issn1000-1239.2016.20148246
    引用本文: 石嵩, 李宏亮, 朱巍. 阵列众核处理器上的高效归并排序算法[J]. 计算机研究与发展, 2016, 53(2): 362-373. DOI: 10.7544/issn1000-1239.2016.20148246
    Shi Song, Li Hongliang, Zhu Wei. Efficient Merge Sort Algorithms on Array-Based Manycore Architectures[J]. Journal of Computer Research and Development, 2016, 53(2): 362-373. DOI: 10.7544/issn1000-1239.2016.20148246
    Citation: Shi Song, Li Hongliang, Zhu Wei. Efficient Merge Sort Algorithms on Array-Based Manycore Architectures[J]. Journal of Computer Research and Development, 2016, 53(2): 362-373. DOI: 10.7544/issn1000-1239.2016.20148246

    阵列众核处理器上的高效归并排序算法

    Efficient Merge Sort Algorithms on Array-Based Manycore Architectures

    • 摘要: 排序是计算机科学中最基本的问题之一,随着众核处理器结构的不断发展,设计众核结构上的高效排序算法具有重要意义.众核处理器的一个重要方向是阵列众核处理器,根据阵列众核处理器的结构特点,提出了2种面向阵列众核结构的高效归并排序算法,通过利用DMA(direct memory access)多缓冲机制提高访存效率、深度平衡归并策略保持众多核心之间的负载均衡、SIMD(single instruction multiple data)归并方法提高归并计算效率以及片上交换归并策略提高片上数据重用率,大幅度提高了阵列众核处理器的排序性能.在异构融合阵列众核处理器DFMC(deeply-fused many-core)原型系统的实验结果表明,算法排序速度达647 MKeys/s(million keys per second),其排序效率(排序速度/峰值性能)是NVIDIA GPU上最快的归并排序算法(GTX580平台)的3.3倍,是Intel Xeon Phi上最快的归并排序算法的2.7倍.最后,建立了阵列众核处理器上归并排序算法的性能分析模型,利用该模型分析了主要结构参数与算法性能的关系,对阵列众核处理器的研究有一定的指导意义.

       

      Abstract: Sorting is one of the most fundamental problems in the field of computer science. With the rapid development of manycore processors, it shows great importance to design efficient sort algorithms on manycore architecture. An important trend of manycore processors is array-based manycore architecture. This paper presents two efficient merge sort algorithms on array-based manycore architectures. Our algorithms achieve high performance by using DMA(direct memory access) multi-buffering strategy to improve the memory accessing efficiency, deeply balanced merging strategy to keep load-balancing between cores, SIMD(single instruction multiple data) merging method to exploit fine-grained parallelism and on-chip swap merging strategy to reuse on-chip data. Experimental results on DFMC(deeply-fused many-core) prototype system achieve a sorting rate of 647 MKeys/s(million keys per second), and the sorting efficiency(sorting rate/peak performance) is 3.3x higher than state-of-the-art GPU merge sort on GTX580, and 2.7x higher than the fastest merge sort on Intel Xeon Phi. Additionally, this paper establishes an analytical model to characterize the performance of our algorithms. Based on the analytical model, we analyze the influence of the main structural parameters to the performance of the algorithms, which will contribute to the study of many-core architecture.

       

    /

    返回文章
    返回