ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (2): 362-373.doi: 10.7544/issn1000-1239.2016.20148246

• 系统结构 • 上一篇    下一篇

阵列众核处理器上的高效归并排序算法

石嵩,李宏亮,朱巍   

  1. (江南计算技术研究所 江苏无锡 214083) (shisong1990@163.com)
  • 出版日期: 2016-02-01
  • 基金资助: 
    国家“八六三”高技术研究发展计划基金项目(2014AA01A301);“核高基”国家科技重大专项基金项目(2013zx0102-8001-001-001)

Efficient Merge Sort Algorithms on Array-Based Manycore Architectures

Shi Song, Li Hongliang, Zhu Wei   

  1. (Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083)
  • Online: 2016-02-01

摘要: 排序是计算机科学中最基本的问题之一,随着众核处理器结构的不断发展,设计众核结构上的高效排序算法具有重要意义.众核处理器的一个重要方向是阵列众核处理器,根据阵列众核处理器的结构特点,提出了2种面向阵列众核结构的高效归并排序算法,通过利用DMA(direct memory access)多缓冲机制提高访存效率、深度平衡归并策略保持众多核心之间的负载均衡、SIMD(single instruction multiple data)归并方法提高归并计算效率以及片上交换归并策略提高片上数据重用率,大幅度提高了阵列众核处理器的排序性能.在异构融合阵列众核处理器DFMC(deeply-fused many-core)原型系统的实验结果表明,算法排序速度达647 MKeys/s(million keys per second),其排序效率(排序速度/峰值性能)是NVIDIA GPU上最快的归并排序算法(GTX580平台)的3.3倍,是Intel Xeon Phi上最快的归并排序算法的2.7倍.最后,建立了阵列众核处理器上归并排序算法的性能分析模型,利用该模型分析了主要结构参数与算法性能的关系,对阵列众核处理器的研究有一定的指导意义.

关键词: 阵列众核, 归并排序, 排序网络, 单指令多数据流, 单程序多数据流, 片上通信

Abstract: Sorting is one of the most fundamental problems in the field of computer science. With the rapid development of manycore processors, it shows great importance to design efficient sort algorithms on manycore architecture. An important trend of manycore processors is array-based manycore architecture. This paper presents two efficient merge sort algorithms on array-based manycore architectures. Our algorithms achieve high performance by using DMA(direct memory access) multi-buffering strategy to improve the memory accessing efficiency, deeply balanced merging strategy to keep load-balancing between cores, SIMD(single instruction multiple data) merging method to exploit fine-grained parallelism and on-chip swap merging strategy to reuse on-chip data. Experimental results on DFMC(deeply-fused many-core) prototype system achieve a sorting rate of 647 MKeys/s(million keys per second), and the sorting efficiency(sorting rate/peak performance) is 3.3x higher than state-of-the-art GPU merge sort on GTX580, and 2.7x higher than the fastest merge sort on Intel Xeon Phi. Additionally, this paper establishes an analytical model to characterize the performance of our algorithms. Based on the analytical model, we analyze the influence of the main structural parameters to the performance of the algorithms, which will contribute to the study of many-core architecture.

Key words: array-based manycore, merge sort, sort network, SIMD, SPMD, on-chip communication

中图分类号: