• 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
Advanced Search
Kong Hao, Lu Wenyan, Chen Yan, Yan Guihai, Li Xiaowei. Survey of Sort Acceleration Methods on FPGA[J]. Journal of Computer Research and Development, 2024, 61(3): 780-798. DOI: 10.7544/issn1000-1239.202220789
Citation: Kong Hao, Lu Wenyan, Chen Yan, Yan Guihai, Li Xiaowei. Survey of Sort Acceleration Methods on FPGA[J]. Journal of Computer Research and Development, 2024, 61(3): 780-798. DOI: 10.7544/issn1000-1239.202220789

Survey of Sort Acceleration Methods on FPGA

Funds: This work was supported by the National Natural Science Foundation of China (62002340, 61872336, 62090020), the Strategic Priority Research Program of Chinese Academy of Sciences (XDB44030100), and the Youth Innovation Promotion Association CAS (Y201923).
More Information
  • Author Bio:

    Kong Hao: born in 1995. PhD candidate. Student member of CCF. His main research interests include database accelerator and domain-specific computer architecture

    Lu Wenyan: born in 1990. PhD, associate professor. Member of CCF. His main research interests include deep learning accelerator, database accelerator, and domain-specific computer architecture and heterogeneous computing

    Chen Yan: born in 1985. Master. His main research interests include big data, database, and heterogeneous computing

    Yan Guihai: born in 1982. PhD, professor, PhD supervisor. Member of CCF. His main research interests include computer architecture, domain-specific accelerator design, and intelligent chip architecture

    Li Xiaowei: born in 1964. PhD, professor, PhD supervisor. Fellow of CCF. His main research interests include VLSI design, reliability design, and fault-tolerant computing

  • Received Date: September 13, 2022
  • Revised Date: April 17, 2023
  • Available Online: November 30, 2023
  • For sort acceleration on FPGA, the selection and optimization of various performance metrics, such as latency, throughput, power efficiency, hardware utilization and bandwidth efficiency, etc., are of critical importance. We compare the evolution of performance-driven sort acceleration, with advances in larger data size, more data types, more algorithm support, hardware-software cooperation and new hardware-based design; we analyze the problems and optimization strategies faced at different stages of design, implementation, testing and so on. Among the numerous sorting algorithms, merge sort becomes mainstream due to its excellent hardware parallelism, scalability and simple control logic. Sort acceleration is an architectural design that is deeply tied to specific application scenarios. We analyze the architectural adjustments made from the perspective of database system acceleration for resource competition, data arrangement, unique operations and diversity of user requests problems faced in databases. At last, to address the problems and shortcomings of existing studies, we provide an outlook on future directions in terms of distributed sort acceleration for very large data scale, the introduction of new hardware devices such as data processing unit, and the improvement of auxiliary tool chains such as high level synthesis to drive the iterative update of sort acceleration design.

  • [1]
    Marcelino R, Neto H C, Cardoso J M. Unbalanced FIFO sorting for FPGA-based systems[C] //Proc of the 16th IEEE Int Conf on Electronics, Circuits and Systems. Piscataway, NJ: IEEE, 2009: 431−434
    [2]
    Ortiz J, Andrews D. A configurable high-throughput linear sorter system[C/OL] //Proc of Int Symp on Parallel & Distributed Processing, Workshops and PhD Forum. Piscataway, NJ: IEEE, 2010[2023-03-20].https://ieeexplore.ieee.org/abstract/document/5470730
    [3]
    Sklyarov V, Skliarova I, Mihhailov D, et al. Implementation in FPGA of address-based data sorting[C] //Proc of the 21st Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2011: 405−410
    [4]
    Koch D, Torresen J. FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting[C] //Proc of the 19th ACM/SIGDA Int Symp on Field Programmable Gate Arrays. New York: ACM, 2011: 45−54
    [5]
    Mueller R, Teubner J, Alonso G. Sorting networks on FPGAs[J]. The VLDB Journal, 2012, 21(1): 1−23
    [6]
    Zuluaga M, Milder P, Püschel M. Computer generation of streaming sorting networks [C] //Proc of the 49th Annual Design Automation Conf. New York: ACM, 2012: 1241−1249
    [7]
    Sklyarov V, Skliarova I. High-performance implementation of regular and easily scalable sorting networks on an FPGA[J]. Microprocessors and Microsystems, 2014, 38(5): 470−484 doi: 10.1016/j.micpro.2014.03.003
    [8]
    Chen Ren, Siriyal S, Prasanna V. Energy and memory efficient mapping of bitonic sorting on FPGA[C] //Proc of the 23rd ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2015: 240−249
    [9]
    Matsumoto N, Nakano K, Ito Y. Optimal parallel hardware k-sorter and top k-sorter, with FPGA implementations[C] //Proc of the 14th Int Symp on Parallel and Distributed Computing. Piscataway, NJ: IEEE, 2015: 138−147
    [10]
    Kobayashi R, Kise K. Face: Fast and customizable sorting accelerator for heterogeneous many-core systems [C] //Proc of the 9th Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Piscataway, NJ: IEEE, 2015: 49−56
    [11]
    Srivastava A, Chen Ren, Prasanna V, et al. A hybrid design for high performance large-scale sorting on FPGA[C/OL] //Proc of Int Conf on Reconfigurable Computing and FPGAs. Piscataway, NJ: IEEE, 2015[2023-03-20].https://ieeexplore.ieee.org/abstract/document/7393322
    [12]
    Song Wei, Koch D, Luján M, et al. Parallel hardware merge sorter[C] //Proc of the 24th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2016: 95−102
    [13]
    Zhang Chi, Chen Ren, Prasanna V. High throughput large scale sorting on a CPU-FPGA heterogeneous platform[C] //Proc of the 30th IEEE Int Parallel and Distributed Processing Symp Workshops. Piscataway, NJ: IEEE, 2016: 148−155
    [14]
    Usui T, Van Chu T, Kise K. A cost-effective and scalable merge sorter tree on FPGAs[C] //Proc of the 4th Int Symp on Computing and Networking. Piscataway, NJ: IEEE, 2016: 47−56
    [15]
    Mashimo S, Van Chu T, Kise K. High-performance hardware merge sorter [C/OL]//Proc of the 25th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2017[2023-03-20].https://ieeexplore.ieee.org/abstract/document/7966636
    [16]
    Jun S W, Xu Shuotao. Terabyte sort on FPGA-accelerated flash storages [C] //Proc of the 25th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2017: 17−24
    [17]
    Elsayed E A, Kise K. Design and evaluation of a configurable hardware merge sorter for various output records[C] //Proc of the 12th Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Piscataway, NJ: IEEE, 2018: 201−208
    [18]
    Saitoh M, Elsayed E A, Van Chu T, et al. A high-performance and cost-effective hardware merge sorter without feedback datapath [C] //Proc of the 26th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2018: 197−204
    [19]
    Papaphilippou P, Brooks C, Luk W. FLiMS: Fast lightweight merge sorter [C] //Proc of Int Conf on Field-Programmable Technology. Piscataway, NJ: IEEE, 2018: 78−85
    [20]
    Manev K, Koch D. Large utility sorting on FPGAs [C] //Proc of Int Conf on Field-Programmable Technology. Piscataway, NJ: IEEE, 2018: 334−337
    [21]
    Elsayed E A, Kise K. Towards an efficient hardware architecture for odd-even based merge sorter [C] //Proc of the 13th Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Piscataway, NJ: IEEE, 2019: 249−256
    [22]
    Papaphilippou P, Brooks C, Luk W. An adaptable high-throughput FPGA merge sorter for accelerating database analytics[C] //Proc of the 30th Int Conf on Field-Programmable Logic and Applications. Piscataway, NJ: IEEE, 2020: 65−72
    [23]
    Samardzic N, Qiao Weikang, Aggarwal V, et al. Bonsai: High-performance adaptive merge tree sorting[C] //Proc of the 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 282−294
    [24]
    Elsayed E A, Kise K. High-performance and hardware-efficient odd-even based merge sorter[J]. IEICE Transactions on Information and Systems, 2020, 103(12): 2504−2517
    [25]
    Chen Han, Madaminov S, Ferdman M, et al. FPGA-accelerated samplesort for large data sets[C] //Proc of the 28th ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2020: 222−232
    [26]
    Qiao Weikang, Oh J, Guo Licheng, et al. FANS: FPGA-accelerated near-storage sorting[C] //Proc of the 29th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2021: 106−114
    [27]
    Kobayashi R, Miura K, Fujita N, et al. A sorting library for FPGA implementation in OpenCL programming [C/OL] //Proc of the 11th Int Symp on Highly Efficient Accelerators and Reconfigurable Technologies. New York: ACM, 2021[2023-03-20].https://dl.acm.org/doi/abs/10.1145/3468044.3468054
    [28]
    Salamat S, Haj Aboutalebi A, Khaleghi B, et al. NASCENT: Near-storage acceleration of database sort on SmartSSD[C] //Proc of the 29th ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2021: 262−272
    [29]
    Romanous B, Rezvani M, Huang Junjie, et al. High-performance parallel radix sort on FPGA[C] //Proc of the 28th Annual Int Symp on Field-Programmable Custom Computing Machines. Piscataway, NJ: IEEE, 2020: 224−224
    [30]
    Casper J, Olukotun K. Hardware acceleration of database operations[C] //Proc of the 22nd ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2014: 151−160
    [31]
    Sukhwani B, Thoennes M, Min H, et al. Large payload streaming database sort and projection on FPGAs[C] //Proc of the 25th Int Symp on Computer Architecture and High Performance Computing. Piscataway, NJ: IEEE, 2013: 25−32
    [32]
    Xu Shuotao, Bourgeat T, Huang Tianhao, et al. Aquoman: An analytic-query offloading machine[C] //Proc of the 53rd Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2020: 386−399
    [33]
    Wu L, Lottarini A, Paine T K, et al. Q100: The architecture and design of a database processing unit[J]. ACM SIGARCH Computer Architecture News, 2014, 42(1): 255−268 doi: 10.1145/2654822.2541961
    [34]
    郭诚欣,陈红,孙辉,等. 基于现代硬件的并行内存排序方法综述[J]. 计算机学报,2017,40(9):2070−2092 doi: 10.11897/SP.J.1016.2017.02070

    Guo Chengxin, Chen Hong, Sun Hui, et al. Parallelism of in-memory sorting algorithm on modern hardware[J]. Chinese Journal of Computers, 2017, 40(9): 2070−2092 (in Chinese) doi: 10.11897/SP.J.1016.2017.02070
    [35]
    Alquaied F A, Almudaifer A I, AlShaya M A. A novel high-speed parallel sorting algorithm based on FPGA[C/OL] //Proc of Saudi Int Electronics, Communications and Photonics Conf. Piscataway, NJ: IEEE, 2011[2023-03-20].https://ieeexplore.ieee.org/abstract/document/5877001
    [36]
    吕伟新,李清清,娄俊岭. FPGA 比较矩阵排序法及在中值滤波器中的应用[J]. 电子器件,2012,35(1):34−38 doi: 10.3969/j.issn.1005-9490.2012.01.009

    Lü Weixin, Li Qingqing, Lou Junling. FPGA comparison matrix sorting method and application in median filter[J]. Chinese Journal of Electron Devices, 2012, 35(1): 34−38 (in Chinese) doi: 10.3969/j.issn.1005-9490.2012.01.009
    [37]
    Sogabe Y, Maruyama T. FPGA acceleration of short read mapping based on sort and parallel comparison[C/OL] //Proc of the 24th Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2014[2023-03-20].https://ieeexplore.ieee.org/abstract/document/6927404
    [38]
    Fang Jian, Mulder Y T, Hidders J, et al. In-memory database acceleration on FPGAs: A survey[J]. The VLDB Journal, 2020, 29(1): 33−59 doi: 10.1007/s00778-019-00581-w
    [39]
    Papaphilippou P, Luk W. Accelerating database systems using FPGAs: A survey[C/OL] //Proc of the 28th Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2018[2023-03-20].https://ieeexplore.ieee.org/abstract/document/8533481
    [40]
    Harkins J, El-Ghazawi T, El-Araby E, et al. Performance of sorting algorithms on the SRC 6 reconfigurable computer[C] //Proc of Int Conf on Field-Programmable Technology. Piscataway, NJ: IEEE, 2005: 295−296
    [41]
    Sukhwani B, Thoennes M, Min Hong, et al. A hardware/software approach for database query acceleration with FPGAs[J]. International Journal of Parallel Programming, 2015, 43(6): 1129−1159 doi: 10.1007/s10766-014-0327-4
    [42]
    Chamberlain R D, Ganesan N. Sorting on architecturally diverse computer systems[C] //Proc of the 3rd Int Workshop on High-Performance Reconfigurable Computing Technology and Applications. New York: ACM, 2009: 39−46
    [43]
    Gupta PK. Accelerating datacenter workloads[C/OL] //Proc of the 26th Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2016[2023-03-20].https://www.fpl2016.org/slides/Gupta%20--%20Accelerating%20Datacenter%20Workloads.pdf
    [44]
    Cho M, Brand D, Bordawekar R, et al. PARADIS: An efficient parallel algorithm for in-place radix sort[J]. Proceedings of the VLDB Endowment, 2015, 8(12): 1518−1529 doi: 10.14778/2824032.2824050
    [45]
    Stehle E, Jacobsen H A. A memory bandwidth-efficient hybrid radix sort on GPUs [C] //Proc of ACM Int Conf on Management of Data. New York: ACM, 2017: 417−432
    [46]
    Stuecheli J. OpenCAPI ─ A New Standard for High Performance Memory, Acceleration and Networks [S/OL]. OpenCAPI Consortium, 2017[2023-03-20].https://opencapi.org/2017/04/21/opencapi-new-standard-high-performance-memory-acceleration-networks/
    [47]
    Wu L, Barker R J, Kim M A, et al. Navigating big data with high-throughput, energy-efficient data partitioning [C] //Proc of the 40th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2013: 249−260
    [48]
    Nyberg C, Shah M. Sort Benchmark[S/OL]. Hawaii: The Sort Benchmark Committee, 1987[2023-03-20]. http://sortbenchmark.org/
    [49]
    Transaction Processing Performance Council. TPC-H Benchmark H Standard Specification, Revision 3.0.1 [S/OL]. San Francisco, CA: TPC, 1993[2023-03-20].https://www.tpc.org/tpch/
    [50]
    Layer C, Pfleiderer H J. A reconfigurable recurrent bitonic sorting network for concurrently accessible data [C] //Proc of Int Conf on Field Programmable Logic and Applications. Berlin: Springer, 2004: 648−657
    [51]
    Farmahini-Farahani A, Duwe III H J, Schulte M J, et al. Modular design of high-throughput, low-latency sorting units[J]. IEEE Transactions on Computers, 2012, 62(7): 1389−1402
    [52]
    Saitoh M, Kise K. Very massive hardware merge sorter [C] //Proc of Int Conf on Field-Programmable Technology. Piscataway, NJ: IEEE, 2018: 86−93
    [53]
    Lu Wenyan, Chen Yan, Wu Jingya, et al. DOE: Database offloading engine for accelerating SQL processing [C] //Proc of the 38th Int Conf on Data Engineering Workshops. Piscataway, NJ: IEEE, 2022: 129−134
    [54]
    张东站,苏志锋,林子雨,等. 基于关系数据库的Top- K聚合关键词查询[J]. 计算机研究与发展,2014,51(4):918−929 doi: 10.7544/issn1000-1239.2014.20120645

    Zhang Dongzhan, Su Zhifeng, Lin Ziyu, et al. Top- K aggregation keyword search over relational databases[J]. Journal of Computer Research and Development, 2014, 51(4): 918−929 (in Chinese) doi: 10.7544/issn1000-1239.2014.20120645
    [55]
    The Apache Software Foundation. Apache Arrow [S/OL]. Berkeley, CA: The Apache Software Foundation, 2018[2023-03-20].https://arrow.apache.org/
    [56]
    Levenson A, Mokashi A, Noland B, et al. Apache Parquet [S]. Berkeley, CA: Apache Parquet Committers and PMC, 2016: 325−335
    [57]
    Shafranovich Y. Common Format and MIME Type for Comma-separated Values (CSV) Files [S/OL]. 2005[2023-03-20].https://www.rfc-editor.org/rfc/rfc4180
    [58]
    Dang V B, Mohajerani K, Gaj K. High-speed hardware architectures and FPGA benchmarking of crystals-kyber, ntru, and saber[J]. IEEE Transactions on Computers, 2022, 72(2): 306−320
    [59]
    Peltenburg J, Van Straten J, Wijtemans L, et al. Fletcher: A framework to efficiently integrate FPGA accelerators with apache arrow [C] //Proc of the 29th Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2019: 270−277
    [60]
    Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65−76 doi: 10.1145/1498765.1498785
    [61]
    Da Silva B, Braeken A, D’Hollander E H, et al. Performance modeling for FPGAs: Extending the Roofline model with high-level synthesis tools[J]. International Journal of Reconfigurable Computing, 2013, 2013(7): 7−17
    [62]
    Chen Xinyu, Yao Chen, Bajaj R, et al. Is FPGA useful for hash joins? [C/OL]// Proc of the 10th Annual Conf on Innovative Data Systems Research. [2023-03-20].https://www.cidrdb.org/cidr2020/papers/p27-chen-cidr20.pdf
  • Related Articles

    [1]Guo Jinyang, Shao Chuanming, Wang Jing, Li Chao, Zhu Haojin, Guo Minyi. Programming and Developing Environment for FPGA Graph Processing: Survey and Exploration[J]. Journal of Computer Research and Development, 2020, 57(6): 1164-1178. DOI: 10.7544/issn1000-1239.2020.20200106
    [2]Shi Song, Li Hongliang, Zhu Wei. Efficient Merge Sort Algorithms on Array-Based Manycore Architectures[J]. Journal of Computer Research and Development, 2016, 53(2): 362-373. DOI: 10.7544/issn1000-1239.2016.20148246
    [3]Wang Tong, Jiang Haitao, Zhu Daming. A New Lower Bound for the Distance of Sorting Permutations by Short Block Moves[J]. Journal of Computer Research and Development, 2015, 52(11): 2622-2627. DOI: 10.7544/issn1000-1239.2015.20148259
    [4]Lu Min, Huang Yalou, Xie Maoqiang, Wang Yang, Liu Jie, Liao Zhen. Cost-Sensitive Listwise Ranking Approach[J]. Journal of Computer Research and Development, 2012, 49(8): 1738-1746.
    [5]Zheng Liping, Chan Bin, Wang Wenping, Liu Xiaoping, Cao Li, Kuang Zhengzheng. Remote Visualization Based on Distributed Rendering Framework[J]. Journal of Computer Research and Development, 2012, 49(7): 1438-1449.
    [6]Zhang Zhiqiang, Song Weitao, Xie Xiaoqin. An Efficient Ontology Ranking Algorithm—MIDSRank[J]. Journal of Computer Research and Development, 2011, 48(6): 1077-1088.
    [7]Hao Fanchang, Luan Junfeng, Zhu Daming, Zhang Peng, and Li Ming. A Faster Algorithm for Sorting Genomes by Reciprocal Translocation, Insertion and Deletion[J]. Journal of Computer Research and Development, 2010, 47(11): 2011-2023.
    [8]Yang Lei and Song Tao. The Array-Based Bucket Sort Algorithm[J]. Journal of Computer Research and Development, 2007, 44(2): 341-347.
    [9]Zhu En, Yin Jianping, and Zhang Guomin. Multiple Reference Minutiae Based Fingerprint Matching[J]. Journal of Computer Research and Development, 2005, 42(10): 1733-1739.
    [10]Qin Liangxi, Shi Zhongzhi. SFP-Max—A Sorted FP-Tree Based Algorithm for Maximal Frequent Patterns Mining[J]. Journal of Computer Research and Development, 2005, 42(2): 217-223.
  • Cited by

    Periodical cited type(0)

    Other cited types(1)

Catalog

    Article views (230) PDF downloads (136) Cited by(1)

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return