高级检索
    王文迪, 汤 文, 段 勃, 张春明, 张佩珩, 孙凝晖. 基于Hash索引的高通量基因序列比对并行加速技术研究[J]. 计算机研究与发展, 2013, 50(11): 2463-2471.
    引用本文: 王文迪, 汤 文, 段 勃, 张春明, 张佩珩, 孙凝晖. 基于Hash索引的高通量基因序列比对并行加速技术研究[J]. 计算机研究与发展, 2013, 50(11): 2463-2471.
    Wang Wendi, Tang Wen, Duan Bo, Zhang Chunming, Zhang Peiheng, Sun Ninghui. Parallel Accelerator Design for High-Throughput DNA Sequence Alignment with Hash-Index[J]. Journal of Computer Research and Development, 2013, 50(11): 2463-2471.
    Citation: Wang Wendi, Tang Wen, Duan Bo, Zhang Chunming, Zhang Peiheng, Sun Ninghui. Parallel Accelerator Design for High-Throughput DNA Sequence Alignment with Hash-Index[J]. Journal of Computer Research and Development, 2013, 50(11): 2463-2471.

    基于Hash索引的高通量基因序列比对并行加速技术研究

    Parallel Accelerator Design for High-Throughput DNA Sequence Alignment with Hash-Index

    • 摘要: 近年来随着高通量基因测序技术的迅速发展,测序成本和周期都得到了大幅降低.然而,新一代测序技术海量数据生成能力以及各类测序算法蕴含的高并发性却对现有计算机的运算能力提出了新挑战.以一个基于Hash索引算法实现的开源重测序程序(PerM)为例,研究了在商用多核CPU上加速该应用程序的关键技术.在一个64核SMP系统上的实验结果证明,提出的优化技术可以使Cache缺失率降低90%,性能提升4~11倍.接下来探讨了在一个包含Xilinx LX330 FPGA的加速卡上设计实现专用并行加速系统的相关问题.作为原型验证系统,在基于FPGA的PCIe加速卡上设计并实现了包含11个处理单元的脉动陈列并行计算系统.和Intel Xeon X7550 8核CPU相比,提出的并行加速器有30~65倍性能功耗比优势.

       

      Abstract: In recent years, due to the rapid development of high-throughput next generation sequencing (NGS) technologies, the sequencing cost and time have been greatly reduced. However, both the explosion of the generated NGS data and the massively parallel computation pose great challenges to the capability of existing computers. We take an open-source re-sequencing algorithm based on hash-index, called PerM, as an example to investigate the optimizations for accelerating NGS with commercial multi-core CPUs as well as with customized parallel architectures. Firstly, we optimize the original algorithm by reordering the bucket accessing sequences so that data locality in shared cache is improved. Secondly, to exclude the empty hash buckets, we propose a hash-index compression algorithm, which coincides with the sequential access nature of the optimized algorithm. The experiments on a 64-cores SMP (Intel Xeon X7550) show that the optimized algorithm reduces LLC miss ratio to about 10% of the original algorithm, therefore the overall performance can be improved by 4 to 11 times. Furthermore, a parallel accelerator architecture is designed and evaluated on our customized FPGA accelerator card with a Xilinx LX330 FPGA resident. As a prototype, a systolic array of 100 PEs is built, which operates at 175MHz. The performance of the proposed parallel accelerator architecture is justified by the reported speedup of 30 to 65 times over an 8-cores CPU.

       

    /

    返回文章
    返回