A Graph-Partitioning-Based Parallel Algorithm for Whole-Genome Assembly

A New Data Clustering Algorithm for Parallel Whole-Genome Shotgun Sequence Assembly

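Summary: This paper proposes a graph-partitioning-based parallel algorithm for whole-genome assembly. The algorithm recasts the data partitioning problem as a graph partitioning problem, which resolves the load imbalance across nodes found in traditional data partitioning algorithms. In addition, when building the relation graph, the algorithm makes effective use of the length and mate-pair information between reads provided by WGS sequencing, so that the read-relation graph reflects the relationships among the data more accurately and the partitioning becomes more accurate. Experimental results show that the algorithm accurately partitions a variety of simulated and real data sets, and that partition quality is markedly better than with traditional data partitioning algorithms.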

Abstract: This paper presents a data clustering method, based on graph partitioning, for parallel whole-genome shotgun sequence assembly. The algorithm transforms the data clustering problem into a graph partitioning problem, which helps resolve load imbalance in the parallel assembly stage. In addition, the method improves clustering quality by adding mate-pair information to the read-relation graph, so that the graph reflects the relationships between reads more accurately. Experiments on both simulated and real genome data sets show that the method produces high-quality clustered data and significantly outperforms the traditional method.
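
The idea of recasting read clustering as balanced graph partitioning can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names (`build_read_graph`, `greedy_partition`, `cut_weight`), the use of overlap length as edge weight, the fixed `mate_weight` bonus, and the simple greedy heuristic are all assumptions made for illustration. A production partitioner would typically be a multilevel method rather than this greedy pass.

```python
from collections import defaultdict


def build_read_graph(overlaps, mate_pairs, mate_weight=1.0):
    """Build an undirected weighted read-relation graph.

    overlaps:   iterable of (read_a, read_b, overlap_length)
    mate_pairs: iterable of (read_a, read_b)
    The weighting scheme is an assumption, not taken from the paper.
    """
    graph = defaultdict(dict)
    for a, b, length in overlaps:
        graph[a][b] = graph[a].get(b, 0.0) + length
        graph[b][a] = graph[b].get(a, 0.0) + length
    for a, b in mate_pairs:
        graph[a][b] = graph[a].get(b, 0.0) + mate_weight
        graph[b][a] = graph[b].get(a, 0.0) + mate_weight
    return graph


def greedy_partition(graph, read_lengths, k, imbalance=0.1):
    """Assign each read to one of k parts.

    Each read goes to the part it is most strongly connected to, as long
    as that part stays within the balance limit; ties go to the lighter
    part. Load is measured as total read length (an assumption).
    """
    capacity = sum(read_lengths.values()) * (1.0 + imbalance) / k
    loads = [0.0] * k
    assignment = {}
    # Longest reads first; names break ties so the result is deterministic.
    for read in sorted(read_lengths, key=lambda r: (-read_lengths[r], r)):
        connectivity = [0.0] * k
        for neighbor, weight in graph.get(read, {}).items():
            if neighbor in assignment:
                connectivity[assignment[neighbor]] += weight
        feasible = [p for p in range(k)
                    if loads[p] + read_lengths[read] <= capacity]
        if feasible:
            part = max(feasible, key=lambda p: (connectivity[p], -loads[p], -p))
        else:
            # No part has room: fall back to the least-loaded part.
            part = min(range(k), key=lambda p: loads[p])
        assignment[read] = part
        loads[part] += read_lengths[read]
    return assignment, loads


def cut_weight(graph, assignment):
    """Total weight of edges whose endpoints lie in different parts."""
    total = 0.0
    for a, neighbors in graph.items():
        for b, weight in neighbors.items():
            if a < b and assignment[a] != assignment[b]:
                total += weight
    return total
```

In this sketch, a low cut weight means few related reads are separated, and the balance limit keeps the total read length per part, a rough proxy for per-node assembly work, close to even.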