Abstract:
This paper presents a graph-partition-based data clustering method for parallel whole-genome sequence assembly. The algorithm recasts the data clustering problem as a graph partitioning problem, which helps to resolve load imbalance in the parallel assembly stage. In addition, the method improves clustering quality by adding mate-pair information to the read-relation graph, so that the graph represents the relationships between reads more accurately. Experiments on both artificial and real genome data sets show that the method produces high-quality clustered data and significantly outperforms the traditional method.
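
To make the idea concrete, the following is a minimal sketch of how a read-relation graph with mate-pair edges might be built and split into balanced clusters. It is an illustration under stated assumptions, not the algorithm evaluated in this paper: the function names (`overlap_len`, `build_read_graph`, `balanced_partition`), the exact suffix-prefix overlap test, the `mate_weight` value, and the greedy partitioning heuristic are all hypothetical choices made for brevity.

```python
from collections import defaultdict


def overlap_len(a, b, min_len=3):
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_read_graph(reads, mate_pairs, min_len=3, mate_weight=5):
    """Undirected weighted graph: overlap edges plus mate-pair edges.

    mate_weight is an assumed constant, chosen only for this illustration.
    """
    graph = defaultdict(dict)
    for i in range(len(reads)):
        for j in range(len(reads)):
            if i == j:
                continue
            w = overlap_len(reads[i], reads[j], min_len)
            if w:
                # Keep the stronger of the two directed overlaps.
                graph[i][j] = graph[j][i] = max(w, graph[i].get(j, 0))
    for i, j in mate_pairs:
        # Mate-pair links reinforce (or create) an edge between the two reads.
        graph[i][j] = graph[j][i] = graph[i].get(j, 0) + mate_weight
    return graph


def balanced_partition(graph, n, k):
    """Greedy heuristic: grow k clusters of at most ceil(n / k) reads each.

    Each step adds the unassigned read most strongly connected to the
    cluster being grown, so heavy edges tend to stay inside a cluster.
    """
    cap = -(-n // k)  # ceiling division
    assignment = {}
    for part in range(k):
        members = []
        while len(members) < cap and len(assignment) < n:
            unassigned = [v for v in range(n) if v not in assignment]
            best = max(
                unassigned,
                key=lambda v: sum(graph.get(v, {}).get(m, 0) for m in members),
            )
            assignment[best] = part
            members.append(best)
    return assignment


# Toy data: reads 0/2 and 1/3 overlap; reads 4 and 5 are tied in only by mates.
reads = ["AAACCC", "TTTGAG", "CCCGGG", "GAGTCT", "GTGTGT", "CACACA"]
mates = [(0, 4), (1, 5)]
graph = build_read_graph(reads, mates)
parts = balanced_partition(graph, len(reads), k=2)
```

In this toy input, reads 4 and 5 share no sequence overlap with any other read, so only the mate-pair edges place them in the same cluster as their partners. The capacity bound keeps the two clusters the same size, which is the load-balancing property the partitioning view is meant to provide.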