Abstract:
Reliable identification of protein families is a major challenge in bioinformatics. Clustering protein sequences can help reveal novel relationships among proteins, but many clustering algorithms cannot be readily applied to protein sequences, largely because the similarity between two sequences is difficult to define. Similarity measures based on traditional sequence alignment assume that contiguity between homologous segments is conserved, an assumption that is inconsistent with genetic recombination. Such measures are particularly unreliable for remotely homologous family members, which share similar structure or related function. Motif information, by contrast, is highly informative for the analysis of protein family sequences. This paper proposes ProFaM, a novel two-step algorithm for mining protein sequence families. In the first step, motifs conserved across the protein sequences are mined with an efficient prefix-projection strategy that requires no candidate generation, and a novel similarity measure is constructed from the resulting motifs and their weights. In the second step, the sequences are clustered with a shared nearest neighbor method under the new similarity measure. Experiments on the Pfam protein family database show that ProFaM improves clustering performance, suggesting that it can be applied to protein family analysis.
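
To make the two-step structure concrete, the following is a minimal Python sketch of a pipeline of this kind; it is not the ProFaM implementation. Several details are assumptions made only for illustration: motifs are taken to be gapped subsequences mined in a PrefixSpan-like manner, the motif weight is taken to be the motif length (the abstract does not specify the weighting), the similarity is a weighted Jaccard ratio, and the clustering step is a Jarvis-Patrick style shared nearest neighbor rule. The function names `mine_motifs`, `motif_similarity`, and `snn_clusters`, and all parameter values, are hypothetical.

```python
from collections import defaultdict


def mine_motifs(seqs, min_support, min_len=2, max_len=4):
    """PrefixSpan-style mining of frequent subsequences (gaps allowed).

    Each pattern is extended by scanning only its projected database,
    so no candidate set is generated and tested.
    Returns {motif: set of indices of the sequences containing it}.
    """
    motifs = {}

    def grow(prefix, projected):
        # projected: (sequence index, position just after the prefix match)
        occurrences = defaultdict(dict)
        for sid, start in projected:
            seq = seqs[sid]
            for pos in range(start, len(seq)):
                # keep only the earliest occurrence of each residue
                occurrences[seq[pos]].setdefault(sid, pos + 1)
        for residue, hits in occurrences.items():
            if len(hits) < min_support:
                continue
            pattern = prefix + residue
            if len(pattern) >= min_len:
                motifs[pattern] = set(hits)
            if len(pattern) < max_len:
                grow(pattern, list(hits.items()))

    grow("", [(i, 0) for i in range(len(seqs))])
    return motifs


def motif_similarity(motifs, n_seqs):
    """Weighted Jaccard similarity over the motifs each sequence contains.

    Assumption: the weight of a motif is its length.
    """
    contains = [set() for _ in range(n_seqs)]
    for motif, sids in motifs.items():
        for sid in sids:
            contains[sid].add(motif)

    def sim(i, j):
        shared = sum(len(m) for m in contains[i] & contains[j])
        total = sum(len(m) for m in contains[i] | contains[j])
        return shared / total if total else 0.0

    return [[sim(i, j) for j in range(n_seqs)] for i in range(n_seqs)]


def snn_clusters(sim, k, min_shared):
    """Jarvis-Patrick style shared nearest neighbor clustering.

    Two sequences are linked when each is among the other's k nearest
    neighbors and their neighbor lists share at least min_shared members.
    Clusters are the connected components of the resulting links.
    """
    n = len(sim)
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: -sim[i][j])
        knn.append(set(others[:k]))

    parent = list(range(n))  # union-find forest over sequence indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in knn[i]:
            if i in knn[j] and len(knn[i] & knn[j]) >= min_shared:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Toy input: two made-up "families" that share no residues.
toy_seqs = ["ACDEFG", "ACDEFH", "AACDEF", "KLMNPQ", "KLMNPR", "WKLMNP"]
toy_motifs = mine_motifs(toy_seqs, min_support=2)
toy_sim = motif_similarity(toy_motifs, len(toy_seqs))
print(snn_clusters(toy_sim, k=2, min_shared=1))
```

On the toy input, sequences that share many motifs end up in each other's neighbor lists and receive the same cluster label. Real protein data would call for tuned values of `min_support`, `k`, and `min_shared`, and for a motif weighting chosen to reflect how informative each motif is.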