ISSN 1000-1239 CN 11-1777/TP

• Article •

### Semi-supervised Learning of Promoter Sequences Based on EM

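1. 1 (School of Computer Science and Technology, Yantai University, Yantai, Shandong 264005) 2 (International Colleges, Qingdao University, Qingdao, Shandong 266071) (wanglh@ytu.edu.cn)
• Publication date: 2009-11-15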

### Semi-supervised Learning of Promoter Sequences Based on EM Algorithm

Wang Lihong1, Zhao Xianjia2, and Wu Shuanhu1

1. 1 (School of Computer Science and Technology, Yantai University, Yantai, Shandong 264005) 2 (International Colleges, Qingdao University, Qingdao, Shandong 266071)
• Online: 2009-11-15

Abstract: Eukaryotic promoter prediction is one of the most important problems in DNA sequence analysis. A promoter is a short subsequence located before a transcriptional start site (TSS) in a DNA sequence. Predicting the position of a promoter therefore approximately locates the TSS and helps guide biological experiments. Most proposed prediction algorithms rely on search strategies such as search by signal, search by content, or search by CpG island, and their performance is still limited by low sensitivity and high false-positive rates. Promoter classification based on Markov chains has proved effective for promoter prediction; its parameters, such as transition probabilities, are estimated from statistics on the labeled samples. In this paper, semi-supervised learning is introduced into promoter sequence analysis to improve classification accuracy by combining labeled and unlabeled sequences, and the maximum likelihood estimation formulas for the transition probabilities are derived. In simulation experiments, each long genomic sequence is truncated into short segments, which are mixed with the labeled data and classified according to the calculated probabilities. Comparison with several known prediction algorithms shows that semi-supervised learning of promoter sequences based on the EM algorithm is effective when the number of labeled samples is small, and that its F1 value is higher than that of prediction based on labeled samples alone.
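The general idea of combining labeled and unlabeled sequences through EM can be illustrated with a minimal sketch. This is not the paper's derivation, whose formulas are not reproduced in the abstract. It assumes a first-order Markov chain per class over the alphabet ACGT, add-one smoothing of the transition counts, and a model that ignores the initial-state probability. All function names (`transition_counts`, `fit`, `em_markov`, and so on) are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def transition_counts(seqs, weights):
    """Weighted dinucleotide counts; add-one smoothing is an assumption."""
    counts = {a + b: 1.0 for a, b in product(ALPHABET, repeat=2)}
    for seq, w in zip(seqs, weights):
        for a, b in zip(seq, seq[1:]):
            counts[a + b] += w
    return counts


def normalize(counts):
    """Convert counts into transition probabilities P(b | a)."""
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[a + b] for b in ALPHABET)
        for b in ALPHABET:
            probs[a + b] = counts[a + b] / row_total
    return probs


def log_likelihood(seq, probs):
    """Log-probability of the transitions in seq (initial state ignored)."""
    return sum(math.log(probs[a + b]) for a, b in zip(seq, seq[1:]))


def posterior_promoter(seq, models, prior):
    """P(promoter | seq) by Bayes' rule, computed in log space."""
    lp = math.log(prior) + log_likelihood(seq, models[1])
    ln = math.log(1.0 - prior) + log_likelihood(seq, models[0])
    m = max(lp, ln)
    return math.exp(lp - m) / (math.exp(lp - m) + math.exp(ln - m))


def fit(seqs, resp):
    """M-step: re-estimate both chains from soft class memberships."""
    pos = normalize(transition_counts(seqs, resp))
    neg = normalize(transition_counts(seqs, [1.0 - r for r in resp]))
    prior = (sum(resp) + 1.0) / (len(resp) + 2.0)  # smoothed class prior
    return {1: pos, 0: neg}, prior


def em_markov(labeled, unlabeled, n_iter=20):
    """labeled: list of (sequence, label) with label 1 = promoter, 0 = not."""
    lab_seqs = [s for s, _ in labeled]
    lab_resp = [float(y) for _, y in labeled]
    unlabeled = list(unlabeled)
    # Initialise from the labeled samples only.
    models, prior = fit(lab_seqs, lab_resp)
    for _ in range(n_iter):
        # E-step: soft labels for the unlabeled sequences.
        unl_resp = [posterior_promoter(s, models, prior) for s in unlabeled]
        # M-step: labeled memberships stay fixed at 0 or 1.
        models, prior = fit(lab_seqs + unlabeled, lab_resp + unl_resp)
    return models, prior
```

A segment would then be classified as a promoter when `posterior_promoter(segment, models, prior)` exceeds a chosen threshold such as 0.5. Keeping the labeled memberships fixed while only the unlabeled ones are re-estimated is one common design choice for semi-supervised EM; the paper may weight or initialise the two sets differently.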