A Deep Learning Model for Predicting RNA-Binding Proteins Only from Primary Sequences

李洪顺; 于华; 宫秀军

doi:10.7544/issn1000-1239.2018.20160508

Abstract

Abstract: RNA-binding proteins (RNA-BPs) play pivotal roles in alternative splicing, RNA editing, methylation, and many other biological functions. Predicting the functions of these proteins from their primary amino acid sequences has become one of the major challenges in the functional annotation of genomes. Traditional prediction methods typically extract the physicochemical properties of amino acids from the sequence as initial features, ignoring motifs and the positional relationships between motifs. In addition, the training data are small in scale and noisy, which lowers the accuracy and reliability of predictions. In this paper, we propose a deep learning model for predicting RNA-binding proteins from primary sequences. The model uses a two-stage convolutional neural network (CNN) to detect the functional domains of protein sequences, and a long short-term memory (LSTM) network to obtain a fixed-length feature representation of each sequence and to learn the long- and short-term dependencies between functional domains. All features used by the prediction algorithm are learned automatically, which avoids the heavy manual intervention that feature selection requires in traditional machine learning. Experimental results show that the model has a clear advantage in processing large-scale sequence data.
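The pipeline described in the abstract (two convolution stages followed by an LSTM that yields a fixed-length representation) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-hot input encoding, the filter count and width, the pooling size, the hidden size, the omission of bias terms, and all function names (`one_hot`, `conv_stage`, `lstm_final_state`, `build_model`, `encode`, `predict`). The weights are random and untrained, so the output score carries no biological meaning; the sketch only shows how data flows through the stages.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode a sequence as one 20-dimensional vector per residue (assumed encoding)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:  # unknown residues stay all-zero
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conv_stage(x, kernels, pool):
    """One stage: valid 1-D convolution, ReLU, then non-overlapping max pooling."""
    width, channels = len(kernels[0]), len(x[0])
    conv = []
    for t in range(len(x) - width + 1):
        conv.append([
            max(0.0, sum(k[w][c] * x[t + w][c]
                         for w in range(width) for c in range(channels)))
            for k in kernels
        ])
    return [[max(col) for col in zip(*conv[i:i + pool])]
            for i in range(0, len(conv) - pool + 1, pool)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_final_state(x, gates, hidden):
    """Run a bias-free LSTM over x and return the last hidden state (fixed length)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in x:
        z = x_t + h  # concatenate current input with previous hidden state
        pre = {name: [sum(w * v for w, v in zip(row, z)) for row in W]
               for name, W in gates.items()}
        i = [sigmoid(v) for v in pre["i"]]    # input gate
        f = [sigmoid(v) for v in pre["f"]]    # forget gate
        o = [sigmoid(v) for v in pre["o"]]    # output gate
        g = [math.tanh(v) for v in pre["g"]]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def build_model(seed=0, filters=8, width=3, hidden=6):
    """Hypothetical layer sizes, chosen only to keep the example small."""
    rng = random.Random(seed)
    return {
        "conv1": [rand_matrix(width, len(AMINO_ACIDS), rng) for _ in range(filters)],
        "conv2": [rand_matrix(width, filters, rng) for _ in range(filters)],
        "lstm": {g: rand_matrix(hidden, filters + hidden, rng) for g in "ifog"},
        "out": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],
        "hidden": hidden,
    }


def encode(model, seq):
    """Two convolution stages, then an LSTM: variable-length input, fixed-length output."""
    x = conv_stage(one_hot(seq), model["conv1"], pool=2)
    x = conv_stage(x, model["conv2"], pool=2)
    return lstm_final_state(x, model["lstm"], model["hidden"])


def predict(model, seq):
    """Score in (0, 1) from a logistic output unit; untrained, so not meaningful."""
    h = encode(model, seq)
    return sigmoid(sum(w * v for w, v in zip(model["out"], h)))


model = build_model()
short = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
longer = short * 3
print(len(encode(model, short)), len(encode(model, longer)))
print(predict(model, short))
```

Sequences of different lengths map to vectors of the same size because only the final LSTM hidden state is kept, which is what allows a single classifier layer to follow. In practice such a model would be built and trained with a deep learning framework rather than written by hand.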
