ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (1): 93-101. doi: 10.7544/issn1000-1239.2018.20160508

• Artificial Intelligence •

一种只利用序列信息预测RNA结合蛋白的深度学习模型

李洪顺,于华,宫秀军   

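  1. (School of Computer Science and Technology, Tianjin University, Tianjin 300072) (Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300072) (hongshunlee@foxmail.com)
  • Published: 2018-01-01
  • Supported by: 
    National Natural Science Foundation of China (61930007); National High Technology Research and Development Program of China (863 Program) (2015BA3005); National Basic Research Program of China (973 Program) (2013CB32930X)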

A Deep Learning Model for Predicting RNA-Binding Proteins Only from Primary Sequences

Li Hongshun, Yu Hua, Gong Xiujun   

  1. (School of Computer Science and Technology, Tianjin University, Tianjin 300072) (Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300072)
  • Online: 2018-01-01

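Abstract: RNA-binding proteins play a very important role in many biological functions, including alternative splicing, RNA editing, and methylation, and predicting the functions of these proteins from amino acid sequences has become one of the major challenges in the functional annotation of genomes. Traditional prediction methods usually extract the physicochemical properties of amino acids from the sequence as initial features and ignore motifs and the positional information between motifs; in addition, because the training data are small in scale and noisy, prediction accuracy and reliability are reduced. This paper proposes a deep learning model that predicts RNA-binding proteins from sequences. The model uses a 2-stage convolutional neural network to detect the functional domains of a protein sequence, and a long short-term memory network to obtain a fixed-length feature representation of the sequence and to learn the long- and short-term dependencies between functional domains. All features used by the prediction algorithm are obtained automatically through learning, which avoids the excessive manual intervention of feature selection in traditional machine learning. Experimental results show that the model has a clear advantage in processing large-scale sequence data.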

Key words: RNA-binding proteins, convolutional neural network, long short-term memory neural network, feature learning, deep learning

Abstract: RNA-binding proteins (RNA-BPs) play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions. Predicting the functions of these proteins from their primary amino acid sequences has become one of the major challenges in the functional annotation of genomes. Traditional prediction methods typically extract the physicochemical properties of amino acids from sequences as initial features, but ignore motifs and the positional information between motifs. Meanwhile, small and noisy training sets reduce the accuracy and reliability of predictions. In this paper, we propose a deep learning model that predicts RNA-binding proteins from primary sequences. The model uses a two-stage convolutional neural network (CNN) to detect the functional domains of protein sequences, and a long short-term memory neural network (LSTM) to obtain a fixed-length feature representation of each sequence and to learn the long- and short-term dependencies between functional domains. Because all features are learned automatically, the model avoids the heavy manual intervention that feature selection requires in traditional machine learning. Experimental results show that the model has a clear advantage in processing large-scale sequence data.

Key words: RNA-binding proteins, convolutional neural network (CNN), long short-term memory neural network (LSTM), feature learning, deep learning
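
The pipeline summarized in the abstract (two convolution stages over the sequence, followed by an LSTM that yields a fixed-length representation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the one-hot input encoding, the filter counts and width, the pooling size, the hidden size, the bias-free LSTM, and the random untrained weights are all assumptions made for illustration, and names such as `embed` are hypothetical. A real model would be trained and would add a classification layer on top of the returned vector.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in seq:
        v = [0.0] * len(AMINO_ACIDS)
        if aa in index:  # unknown residues (e.g. 'X') stay all-zero
            v[index[aa]] = 1.0
        vectors.append(v)
    return vectors


def conv1d_relu(x, filters):
    """Valid 1D convolution along the sequence, followed by ReLU.

    x: list of position vectors; filters: list of filters, each a list of
    `width` weight vectors. Returns len(x) - width + 1 output vectors.
    """
    width = len(filters[0])
    out = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        row = []
        for f in filters:
            s = sum(w * v for fw, xv in zip(f, window) for w, v in zip(fw, xv))
            row.append(max(0.0, s))
        out.append(row)
    return out


def max_pool(x, size):
    """Non-overlapping max pooling along the position axis."""
    pooled = []
    for start in range(0, len(x) - size + 1, size):
        block = x[start:start + size]
        pooled.append([max(col) for col in zip(*block)])
    return pooled


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_last_hidden(x, weights, hidden):
    """Run a single-layer, bias-free LSTM and return its final hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for xt in x:
        z = xt + h  # concatenate current input with previous hidden state
        gates = {name: [sum(w * v for w, v in zip(row, z)) for row in weights[name]]
                 for name in "ifog"}
        for j in range(hidden):
            i = sigmoid(gates["i"][j])    # input gate
            f = sigmoid(gates["f"][j])    # forget gate
            o = sigmoid(gates["o"][j])    # output gate
            g = math.tanh(gates["g"][j])  # candidate cell value
            c[j] = f * c[j] + i * g
            h[j] = o * math.tanh(c[j])
    return h


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(seq, seed=0, n_filters=(8, 8), width=3, pool=2, hidden=4):
    """Map a sequence of any length to a fixed-length vector (untrained weights)."""
    rng = random.Random(seed)
    x = one_hot(seq)
    channels = len(AMINO_ACIDS)
    for k in n_filters:  # two convolution + pooling stages
        filters = [random_matrix(width, channels, rng) for _ in range(k)]
        x = max_pool(conv1d_relu(x, filters), pool)
        channels = k
    weights = {name: random_matrix(hidden, channels + hidden, rng) for name in "ifog"}
    return lstm_last_hidden(x, weights, hidden)
```

Calling `embed` on two sequences of different lengths returns vectors of the same length (`hidden`), which is the property that lets a downstream classifier accept variable-length proteins. With the assumed settings, sequences shorter than about 10 residues leave nothing for the LSTM to read and produce an all-zero vector.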
