ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2021, Vol. 58, Issue 8: 1642-1654. doi: 10.7544/issn1000-1239.2021.20210287

Topic collection: 2021 Frontier Advances in Artificial Intelligence

• Artificial Intelligence •

Gene Sequence Representation Learning Based on Virus Transmission Network

Ma Yang, Liu Zeyi, Liang Xingxing, Cheng Guangquan, Yang Fangjie, Cheng Qing, Liu Zhong

  1. (College of System Engineering, National University of Defense Technology, Changsha 410073) (yang_ma_cn@163.com)
  • Published: 2021-08-01
  • Online: 2021-08-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (62073333) and the Graduate Research and Innovation Project of Hunan Province (CX20200069).

Abstract: Gene sequence data often contain large amounts of non-coding and missing sequence. Most existing gene sequence representation methods extract features from high-dimensional gene sequences by manual processes, which are time-consuming, and the success of prediction depends heavily on the correct use of biological background knowledge. In this work, we construct a gene sequence representation method based on graph context information in a virus transmission network. After encoding the virus sequence of the target node, we use an attention mechanism to aggregate the sequence information of its neighbor nodes, which yields a new low-dimensional representation of the target node's virus sequence. The representation model is then optimized using the property that, in a virus transmission network, the gene sequences of adjacent nodes are more similar than those of non-adjacent nodes. After training, the new representation not only captures the features of the gene sequence effectively, but also greatly reduces its dimension and improves computational efficiency. We train the gene sequence representation model separately on a simulated virus transmission network, a SARS-CoV-2 transmission network, and an HIV transmission network, and then carry out the task of discovering unsampled infected individuals on each network. The experimental results demonstrate the effectiveness of the model, and it performs better than the other methods compared. Its ability to find unsampled infected individuals in a virus transmission network is also of practical significance for epidemiological investigation.

Key words: complex networks, gene representation, machine learning, graph neural networks, virus transmission
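
The pipeline summarized in the abstract (encode the target node's sequence, aggregate neighbor information with attention, and prefer higher similarity between adjacent nodes than between non-adjacent ones) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the one-hot encoding, the random linear projection, the dot-product attention, the cosine similarity, the margin loss, and every name below are assumptions chosen only to keep the example small.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: missing or unknown bases (e.g. "N") become all zeros


def one_hot(seq):
    """Flatten a nucleotide string into a 4 * len(seq) one-hot vector."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == a else 0.0 for a in ALPHABET)
    return vec


def project(vec, weights):
    """Linear encoder (assumed): map a high-dimensional vector to a low-dimensional one."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(target, neighbors):
    """Add the attention-weighted mean of the neighbor encodings to the target encoding."""
    if not neighbors:
        return list(target)
    attn = softmax([dot(target, n) for n in neighbors])
    context = [sum(a * n[i] for a, n in zip(attn, neighbors))
               for i in range(len(target))]
    return [t + c for t, c in zip(target, context)]


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def margin_loss(anchor, neighbor, non_neighbor, margin=0.5):
    """Zero once the neighbor beats the non-neighbor in similarity by at least `margin`."""
    return max(0.0, margin - cosine(anchor, neighbor) + cosine(anchor, non_neighbor))


# Toy transmission network (hypothetical data): node -> sequence, plus adjacency lists.
sequences = {"a": "ACGTAC", "b": "ACGTAA", "c": "TTGCAT"}
edges = {"a": ["b"], "b": ["a"], "c": []}

rng = random.Random(0)
dim = 3  # target low dimension
weights = [[rng.uniform(-1, 1) for _ in range(4 * 6)] for _ in range(dim)]

encoded = {k: project(one_hot(s), weights) for k, s in sequences.items()}
embedded = {k: aggregate(encoded[k], [encoded[n] for n in edges[k]])
            for k in sequences}
loss = margin_loss(embedded["a"], embedded["b"], embedded["c"])
```

In this sketch the projection weights are random and never updated. A trained model would adjust its parameters to reduce such a loss over many (node, neighbor, non-neighbor) triples.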
