ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (8): 1642-1654. doi: 10.7544/issn1000-1239.2021.20210287

Special Issue: 2021 Special Topic on Frontier Advances in Artificial Intelligence


Gene Sequence Representation Learning Based on Virus Transmission Network

Ma Yang, Liu Zeyi, Liang Xingxing, Cheng Guangquan, Yang Fangjie, Cheng Qing, Liu Zhong   

  1. (College of System Engineering, National University of Defense Technology, Changsha 410073)
  • Online: 2021-08-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (62073333) and the Graduate Research and Innovation Project of Hunan Province (CX20200069).

Abstract: Gene sequence data obtained in practice often contain non-coding regions and missing segments. Existing gene sequence representation methods mostly extract features from high-dimensional gene sequences by hand, which is usually computationally expensive, and their prediction accuracy depends heavily on how well biological background knowledge is exploited. In this work, we propose a gene sequence representation method based on graph context information in a virus transmission network. After encoding the target node's virus sequence, we use an attention mechanism to aggregate the gene sequence information of its neighbor nodes, obtaining a new representation of the target node's gene sequence. The model is optimized on the premise that the gene sequences of neighboring nodes are more similar than those of non-neighboring nodes. Once trained, the new representation not only captures sequence features accurately but also greatly reduces the dimension of the gene sequence and improves computational efficiency. We train the model separately on a simulated transmission network, a SARS-CoV-2 transmission network, and an HIV transmission network, and then predict the unsampled infections in each network. The experimental results show that our model is effective and outperforms the other models compared. Its success in predicting unsampled infections in virus transmission networks is also of practical value for epidemiological investigation.
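
The pipeline summarized in the abstract (encode a node's sequence, aggregate neighbor information with attention, and train so that neighbors are more similar than non-neighbors) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the nucleotide-frequency encoder, the dot-product attention scores, the equal mixing of target and context, the cosine similarity, and the hinge-style margin loss with `margin=0.1`. The function names `encode`, `aggregate`, and `margin_loss` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def encode(seq):
    """Toy encoder (assumption): nucleotide frequencies.
    Missing or ambiguous symbols such as 'N' or '-' are simply skipped."""
    counts = [seq.count(base) for base in ALPHABET]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(target, neighbors):
    """Attention-weighted average of neighbor encodings,
    mixed equally with the target's own encoding (assumption)."""
    if not neighbors:
        return list(target)
    weights = softmax([dot(target, n) for n in neighbors])
    context = [
        sum(w * n[i] for w, n in zip(weights, neighbors))
        for i in range(len(target))
    ]
    return [0.5 * t + 0.5 * c for t, c in zip(target, context)]

def margin_loss(anchor, neighbor, non_neighbor, margin=0.1):
    """Hinge-style loss (assumption): zero once the neighbor is more
    similar to the anchor than the non-neighbor by at least `margin`."""
    return max(0.0, margin - cosine(anchor, neighbor) + cosine(anchor, non_neighbor))

# Usage on made-up sequences; the target contains two missing bases ('N').
target = encode("ACGTACGTNN")
neighbors = [encode("ACGTACGTAC"), encode("ACGAACGTAC")]
outsider = encode("GGGGCCCCGG")

rep = aggregate(target, neighbors)
loss = margin_loss(rep, neighbors[0], outsider)
```

In a trained model the encoder and attention scores would carry learnable parameters and the loss would be minimized over many (node, neighbor, non-neighbor) triples sampled from the transmission network; this sketch only fixes the shape of the computation.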

Key words: complex networks, gene representation, machine learning, graph neural networks, virus transmission
