ISSN 1000-1239 CN 11-1777/TP

• Artificial Intelligence •

### Gene Sequence Representation Learning Based on Virus Transmission Network

Ma Yang, Liu Zeyi, Liang Xingxing, Cheng Guangquan, Yang Fangjie, Cheng Qing, Liu Zhong

1. (College of System Engineering, National University of Defense Technology, Changsha 410073) (yang_ma_cn@163.com)
• Published online: 2021-08-01
• Supported by:
This work was supported by the National Natural Science Foundation of China (62073333) and the Graduate Research and Innovation Project of Hunan Province (CX20200069).

Abstract: Gene sequence data obtained in practice typically contain non-coding regions and missing segments. Most existing gene sequence representation methods extract features from high-dimensional sequences through manual processing, which is usually computationally expensive, and their prediction accuracy depends heavily on how well biological background knowledge is exploited. In this work, we propose a gene sequence representation method based on graph context information in a virus transmission network. After encoding the target node's virus sequence, we use an attention mechanism to aggregate the gene sequence information of its neighbor nodes, which yields a new representation of the target node's gene sequence. The representation model is optimized on the premise that the gene sequences of neighbor nodes are more similar to each other than those of non-neighbor nodes. Once trained, the new representation not only captures the features of the sequence accurately but also greatly reduces its dimensionality and improves computational efficiency. We train the model separately on a simulated transmission network, a SARS-CoV-2 transmission network, and an HIV transmission network, and then predict the un-sampled infections in each network. The experimental results demonstrate the effectiveness of our model, which outperforms the other models compared. Its success in predicting un-sampled infections in virus transmission networks is also of practical significance for epidemiological investigation.
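The pipeline outlined in the abstract (encode a node's sequence, aggregate neighbor information with attention, and train so that neighbors are more similar than non-neighbors) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify these details: the nucleotide-frequency `encode` function stands in for the learned encoder, dot-product attention with an equal-weight mix stands in for the paper's attention mechanism, and `margin_loss` is one common way to express the neighbor versus non-neighbor similarity constraint.

```python
import math

NUCLEOTIDES = "ACGT"


def encode(seq):
    """Toy encoder (assumption): nucleotide frequencies.
    Symbols outside ACGT, such as N or gaps, are ignored."""
    counts = [seq.count(n) for n in NUCLEOTIDES]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate(target, neighbors):
    """Dot-product attention over neighbor encodings, mixed
    equally with the target's own encoding (assumed mixing rule)."""
    if not neighbors:
        return list(target)
    weights = softmax([dot(target, n) for n in neighbors])
    context = [
        sum(w * n[i] for w, n in zip(weights, neighbors))
        for i in range(len(target))
    ]
    return [0.5 * t + 0.5 * c for t, c in zip(target, context)]


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def margin_loss(anchor, neighbor, non_neighbor, margin=0.2):
    """Hinge loss that is zero once the neighbor is more similar
    to the anchor than the non-neighbor by at least `margin`."""
    return max(0.0, margin - cosine(anchor, neighbor)
               + cosine(anchor, non_neighbor))


# Hypothetical three-node network: node "a" is linked to "b" and "c".
sequences = {"a": "AACGTN", "b": "AACGTT", "c": "ACCGTA"}
encoded = {node: encode(seq) for node, seq in sequences.items()}
rep_a = aggregate(encoded["a"], [encoded["b"], encoded["c"]])
```

In a trained model the encoder and attention parameters would be learned by minimizing such a loss over sampled neighbor and non-neighbor pairs; the fixed functions above only show how the pieces fit together.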