Gene Sequence Representation Learning Based on Virus Transmission Network

Ma Yang; Liu Zeyi; Liang Xingxing; Cheng Guangquan; Yang Fangjie; Cheng Qing; Liu Zhong

doi:10.7544/issn1000-1239.2021.20210287

Abstract

Gene sequence data often contain large numbers of non-coding and missing segments. Most existing gene sequence representations rely on manual feature extraction from high-dimensional sequences, which is time-consuming, and prediction accuracy depends heavily on the correct use of biological background knowledge. In this work, we construct a gene sequence representation method based on graph context information in a virus transmission network. After encoding the virus sequence of a target node, we use an attention mechanism to aggregate the sequence information of its neighbor nodes, obtaining a new low-dimensional representation of the target node's virus sequence. The representation model is then optimized based on the property that, in a virus transmission network, the gene sequences of adjacent nodes are more similar than those of non-adjacent nodes. After training, the new representation not only expresses the features of the gene sequence effectively but also greatly reduces its dimension, which improves computational efficiency. We train the model on a simulated virus transmission network and on SARS-CoV-2 and HIV transmission network data, and then perform the task of discovering unsampled infected individuals on each network. The experimental results demonstrate the effectiveness of the model, and comparisons with other methods show its efficiency. The model can effectively discover unsampled infected individuals in virus transmission networks, which is of practical significance for epidemiological investigation.
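The pipeline described in the abstract (encode a node's sequence, aggregate neighbor information with attention, and prefer adjacent pairs over non-adjacent pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' model: the base-frequency `encode` stands in for a learned encoder, the dot-product attention and the 0.5 mixing weight in `aggregate` are arbitrary choices, `margin_loss` is one common way to express the "adjacent nodes are more similar" objective, and the toy sequences and edges are made up.

```python
import math

ALPHABET = "ACGT"

def encode(seq):
    """Hypothetical encoder: base-frequency vector; missing bases (e.g. N, -) are skipped."""
    counts = [sum(1 for ch in seq if ch == base) for base in ALPHABET]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(ALPHABET)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(target, neighbors):
    """Mix the target vector with an attention-weighted average of its neighbors."""
    if not neighbors:
        return list(target)
    weights = softmax([dot(target, n) for n in neighbors])
    context = [sum(w * n[i] for w, n in zip(weights, neighbors))
               for i in range(len(target))]
    return [0.5 * t + 0.5 * c for t, c in zip(target, context)]

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def margin_loss(anchor, positive, negative, margin=0.1):
    """Zero when the adjacent pair is more similar than the non-adjacent pair by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy transmission network (made-up data): a-b and a-c are linked, d is isolated.
sequences = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "ACGNACGTAC", "d": "TTTTGGGGTT"}
edges = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}

encoded = {node: encode(seq) for node, seq in sequences.items()}
reps = {node: aggregate(encoded[node], [encoded[n] for n in edges[node]])
        for node in sequences}
loss = margin_loss(reps["a"], reps["b"], reps["d"])
```

In a trained model the encoder and attention parameters would be updated to reduce this kind of loss over many sampled adjacent and non-adjacent pairs; the sketch only evaluates it once on fixed vectors.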
