Construction and Application of Evo2-Based Fine-Tuned Genomic Sequence Generation Models for Major Staple Crops

何曼曼; 刘龙正; 丁开锋; 葛明亮; 张雨琪; 任澳雨; 郭伟龙; 王耀君

doi:10.7544/issn1000-1239.202660003


Abstract

To build genomic functional-element sequence generation models suited to major staple crops and to evaluate their cross-species generalization, this study fine-tunes the Evo2 genomic language model on several staple crops and benchmarks its generation performance. Six crop genomes are selected: potato, peanut, two indica rice cultivars (MH63 and ZS97), japonica rice, and wheat. A unified conditional generation and evaluation framework is established for six core genomic functional elements: coding sequences (CDS), exons, introns, mRNA, 5′ untranslated regions (5′ UTR), and 3′ untranslated regions (3′ UTR). Within this framework, Evo2 models are fine-tuned on crop-specific genomic data, and the pretrained Evo2 model without fine-tuning is retained as a zero-shot baseline. Fine-tuning consistently improves generation performance over the zero-shot baseline across all functional elements and crop species: average sequence similarity rises by about 0.49 percentage points in absolute terms, a relative gain of approximately 0.79%, indicating that fine-tuning effectively strengthens the model's adaptation to the sequence characteristics of crop genomes. Cross-species evaluation further shows that the fine-tuned models generalize well to unseen crop species, with a cross-species generalization gap below 1 percentage point for each of the six functional elements. Coding-related elements (CDS and mRNA) are generated most stably across species, whereas 5′ UTR is comparatively more sensitive to species differences. Supplementary bioinformatics analyses indicate that the generated sequences are reasonably plausible in aspects such as ORF continuity, but leave room for improvement in finer biological features such as codon preference, regulatory motifs in UTRs, and splice-site rules in introns. Overall, the results verify the effectiveness and cross-species applicability of Evo2-based fine-tuned genomic language models for multi-element sequence generation in major staple crops, providing a potential technical basis for genome-assisted breeding and crop improvement.
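The relationship between the reported absolute gain (in percentage points), the relative gain (in percent), and the cross-species generalization gap can be made concrete with a small sketch. The function names below are illustrative and not taken from the paper, and the sketch assumes that the relative gain is the absolute gain divided by the baseline mean similarity. Under that assumption the two reported figures imply a zero-shot baseline of roughly 62% average sequence similarity. This is a derived value, not one stated in the abstract, and rounding of the reported figures could shift it by about a point.

```python
def absolute_gain_pp(baseline_pct: float, finetuned_pct: float) -> float:
    """Absolute gain in percentage points (fine-tuned minus baseline)."""
    return finetuned_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, finetuned_pct: float) -> float:
    """Relative gain in percent, measured against the baseline score."""
    return 100.0 * (finetuned_pct - baseline_pct) / baseline_pct


def implied_baseline_pct(abs_gain_pp: float, rel_gain_pct: float) -> float:
    """Baseline score implied by a reported absolute and relative gain.

    Assumes rel_gain = 100 * abs_gain / baseline (an assumption of this
    sketch, not a definition confirmed by the source).
    """
    return 100.0 * abs_gain_pp / rel_gain_pct


def generalization_gap_pp(in_species_pct: float, cross_species_pct: float) -> float:
    """Gap in percentage points between in-species and cross-species scores."""
    return in_species_pct - cross_species_pct


# Figures reported in the abstract.
REPORTED_ABS_GAIN_PP = 0.49
REPORTED_REL_GAIN_PCT = 0.79

# Derived under the stated assumption; not a number given by the paper.
baseline = implied_baseline_pct(REPORTED_ABS_GAIN_PP, REPORTED_REL_GAIN_PCT)
finetuned = baseline + REPORTED_ABS_GAIN_PP

print(f"implied baseline similarity: {baseline:.2f}%")
print(f"implied fine-tuned similarity: {finetuned:.2f}%")
print(f"relative gain check: {relative_gain_pct(baseline, finetuned):.2f}%")
```

The distinction matters when reading the results: a gain of 0.49 percentage points and a gain of 0.79% describe the same improvement, expressed once on the similarity scale itself and once relative to the baseline.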
