    Construction and Application of a Fine-Tuned Evo2-Based Model for Genomic Sequence Generation in Major Crops

    Abstract: To build sequence generation models for genomic functional elements in major staple crops and to evaluate their cross-species generalization, this study fine-tunes the Evo2 genomic language model on crop data and assesses its conditional generation performance. Six crop genomes were selected: potato, peanut, the indica rice cultivars MH63 and ZS97, japonica rice, and wheat. A unified conditional generation and evaluation framework was established for six core genomic functional elements: coding sequences (CDS), exons, introns, mRNA, 5' untranslated regions (5' UTR), and 3' untranslated regions (3' UTR). Within this framework, Evo2 models were fine-tuned on crop-specific genomic data, and the pretrained Evo2 model without fine-tuning served as the zero-shot baseline. Fine-tuning consistently improved generation performance across all functional elements and crop species, with an average sequence-similarity gain of approximately 0.49% over the zero-shot baseline, indicating that fine-tuning helps the model adapt to crop genomic sequence features. Cross-species evaluation further shows that the fine-tuned models generalize well to unseen crop species: performance degradation remains below 1% for all six functional elements. Coding-related elements (CDS and mRNA) are generated most stably across species, whereas 5' UTR is more sensitive to species differences. Additional bioinformatics analyses indicate that the generated sequences are biologically plausible to some degree in terms of ORF continuity, but fall short on finer features such as codon preference, regulatory motifs in UTRs, and splice-site rules in introns. Overall, the results support the effectiveness and cross-species applicability of Evo2-based fine-tuned genomic language models for functional-element sequence generation in major crops, providing a potential technical basis for genome-assisted breeding and crop improvement.
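The abstract does not say how sequence similarity, ORF continuity, or splice-site conformity were computed. The sketch below shows one way such checks could look. Everything in it is an assumption made for illustration: the function names are hypothetical, the similarity score uses Python's `difflib` matching ratio rather than an alignment-based identity, the ORF check assumes the standard genetic code, and the intron check tests only the canonical GT...AG boundary dinucleotides.

```python
from difflib import SequenceMatcher

# Stop codons of the standard genetic code (assumed; the abstract names no code table).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_similarity(generated: str, reference: str) -> float:
    """Return a similarity score in [0, 1] between two nucleotide strings.

    Uses difflib's matching-block ratio as a stand-in metric; the paper's
    actual similarity measure is not specified in the abstract.
    """
    matcher = SequenceMatcher(None, generated.upper(), reference.upper(), autojunk=False)
    return matcher.ratio()


def has_continuous_orf(cds: str) -> bool:
    """Check that a CDS starts with ATG, ends with a stop codon,
    has a length divisible by 3, and contains no in-frame internal stop."""
    seq = cds.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    starts_ok = codons[0] == "ATG"
    ends_ok = codons[-1] in STOP_CODONS
    no_internal_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return starts_ok and ends_ok and no_internal_stop


def follows_gt_ag_rule(intron: str) -> bool:
    """Check the canonical GT...AG boundary dinucleotides of an intron."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")
```

Under these assumptions, a generated CDS would be compared with its reference through `sequence_similarity`, while `has_continuous_orf` and `follows_gt_ag_rule` give simple pass/fail plausibility checks per sequence. Codon preference and UTR motif analyses would need species-specific reference tables and are not sketched here.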

       
