    An LLM-Based Framework for README Generation via Code-Aware Representation and Dual-Stage Optimization

    • Abstract: With the rapid growth of the open source software (OSS) ecosystem, building on OSS has become the mainstream development practice, and README files are a key resource for understanding and reusing it. However, some OSS projects have READMEs that are missing, incomplete, or poorly structured, which makes the software hard to understand and use and lowers development efficiency. Researchers have proposed a variety of methods for automatic README generation and completion, but these methods still suffer from limited cross-language applicability, neglect of code structure information, and hallucinated or subjective output. To address these challenges, this paper proposes RMancer, a dual-stage README generation framework that integrates large language models (LLMs) with code structure modeling. In the first stage, RMancer uses a prompt-guided structured information extraction method, combined with static analysis to construct high-quality training data, which improves the model's awareness of structural elements such as file-level functional summaries, dependency relations, and main program entry points. In the second stage, RMancer applies a topological sorting strategy based on the call graph to reconstruct the execution order among modules and build the context for structured document generation. It also introduces a multi-task supervision mechanism that guides the LLM to jointly learn document section structure and content generation, improving the logical consistency and objectivity of the output. Finally, RMancer applies a standardization constraint strategy to normalize the format and review the content of the generated results, ensuring that the documents are well-formed and accurate. On a test set of 16692 OSS projects, RMancer significantly outperforms existing methods on both subtasks, information extraction and README generation. In information extraction, its F1-score on the calls, entry, and description fields improves by 2.34% on average over the best baseline; in document generation, BLEU, METEOR, and ROUGE-L improve by 1.37% on average over the best baseline. RMancer also achieves the best results on the two automatic evaluation metrics AlignScore and G-Eval, and it keeps its lead on key dimensions such as content objectivity and redundancy control, further validating the effectiveness of its structure-awareness and multi-task optimization strategies.
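
    The second-stage idea of ordering modules by a topological sort of the call graph can be illustrated with a minimal Python sketch. This is not RMancer's implementation: the abstract does not specify the graph representation, the ordering direction, or how cycles are handled, so the function name `execution_order`, the file names, and the choice to place callers before callees are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(call_graph: dict[str, list[str]]) -> list[str]:
    """Order modules so that every caller appears before the modules it calls.

    `call_graph` maps a module to the modules it calls (hypothetical format).
    """
    # TopologicalSorter expects node -> predecessors, so invert the edges:
    # a callee's predecessors are its callers.
    predecessors: dict[str, set[str]] = {}
    for caller, callees in call_graph.items():
        predecessors.setdefault(caller, set())
        for callee in callees:
            predecessors.setdefault(callee, set()).add(caller)
    # static_order() raises graphlib.CycleError on recursive call cycles.
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical project: main.py is the entry point.
calls = {
    "main.py": ["cli.py", "pipeline.py"],
    "cli.py": ["config.py"],
    "pipeline.py": ["config.py", "io_utils.py"],
}
order = execution_order(calls)
print(order)
```

    In this sketch the entry point comes first and shared helpers come last, which is one plausible way to feed file-level summaries to a generator in execution order. A real call graph may contain cycles, which would need to be collapsed or broken before sorting.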

       
