Abstract:
With the rapid growth of the open source software (OSS) ecosystem, adopting OSS has become a mainstream development practice, and README files serve as a critical resource for understanding and reusing OSS projects. Although recent research explores automatic README generation and completion, existing approaches have limited cross-language applicability, neglect code structure, and are susceptible to hallucination and subjectivity. To address these challenges, this paper proposes RMancer, a dual-stage README generation framework that integrates large language models (LLMs) with code structure modeling. In the first stage, RMancer introduces a prompt-guided structured information extraction method, enhanced with static analysis to construct high-quality training data, enabling the model to accurately capture file-level functional descriptions, dependency relations, and program entry points. In the second stage, RMancer applies a topology-based sorting strategy derived from the call graph to reconstruct execution logic and build structured input contexts. It further adopts a multi-task supervision mechanism to jointly learn document structure and content generation, enhancing logical consistency and objectivity. A post-generation standardization strategy is also incorporated to ensure the formatting and factuality of the generated README files. Evaluations on 16,692 OSS projects show that RMancer consistently outperforms state-of-the-art methods in both information extraction and README generation. It achieves an average F1-score improvement of 2.34% across key fields and an average gain of 1.37% across BLEU, METEOR, and ROUGE-L. RMancer also leads on the AlignScore and G-Eval metrics, with superior objectivity and redundancy control, confirming the effectiveness of its structure-aware and multi-task optimization strategies.
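
As an illustration of the topology-based sorting idea mentioned above, the following minimal sketch orders the files of a project from a file-level call graph using Kahn's algorithm. It is not RMancer's implementation: the function name `topological_file_order`, the dictionary representation of the call graph, the caller-before-callee ordering, and the alphabetical fallback for call cycles are all assumptions made for this example.

```python
import heapq


def topological_file_order(call_graph):
    """Return files ordered so that each caller precedes the files it calls.

    `call_graph` is a hypothetical representation: it maps a file name to
    the list of files it calls.
    """
    # Collect every file that appears as a caller or a callee.
    nodes = set(call_graph)
    for callees in call_graph.values():
        nodes.update(callees)

    # Count incoming call edges per file, ignoring self-calls and duplicates.
    indegree = {n: 0 for n in nodes}
    for caller, callees in call_graph.items():
        for callee in set(callees):
            if callee != caller:
                indegree[callee] += 1

    # Kahn's algorithm; the heap makes tie-breaking deterministic (alphabetical).
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for callee in set(call_graph.get(node, ())):
            if callee == node:
                continue
            indegree[callee] -= 1
            if indegree[callee] == 0:
                heapq.heappush(ready, callee)

    # Files on call cycles never reach in-degree 0; append them alphabetically.
    # This fallback is an assumption of the sketch, not taken from the paper.
    placed = set(order)
    order.extend(sorted(n for n in nodes if n not in placed))
    return order


example_graph = {
    "main.py": ["cli.py", "core.py"],
    "cli.py": ["core.py"],
    "core.py": ["utils.py"],
}
print(topological_file_order(example_graph))
# → ['main.py', 'cli.py', 'core.py', 'utils.py']
```

In this sketch the entry point comes first and shared utilities come last, so file descriptions placed in the model's input context would follow a plausible execution order. Reversing the result would give a callee-first ordering if that suits the downstream prompt better.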