Non-Contiguous Code Refactoring: A Hybrid Approach of Static Analysis and Large Language Model
Abstract
The widespread adoption of large language models (LLMs) in software engineering has made automated code refactoring, which leverages their powerful code comprehension and generation capabilities, a crucial direction for enhancing software quality and development efficiency. However, when refactoring non-contiguous code clones, which arise from statement interleaving, reordering, and similar transformations, LLMs face three core challenges: dispersed semantic context, difficulty in capturing critical dependencies, and susceptibility to "hallucination" errors. To address these challenges, we propose a novel method for non-contiguous code clone refactoring that integrates static analysis with an LLM. Our method first identifies non-contiguous clones efficiently and accurately by combining program slicing with an algebraic classifier. Next, a context-aware refactoring opportunity identification algorithm determines the optimal refactoring targets for the LLM. Finally, a chain-of-thought few-shot prompting strategy guides the LLM to generate high-quality "extract method" refactoring suggestions, and a verification mechanism inspired by metamorphic relations validates the semantic and structural consistency of the generated results. Experiments on the open-source datasets Google Code Jam and BigCloneBench demonstrate that our refactoring method reduces cloned code by 66% to 71% in real-world projects such as JUnit. Furthermore, our detection method achieves an F1-score 2% to 18% higher than existing mainstream tools. On the Community Corpus-A refactoring opportunity identification benchmark, it reaches an F1-score of 0.415, surpassing the state-of-the-art tool GEMS by 7.5% and thereby enhancing software quality.