Abstract:
With the advent of the big data era, enterprises and organizations have accumulated massive amounts of data, driven by growing demands for business expansion and information. This data is typically stored in relational databases in structured or semi-structured formats. SQL, a structured query language for databases, has long been widely used for data retrieval and processing. It gives professionals an efficient way to interact with databases, making data sharing and analysis quicker and more convenient. With the rapid advancement of Large Language Models (LLMs), In-Context Learning (ICL) has shown significant promise for the Text-to-SQL task, in which LLMs use reference examples to generate accurate SQL queries. To fully harness the potential of LLMs through ICL for Text-to-SQL, we propose a new SQL generation pipeline comprising three parts. First, a schema enhancement component strengthens the LLM's focus on question-related information. Second, we leverage the inherent structure of SQL to preselect semantically similar examples, which aids the generation of initial pseudo-SQL queries. Third, a selection strategy chooses reference examples by considering both the similarity of the question and the similarity of the pseudo-SQL structure, improving the accuracy of the generated SQL queries. We implement an iterative refinement process in both selection stages, employing a boosting-like mechanism that progressively improves the selection until it stabilizes. Extensive experiments across various models and datasets demonstrate the effectiveness of our approach.
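
The example-selection loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the token-level Jaccard similarity, the keyword-masking skeleton, the weight `alpha`, the question-only first round, and the `generate_sql` callback (a stand-in for an LLM call) are all assumptions chosen for illustration.

```python
import re
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (natural-language question, reference SQL)

# Small illustrative keyword list; a real system would use a SQL parser.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "join",
    "on", "and", "or", "not", "in", "limit", "count", "sum", "avg", "min",
    "max", "distinct", "as", "desc", "asc",
}


def sql_skeleton(sql: str) -> List[str]:
    """Keep keywords and punctuation; mask identifiers and literals as '_'."""
    tokens = re.findall(r"'[^']*'|\w+|[^\w\s]", sql.lower())
    return [
        t if t in SQL_KEYWORDS or not re.match(r"[\w']", t) else "_"
        for t in tokens
    ]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set-overlap similarity (an assumed, deliberately simple measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_examples(question: str, pseudo_sql: str, pool: List[Example],
                    k: int = 2, alpha: float = 0.5) -> List[Example]:
    """Rank pool entries by a weighted mix of question and skeleton similarity."""
    q_tokens = question.lower().split()
    skeleton = sql_skeleton(pseudo_sql)

    def score(example: Example) -> float:
        ex_question, ex_sql = example
        q_sim = jaccard(q_tokens, ex_question.lower().split())
        s_sim = jaccard(skeleton, sql_skeleton(ex_sql))
        return alpha * q_sim + (1 - alpha) * s_sim

    return sorted(pool, key=score, reverse=True)[:k]


def refine(question: str, pool: List[Example],
           generate_sql: Callable[[str, List[Example]], str],
           k: int = 2, max_rounds: int = 5) -> Tuple[str, List[Example]]:
    """Alternate between generating pseudo-SQL and reselecting examples
    until the selected set stops changing (or max_rounds is reached)."""
    # Assumption: the first round has no pseudo-SQL, so rank by question only.
    examples = select_examples(question, "", pool, k, alpha=1.0)
    pseudo_sql = generate_sql(question, examples)
    for _ in range(max_rounds):
        new_examples = select_examples(question, pseudo_sql, pool, k)
        if new_examples == examples:  # selection has stabilized
            break
        examples = new_examples
        pseudo_sql = generate_sql(question, examples)
    return pseudo_sql, examples
```

In use, `generate_sql` would wrap a prompt to the chosen LLM that includes the selected examples; any function with the signature `(question, examples) -> str` works for experimentation.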