Iterative Two-Stage Schema-Enhanced SQL Generation with Large Language Models

Zhang Wei (张伟); Zhou Dong'ao (周东傲); Gong Yongshun (宫永顺); Yin Yilong (尹义龙)

doi:10.7544/issn1000-1239.202550355

Abstract

With the advent of the big data era, enterprises and organizations have accumulated massive amounts of data, driven by growing business expansion and information needs. This data is typically stored in relational databases in structured or semi-structured form. SQL, a structured database query language, has long been widely used for data retrieval and processing, giving technical professionals an efficient way to interact with databases and to analyze data more quickly and conveniently. With the rapid development of large language models (LLMs), in-context learning (ICL) has shown great promise for the text-to-SQL task, because it lets LLMs generate accurate SQL queries from reference examples. To make full use of LLMs under ICL for text-to-SQL, we propose a new SQL generation pipeline with three parts. First, we propose iterative schema enhancement to strengthen the LLM's focus on information relevant to the question. Second, we use the skeleton structure of SQL to preselect semantically similar examples, which aids the generation of initial pseudo-SQL queries. Third, we design a reference example selection strategy that combines the similarities of the question and of the pseudo-SQL skeleton to improve the accuracy of the generated SQL queries. Both key stages include an iterative refinement process, a boosting-like mechanism that progressively improves example selection until it stabilizes. Extensive experiments across multiple LLMs and datasets demonstrate the effectiveness of the proposed method.
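The skeleton-based example selection and the "iterate until stable" loop mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify how skeletons are extracted, which similarity measure is used, or how the two similarities are weighted, so everything below is an assumption made for illustration: the keyword list, the Jaccard token similarity, the weight `alpha`, and the `generate` callback (a hypothetical stand-in for an LLM call) are not taken from the paper.

```python
import re

# Assumed, incomplete keyword list; a real system would use a SQL parser.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "join", "on",
    "and", "or", "not", "in", "limit", "count", "sum", "avg", "min", "max",
    "distinct", "as", "asc", "desc", "like", "between", "union", "intersect",
    "except",
}

def sql_skeleton(sql: str) -> str:
    """Keep keywords and operators; mask identifiers and literals as '_'."""
    # Mask quoted string literals first so their contents are not tokenized.
    sql = re.sub(r"'[^']*'|\"[^\"]*\"", " _ ", sql)
    tokens = re.findall(r"[A-Za-z_][\w.]*|\d+(?:\.\d+)?|[<>=!]+|[(),*]", sql)
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in SQL_KEYWORDS:
            out.append(low)
        elif re.fullmatch(r"[<>=!]+|[(),*]", tok):
            out.append(tok)
        elif not out or out[-1] != "_":
            out.append("_")  # collapse runs of masked tokens
    return " ".join(out)

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed, simple similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_examples(question, pseudo_sql, pool, k=2, alpha=0.5):
    """Rank pool entries by combined question and skeleton similarity."""
    skel = sql_skeleton(pseudo_sql)
    def score(ex):
        return (alpha * jaccard(question, ex["question"])
                + (1 - alpha) * jaccard(skel, sql_skeleton(ex["sql"])))
    ranked = sorted(range(len(pool)), key=lambda i: score(pool[i]), reverse=True)
    return ranked[:k]

def iterative_generation(question, pool, generate, k=2, max_rounds=5):
    """Alternate generation and example selection until the selection is stable."""
    # Round 0: no pseudo-SQL exists yet, so rank by question similarity only.
    chosen = sorted(range(len(pool)),
                    key=lambda i: jaccard(question, pool[i]["question"]),
                    reverse=True)[:k]
    sql = generate(question, [pool[i] for i in chosen])
    for _ in range(max_rounds):
        new_chosen = select_examples(question, sql, pool, k)
        if new_chosen == chosen:  # selection stabilized
            break
        chosen = new_chosen
        sql = generate(question, [pool[i] for i in chosen])
    return sql, chosen
```

In this sketch, `generate(question, examples)` would wrap a prompt to an LLM and return a SQL string; the loop stops as soon as two consecutive rounds pick the same examples, or after `max_rounds` as a safety bound.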
