SCoT-SQL: A Chain-of-Thought Guided Approach for Text-to-SQL Data Synthesis

任千昊; 刘睿珩; 张宇

doi:10.7544/issn1000-1239.202550826


Abstract

Text-to-SQL models generally require large amounts of training data, and annotating that data costs considerable time and labor. LLM-based data augmentation is an effective way to alleviate this scarcity. However, most existing LLM-based augmentation methods for Text-to-SQL depend on the original dataset, and the data they synthesize lacks reliability and interpretability. To address these problems, this paper proposes SCoT-SQL, a chain-of-thought-guided data synthesis method. The method combines natural language instructions with structured knowledge to guide the LLM and reduce semantic errors during synthesis, and it applies model self-consistency together with execution verification to improve the quality of the synthesized data. It further uses chain-of-thought reasoning and model self-feedback to calibrate the synthesized data and to supplement it with the reasoning chains behind the synthesis, which further improves reliability and interpretability. Experiments show that SCoT-SQL improves execution accuracy (EX) over the previous state-of-the-art data synthesis method by 6.5% on KaggleDBQA and 3% on ScienceBenchmark.
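The abstract names self-consistency and execution verification as the quality filters but gives no implementation details. The following is a minimal sketch of how such a filter could look, not the authors' method: it assumes several candidate SQL queries have already been sampled for one question, runs each against a SQLite database, drops those that fail to execute, and keeps a query only if its result agrees with at least one other candidate. The function names, the `min_votes` threshold, and the toy `city` table are all hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Execution verification: return a hashable result signature, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so row order does not affect equality and mixed types do not raise.
    return tuple(sorted(rows, key=repr))


def select_consistent_sql(conn, candidates, min_votes=2):
    """Self-consistency: return the first candidate whose result is shared by the most candidates."""
    signatures = {}
    for sql in candidates:
        sig = execute(conn, sql)
        if sig is not None:  # discard queries that do not execute
            signatures[sql] = sig
    if not signatures:
        return None
    best_sig, count = Counter(signatures.values()).most_common(1)[0]
    if count < min_votes:  # hypothetical threshold: require some agreement
        return None
    for sql in candidates:
        if signatures.get(sql) == best_sig:
            return sql
    return None


# Toy database and candidates, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE city (name TEXT, population INTEGER);
    INSERT INTO city VALUES ('A', 100), ('B', 250), ('C', 40);
    """
)
candidates = [
    "SELECT name FROM city WHERE population > 50",
    "SELECT name FROM city WHERE population >= 100",
    "SELECT name FROM city WHERE population > 200",
    "SELECT nme FROM city",  # invalid column, rejected by execution verification
]
chosen = select_consistent_sql(conn, candidates)
```

Voting on execution results rather than on SQL text lets syntactically different but equivalent queries support each other. Agreement on a single small database is only weak evidence of semantic equivalence, so a filter like this would normally be one check among several.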
