Abstract:
Traditional Text-to-SQL approaches typically require large-scale annotated data for model training, which incurs significant time and labor costs. LLM-based data augmentation has emerged as an effective way to address this data scarcity. However, most existing Text-to-SQL augmentation methods rely heavily on the original dataset, and the synthetic data they produce often lacks reliability and interpretability. This study proposes SCoT-SQL, a chain-of-thought-guided data synthesis method. The approach combines natural language instructions with symbolic knowledge to guide LLMs and reduce semantic errors during data synthesis. It also applies model self-consistency and execution verification to improve data reliability. In addition, SCoT-SQL uses chain-of-thought reasoning and model self-feedback to calibrate the synthetic data and to supplement it with decision trajectories, further improving reliability and interpretability. Experiments show that SCoT-SQL improves execution accuracy (EX) by 6.5% and 3% on the KaggleDBQA and ScienceBenchmark datasets, respectively, outperforming previous state-of-the-art data synthesis methods.
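
As a rough illustration of how self-consistency and execution verification can be combined to filter synthetic SQL, the sketch below runs several candidate queries against a database, discards those that fail to execute, and keeps a candidate only if enough others return the same result. This is not the SCoT-SQL implementation: the function names (`execute`, `select_consistent`), the `min_votes` threshold, the toy `singer` table, and the use of an in-memory SQLite database are all assumptions made for the example, and the candidate list stands in for queries that an LLM would sample.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return an order-insensitive result, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # execution verification: drop queries that do not run
    # Sort by repr so rows containing NULLs or mixed types still compare.
    return tuple(sorted(rows, key=repr))


def select_consistent(conn, candidates, min_votes=2):
    """Return the first candidate whose result is shared by the most
    candidates, or None if no result reaches min_votes (hypothetical rule)."""
    executed = []
    for sql in candidates:
        result = execute(conn, sql)
        if result is not None:
            executed.append((sql, result))
    if not executed:
        return None
    votes = Counter(result for _, result in executed)
    winner, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    return next(sql for sql, result in executed if result == winner)


# Toy database standing in for a real benchmark schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
    """
)

# Candidates that an LLM might sample for "singers older than 40".
candidates = [
    "SELECT name FROM singer WHERE age > 40",
    "SELECT name FROM singer WHERE age >= 41",
    "SELECT name FROM singer WHERE age > 50",   # runs, but disagrees
    "SELECT nme FROM singer WHERE age > 40",    # fails to execute
]

chosen = select_consistent(conn, candidates)
print(chosen)
```

Voting on execution results rather than on SQL strings lets syntactically different but equivalent queries support each other, which is one common way to apply self-consistency to Text-to-SQL; agreement on a single small database is only weak evidence of semantic correctness, so a check like this would complement rather than replace the other calibration steps the abstract describes.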