
Towards Automating Domain-Specific Data Generation for Text-to-SQL: A Comprehensive Approach

  • Salmane Chafik*
  • Saad Ezzini
  • Ismail Berrada

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

As software systems increasingly rely on natural language interfaces, ensuring the reliability of these systems is crucial. One critical component is the ability to accurately translate natural language queries into corresponding SQL queries, a task known as Text-to-SQL. However, the scarcity of high-quality, large-scale, domain-specific Text-to-SQL datasets hinders the development of reliable and robust models. To address this challenge, we propose SelectCraft, a novel automatic generation approach designed to create realistic Text-to-SQL datasets tailored to specific domains. Our method leverages existing databases and their structures to generate complex text-SQL pairs that mirror real-world usage scenarios. As a proof of concept, we used this approach to generate a large financial Text-to-SQL dataset, named BanQies, comprising over 1 million samples. Moreover, we introduce BanQL, a new large language model (LLM) based on StarCoder2, a state-of-the-art code LLM, and fine-tuned on our newly created dataset. We evaluate BanQL's performance against several state-of-the-art models, demonstrating significant improvements in accuracy and generalizability and highlighting the advantages of incorporating domain-specific data in Text-to-SQL tasks. We believe that our contributions have the potential to improve the overall reliability of Text-to-SQL software systems.

Original language: English
Article number: 96
Journal: ACM Transactions on Software Engineering and Methodology
Volume: 35
Issue number: 4
State: Published - Apr 2026

Bibliographical note

Publisher Copyright:
© 2026 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Keywords

  • Code Generation
  • Data Generation
  • SQL-to-Text
  • Synthetic Data
  • Text-to-SQL

ASJC Scopus subject areas

  • Software
