@clem on Hugging Face: "Introducing https://huggingface.co/datasets/gretelai/synthetic_text_to

clem

posted an update Apr 4

Introducing gretelai/synthetic_text_to_sql by https://huggingface.co/gretelai

It stands as the largest and most diverse synthetic Text-to-SQL dataset available to date.

The dataset includes:

- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct domains/verticals
- Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting
- Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations
- Database context, including table and view create statements
- Natural language explanations of what the SQL query is doing
- Contextual tags to optimize model training

Blogpost: https://gretel.ai/blog/synthetic-text-to-sql-dataset
Dataset: gretelai/synthetic_text_to_sql
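
For readers who want to try the records listed above, here is a minimal sketch of turning one record into a prompt/completion pair for fine-tuning. The field names (`sql_prompt`, `sql_context`, `sql`, `sql_explanation`, and so on) are assumptions about the dataset schema and should be checked against the dataset card; `SAMPLE_RECORD` is a hypothetical, hand-written record, not a row from the dataset, and the prompt layout is only one possible choice.

```python
from typing import Dict

# Hypothetical record for illustration only. The keys are ASSUMED field names;
# verify them against the dataset card before relying on them.
SAMPLE_RECORD: Dict[str, str] = {
    "domain": "retail",
    "sql_complexity": "aggregation",
    "sql_task_type": "analytics and reporting",
    "sql_prompt": "What is the total revenue per store?",
    "sql_context": "CREATE TABLE sales (store_id INT, amount DECIMAL(10,2));",
    "sql": "SELECT store_id, SUM(amount) FROM sales GROUP BY store_id;",
    "sql_explanation": "Sums the amount column for each store_id.",
}


def load_split(split: str = "train"):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("gretelai/synthetic_text_to_sql", split=split)


def build_example(record: Dict[str, str]) -> Dict[str, str]:
    """Format a record as a prompt/completion pair (layout is an arbitrary choice)."""
    prompt = (
        f"-- Schema:\n{record['sql_context']}\n"
        f"-- Question: {record['sql_prompt']}\n"
        "-- SQL:"
    )
    # Leading space so prompt + completion concatenate cleanly.
    return {"prompt": prompt, "completion": " " + record["sql"]}


if __name__ == "__main__":
    example = build_example(SAMPLE_RECORD)
    print(example["prompt"] + example["completion"])
```

With network access, `build_example(load_split("test")[0])` would apply the same formatting to a real record, assuming the field names match.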

ichigoberry

Apr 4

Interesting, thanks for sharing. I've been looking for something like this. The LLM-as-judge approach is interesting. I wonder whether there are other ways to generate a synthetic dataset like this, for example starting from SQL queries that are known to work and having strong LLMs describe them 🤔
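
One way to read the idea in this comment, sketched under stated assumptions: keep only queries that actually execute against their schema, then ask a model to describe them. This is not how the dataset was built. `query_runs` and `describe_working_queries` are hypothetical helper names, the `describe` callable stands in for whatever LLM call one would use, and SQLite is used only because it ships with Python, so queries written for other SQL dialects may be rejected even when they are valid there.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List, Tuple


def query_runs(context_sql: str, query: str) -> bool:
    """Return True if `query` executes against the schema in `context_sql` (SQLite only)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context_sql)  # create the tables/views from the context
        conn.execute(query).fetchall()   # run the candidate query
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def describe_working_queries(
    pairs: Iterable[Tuple[str, str]],
    describe: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Keep (context, query) pairs that execute, and attach a description to each.

    `describe` is a placeholder for an LLM call; it receives the context and
    the query and returns a natural language explanation.
    """
    records = []
    for context_sql, query in pairs:
        if query_runs(context_sql, query):
            records.append(
                {
                    "sql_context": context_sql,
                    "sql": query,
                    "sql_explanation": describe(context_sql, query),
                }
            )
    return records
```

Execution only shows that a query is syntactically valid and matches the schema; it does not show that the generated description is faithful, so a review step would still be needed.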
