Structured Query Language (SQL) is complex. Without GenAI Studio, you would need an indepth understanding of databases and metadata. GenAI Studio aims to generate SQL queries from natural language datasets and convert the queries into semantically-correct SQL for you.
By following these step-by-step instructions, you’ll see how easy it is to build a plain-text-to-SQL solution that generates semantically-correct SQL queries for you from your dataset. All you have to do is create a new project, connect to a Hugging Face dataset, load a base model, and begin prompting in the playground. You can even fine-tune your model to make it even more customized!
When you have completed this tutorial, you will have created a project and prompted a chat model to generate SQL queries. We’ll cover all of the following topics:
Creating a Project
Importing a Pre-Loaded Hugging Face Dataset
Linking the Dataset to Your Project
Creating a Text-to-SQL Prompt
Generating SQL Queries from Plain Text
GenAI Studio guides you through going from basic prompt design to pretty good model performance. Its built-in guardrails help prevent memory issues and help prevent you from choosing the wrong parameters.
If you’re starting from the Machine Learning Development Environment cluster home page, selecting GenAI will open the Generative AI Studio home page in your browser. Similarly, from your project home page, selecting MLDE will open Machine Learning Development Environment in your browser.
We’ll import a dataset that is particularly useful for training a model how to create SQL queries from text because it was built for exactly that! From this dataset, you’ll see how to work with your own datasets because it is a good example.
Navigate to the Datasets tab.
Select New Dataset.
Provide the following Hugging Face dataset inputs:
Dataset name: determined-ai/text-to-sql-easy
Description: text to sql queries
Select Create Dataset.
About the Training Set
In machine learning, we use a training set to teach the model how to perform a task, a validation set to fine-tune the model and make sure it isn’t memorizing answers but actually learning, and a test set to evaluate the model’s performance on completely new, unseen data, ensuring it can effectively handle real-world requests to create SQL queries.
The text-to-sql-easy dataset is already split into train (10%), validation (10%), and test (10%). The dataset consists of the following features:
Instruction: The instruction column contains requests or questions written in everyday language and serves as the basis for generating SQL queries.
Input: The input column is the blueprint of our dataset and is defined as a collection of CREATE TABLE statements. These statements lay out how our data is organized–what tables exist, and what columns they contain.
Response: The response is the end product or target. It contains the actual SQL queries that have been expertly crafted to fetch the information asked for in the instruction column.
Your screen should look similar to this:
You now have a dataset ready for linking to snapshots in your project.
The snapshot playground is where you’ll interact with GenAI Studio’s base models to get an idea of how they perform with certain settings and prompts. Once you find a combination that works well, you can save the snapshot and use it to create a new model.
Navigate to the Snapshots tab from your project dashboard.
Select New Playground to open a new snapshot Playground.
Select a Foundation (Base) Model and Resource Pool#
First, you’ll need to select a foundation (base) model. You’ll find a curated list of foundation models to choose from. Selecting a foundation model is required because it serves as the underlying engine that drives the generative model’s performance.
Choose Select model.
In Model properties, select a chat model such as meta-llama/Llama-2-7b-chat-hf main.
Select a Resource Pool, e.g., a100.
Select Load.
When selecting a base model, you generally want to select the most appropriate model based on what you need the model to do in your specific project or task.
You are a helpful programmer assistant that excels at SQL. When prompted with a task and a definition of an SQL table, you respond with a SQL query to retrieve information from the table. Don’t explain your reasoning, only provide the SQL query.
Examples
Below is an example:
Instruction: {{instruction}}
SQL table: {{input}}
SQL query: {{response}}
After you’ve put everything in, the model will give its response. This is where you see how well it understood your request.
Select Generate.
Review the model’s response in the output.
Select Save Snapshot and provide a meaningful name for your snapshot, such as snapshot1.
Select Save.
Saving a snapshot of the playground lets us revisit this exact point in the training process without needing to start over. This is particularly useful for iterating on our model’s performance, because we can make adjustments and immediately see how they impact the results.
Congratulations!
Congratulations! You’ve just mastered the art of creating SQL queries with GenAI Studio!
Creating a Project: You set up your own project, establishing a workspace where all your SQL generation activities are centralized.
Importing and Linking Datasets: You imported the “determined-ai/text-to-sql-easy” dataset from Hugging Face, which is specifically tailored for generating SQL queries from plain text. You successfully linked this dataset to your project, setting the stage for practical, hands-on learning.
Crafting and Testing Prompts: Using the snapshot playground, you crafted a prompt that guides the model to convert plain text instructions into precise SQL queries. You’ve seen firsthand how to formulate requests that the model can understand and execute.
Generating SQL Queries: You initiated the generation of SQL queries by engaging with the base model, observing how the model interprets and responds to your inputs.
Snapshot Management: You created and saved snapshots, a crucial feature that allows you to save your work and revisit it anytime. This facilitates iterative improvements and fine-tuning of the model’s performance.
Iterative Learning and Improvement: Throughout the tutorial, GenAI Studio’s built-in guardrails helped you prevent common mistakes like memory overloads or suboptimal parameter selection, ensuring a smooth and productive learning experience.