Using AI to generate data structures for PostgreSQL

Artificial Intelligence is the motto of the day. Everybody is talking about it, everybody seems to know what it means and the insane number of buzzwords floating around on the internet is just mind-boggling. It seems that in some areas of the industry, the number of slides and marketing videos exceeds the number of real-world use cases and sometimes even the amount of code that actually exists.

Table of Contents

However, as I tend to prefer the real PostgreSQL world over the fake marketing side of things, I decided to show some of the things we recently experimented with to give you some insights into what can be done with AI and to point out where the limitations are.

First step: Generating data models for PostgreSQL

One of the most obvious things one might do with AI is to help people create basic data models. This use case presents itself as a perfect fit for all the new LLMs (Large Language Models) including ChatGPT and similar models.

For this little demo, I used a local deployment of “llama3”, which has been provided by Meta. It is a general-purpose model, but it happens to do a decent job and produces reasonable drafts that can serve as the basis for expanding your SQL code quickly and efficiently.

How can anybody make use of an LLM when dealing with PostgreSQL? Well, the first thing to do is to install an Open Source tool called “ollama”. It will help you to download, deploy, and run the language model of your choice. Ollama runs as a server and offers a really simple interface. I ran these models on my local Apple M1 chip. OS X is able to automatically include your GPU. On Linux, a bit more effort is needed to run on your NVIDIA chip, but you will also be awarded with better response times in general, assuming your are not utilizing an outdated chip.

In general, I found it useful to use Python to interact with AI models. However, for testing, the command line works just fine.

Generating PostgreSQL code using Ollama:

Before we dive into the Python code we can use the command line interface to generate some basic stuff:

[hs@MacBook-Air ~]$ ollama run llama3
>>> Create a data model for PostgreSQL including indexes to store addresses.

1 2	[hs@MacBook-Air ~]$ ollama run llama3 >>> Create a data model for PostgreSQL including indexes to store addresses.

As you can see Ollama can be started easily and we can simply ask a question.

What comes back is really interesting. Basically we get a markdown document which we can go through step by step:

Here is an example of a data model for storing addresses in PostgreSQL:
```
CREATE TABLE addresses (
id SERIAL PRIMARY KEY,
street TEXT NOT NULL,
city TEXT NOT NULL,
state TEXT NOT NULL,
zip_code INTEGER NOT NULL,
country TEXT NOT NULL
);



CREATE INDEX idx_addresses_street ON addresses USING BTREE (street);
CREATE INDEX idx_addresses_city ON addresses USING BTREE (city);
CREATE INDEX idx_addresses_state ON addresses USING BTREE (state);
CREATE INDEX idx_addresses_zip_code ON addresses USING BTREE (zip_code);
```

Here is an example of a data model for storing addresses in PostgreSQL:

```

CREATE TABLE addresses (

id SERIAL PRIMARY KEY,

street TEXT NOT NULL,

city TEXT NOT NULL,

state TEXT NOT NULL,

zip_code INTEGER NOT NULL,

country TEXT NOT NULL

);

CREATE INDEX idx_addresses_street ON addresses USING BTREE (street);

CREATE INDEX idx_addresses_city ON addresses USING BTREE (city);

CREATE INDEX idx_addresses_state ON addresses USING BTREE (state);

CREATE INDEX idx_addresses_zip_code ON addresses USING BTREE (zip_code);

```

The first part already contains a warm welcome and a reasonably good SQL model which is decently indexed.

But, there is more: We will also get a full explanation of what the model has already done for us. Check it out. Let me explain the design choices:

1. **Table name**: I chose `addresses` as the table name, 
       which is self-explanatory.
2. **Columns**:
	* `id`: A serial primary key column to uniquely identify 
           each address record. This will automatically increment 
           for each new row inserted into the table.
	* `street`, `city`, `state`, and `country`: These are 
           the main attributes of an address, which we want 
           to store.
	* `zip_code`: This is a numerical attribute that can 
           be used to quickly filter or search for addresses 
           within a specific geographic region.
3. **Indexes**: I created four indexes to improve query performance:
	+ `idx_addresses_street`: An index on the `street` 
           column to enable fast searching and filtering 
           by street name.
	+ `idx_addresses_city`: An index on the `city` 
            column to support queries that filter or sort 
            addresses by city.
	+ `idx_addresses_state`: An index on the `state` 
            column to facilitate searches or aggregations based on state.
	+ `idx_addresses_zip_code`: An index on the `zip_code` 
            column to speed up queries that involve searching 
            or filtering by zip code.

1. **Table name**: I chose `addresses` as the table name,

which is self-explanatory.

2. **Columns**:

* `id`: A serial primary key column to uniquely identify

each address record. This will automatically increment

for each new row inserted into the table.

* `street`, `city`, `state`, and `country`: These are

the main attributes of an address, which we want

to store.

* `zip_code`: This is a numerical attribute that can

be used to quickly filter or search for addresses

within a specific geographic region.

3. **Indexes**: I created four indexes to improve query performance:

+ `idx_addresses_street`: An index on the `street`

column to enable fast searching and filtering

by street name.

+ `idx_addresses_city`: An index on the `city`

column to support queries that filter or sort

addresses by city.

+ `idx_addresses_state`: An index on the `state`

column to facilitate searches or aggregations based on state.

+ `idx_addresses_zip_code`: An index on the `zip_code`

column to speed up queries that involve searching

or filtering by zip code.

These indexes will help with common use cases, such as:

Searching for addresses within a specific street range
Retrieving all addresses in a particular city or state
Filtering addresses based on their zip code

Note: I used the BTREE index type, which is suitable for most use cases. If you expect very large datasets or need more advanced

indexing features, you may want to consider using other index types (e.g., Hash, GiST) or more complex indexing strategies.

Please adjust this design according to your specific requirements and data constraints!

What we got here is really good information that can already serve as some kind of documentation. At the very least we can use it to adjust and to add more information. Clearly, the amount of effort required to come up with this content is considerably lower than if we had produced all that by hand. Usually, the code works in PostgreSQL out of the box, but we have also seen that the AI creates tables in the wrong order in case foreign keys are present. Therefore, it may make sense to create the tables in the first step and then ask the AI to create and generate keys and indexes in a second phase.

Using Python and LLAMA3 to generate SQL code

The following code contains a small Python program to interact with the AI. It sends a sentence to the large language model, which returns a variety of content, including text and code. Therefore, we should extract the SQL part from the result because we might want to deploy that code and test it directly:

#!/usr/bin/python3

import re
import ollama

def get_model_result(prompt, model):
	res = ollama.chat(model, messages=[{
    	'role': 'user',
    	'content': prompt,
    	"format": "json",
    	"raw": "False"
	}])

	return res['message']['content']

def extract_code_from_markdown(markdown_text):
	code_blocks = []
	for block in re.findall(r'```(.*?)```', markdown_text, re.DOTALL):
    	if block.startswith("sql"):
        	code_blocks.append(block[3:])
    	else:
        	code_blocks.append(block)
	result = ''.join(code_blocks)
	return result

model_name = "llama3"
user_input = """
  Make me an SQL model for PostgreSQL including indexes
  and constraints which can store:
  customers, products, prices and product categories
"""

j = get_model_result(user_input, model_name)
res = extract_code_from_markdown(j)

print(res)

#!/usr/bin/python3

import re

import ollama

def get_model_result(prompt, model):

res = ollama.chat(model, messages=[{

'role': 'user',

'content': prompt,

"format": "json",

"raw": "False"

}])

return res['message']['content']

def extract_code_from_markdown(markdown_text):

code_blocks = []

for block in re.findall(r'```(.*?)```', markdown_text, re.DOTALL):

if block.startswith("sql"):

code_blocks.append(block[3:])

else:

code_blocks.append(block)

result = ''.join(code_blocks)

return result

model_name = "llama3"

user_input = """

Make me an SQL model for PostgreSQL including indexes

and constraints which can store:

customers, products, prices and product categories

"""

j = get_model_result(user_input, model_name)

res = extract_code_from_markdown(j)

print(res)

The get_model_result function simply takes user input and sends it to our desired model. The second function, (extract_code_from_markdown), dissects the result and extracts the SQL code. Of course, one might also write some code to extract the rest of the result, but for now, this seems easy enough. Finally, we define a question we want to ask, run the model, and display the result.

What we get might look like this:

CREATE TABLE customers (
	id SERIAL PRIMARY KEY,
	name VARCHAR(255) NOT NULL,
	email VARCHAR(255) UNIQUE NOT NULL,
	phone VARCHAR(20),
	address VARCHAR(255)
);

CREATE TABLE product_categories (
	id SERIAL PRIMARY KEY,
	name VARCHAR(255) NOT NULL
);

CREATE TABLE products (
	id SERIAL PRIMARY KEY,
	name VARCHAR(255) NOT NULL,
	description TEXT,
	category_id INTEGER NOT NULL 
REFERENCES product_categories(id),
	created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
	updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE prices (
	id SERIAL PRIMARY KEY,
	product_id INTEGER NOT NULL REFERENCES products(id),
	price DECIMAL(10, 2) NOT NULL,
	effective_date DATE NOT NULL DEFAULT CURRENT_DATE,
	expires_date DATE,
	PRIMARY KEY (product_id, effective_date)
);

CREATE INDEX idx_customers_email ON customers (email);

CREATE INDEX idx_products_category_id 
ON products (category_id);

CREATE INDEX idx_prices_product_id_date 
ON prices (product_id, effective_date);

CREATE TABLE customers (

id SERIAL PRIMARY KEY,

name VARCHAR(255) NOT NULL,

email VARCHAR(255) UNIQUE NOT NULL,

phone VARCHAR(20),

address VARCHAR(255)

);

CREATE TABLE product_categories (

id SERIAL PRIMARY KEY,

name VARCHAR(255) NOT NULL

);

CREATE TABLE products (

id SERIAL PRIMARY KEY,

name VARCHAR(255) NOT NULL,

description TEXT,

category_id INTEGER NOT NULL

REFERENCES product_categories(id),

created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);

CREATE TABLE prices (

id SERIAL PRIMARY KEY,

product_id INTEGER NOT NULL REFERENCES products(id),

price DECIMAL(10, 2) NOT NULL,

effective_date DATE NOT NULL DEFAULT CURRENT_DATE,

expires_date DATE,

PRIMARY KEY (product_id, effective_date)

);

CREATE INDEX idx_customers_email ON customers (email);

CREATE INDEX idx_products_category_id

ON products (category_id);

CREATE INDEX idx_prices_product_id_date

ON prices (product_id, effective_date);

Wow, this result is really nice and helps us to get some inspiration to expand the data model even more. Of course, nothing is perfect, but come on! All we fed to the model is one sentence. Do you expect a human to create something better if there is zero customer interaction, zero knowledge about the subject, and no background knowledge?

Our research shows that there is certainly a lot of potential in the field and we are in the final stages of adding those features to tools such as CYPEX.

The beauty here is that you can run the model locally, and there is no need to procure software or purchase subscriptions, which have value in a secure environment.

Finally …

If you want to find some more information about a more traditional method to dealing with trends and similar topics, I suggest reading my blog about string encoding in SQL to deal with time series.

Using AI to generate data structures for PostgreSQL

First step: Generating data models for PostgreSQL

Generating PostgreSQL code using Ollama:

Using Python and LLAMA3 to generate SQL code

Finally …

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts