JSON, the latest step in database technology
© Laurenz Albe 2021

The comprehensive JSON support in PostgreSQL is one of its best-loved features. Many people – particularly those with a stronger background in JavaScript programming than in relational databases – use it extensively. However, my experience is that the vast majority of people don’t use it correctly. That causes problems and unhappiness in the long run.

In this article, I will try to point out good and bad uses of JSON in PostgreSQL, and provide you with guidelines that you can follow.

A bad example

This data model exemplifies everything that you can do wrong:

/* this table is fine */
CREATE TABLE people (
   id bigint PRIMARY KEY,
   name text NOT NULL
);

INSERT INTO people VALUES
   (1, 'laurenz'),
   (2, 'julian'),
   (3, 'ants'),
   (4, 'kaarel');

/* this table is ill-defined */
CREATE TABLE rooms (
   id bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO rooms VALUES
   (1,
    '{ "name": "Room 1",
       "reservations": [
           { "who": 1, "from": "2021-06-01 09:00:00", "to": "2021-06-01 10:00:00" },
           { "who": 3, "from": "2021-06-01 10:00:00", "to": "2021-06-01 11:30:00" },
           { "who": 2, "from": "2021-06-01 13:15:00", "to": "2021-06-01 14:45:00" },
           { "who": 1, "from": "2021-06-01 15:00:00", "to": "2021-06-01 16:00:00" }
        ]
     }'),
   (2,
    '{ "name": "Room 2",
       "reservations": [
           { "who": 2, "from": "2021-06-01 09:30:00", "to": "2021-06-01 10:30:00" }
        ]
     }');

First mistake: model regular columns as JSON attributes

There is no reason not to have the room name as a regular column. After all, every room will have a name, and we may want to enforce constraints like uniqueness on the room name.
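
A better definition could look like this (a sketch – the reservations will get their own table in the next section):

CREATE TABLE rooms (
   id   bigint PRIMARY KEY,
   name text   UNIQUE NOT NULL
);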

Second mistake: model tabular data as a JSON array

The room reservations are perfectly regular tabular data that define a many-to-many relationship between the rooms and the people. It would have been simple to model the same data with a junction table:

/* no primary key - we'll get to that later */
CREATE TABLE reservations (
   people_id bigint REFERENCES people NOT NULL,
   room_id   bigint REFERENCES rooms  NOT NULL,
   reserved  tsrange                  NOT NULL
);

INSERT INTO reservations VALUES
   (1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),
   (3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),
   (2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),
   (1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),
   (2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');

Many people seem to think that storing a few large rows in a table is more efficient than storing many small rows. There is some truth to that, since every row has some overhead, and PostgreSQL compresses large data. But if you want to retrieve only parts of the data, or want to modify them, many small rows are much more efficient – as we will see below.

Third mistake: store foreign keys in JSON

The "who" attribute stores a foreign key reference to people. That is not a good idea, because it is impossible for the database to enforce such a constraint: I could just as well have inserted a reference to a non-existing person. With the junction table from above, defining a foreign key is trivial.
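
To illustrate (the person with id 42 is made up), the junction table rejects a dangling reference:

INSERT INTO reservations VALUES
   (42, 1, '[2021-06-02 09:00:00,2021-06-02 10:00:00)');

This fails with an error along the lines of “violates foreign key constraint”, while the JSON version would silently accept the bogus reference.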

Moreover, you often want to join on foreign keys. With JSON, that would require a cross join with the unnested JSON array:

SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS res(j)
   JOIN people
      ON res.j ->> 'who' = people.id::text;

With the junction table, that would be

SELECT rooms.name,
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;

You can probably guess which of these two queries will be more efficient.
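
If you would rather measure than guess, you can compare the execution plans of the two queries yourself (no plans shown here – they depend on your data volume):

EXPLAIN (ANALYZE)
SELECT rooms.name,
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;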

Fourth mistake: modify JSON data

If you want to add a new reservation, you have to execute a statement like

UPDATE rooms
SET data = jsonb_set(
              data,
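              /* an array index beyond the end appends the element */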
              '{reservations,100000}',
              '{"who": 3, "from": "2021-06-01 11:00:00", "to": "2021-06-01 12:00:00"}',
              TRUE
           )
WHERE id = 2;

This will fetch the complete JSON object, construct a new JSON object from it, and store the new object in the table. The whole JSON object has to be read and written, which is more I/O than you would want – particularly if the JSON object is large and stored out of line.

Compare how simple the same exercise would be with the junction table:

INSERT INTO reservations VALUES
   (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');

This statement will only write a small amount of data.

Deleting a reservation is just as complicated and expensive, as the sketch below shows.
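
Here is a sketch of one solution, removing all of person 2’s reservations from room 1. The entire array has to be rebuilt and written back:

UPDATE rooms
SET data = jsonb_set(
              data,
              '{reservations}',
              COALESCE(
                 (SELECT jsonb_agg(elem.j)
                  FROM jsonb_array_elements(data -> 'reservations') AS elem(j)
                  WHERE elem.j ->> 'who' <> '2'),
                 '[]'::jsonb
              )
           )
WHERE id = 1;

With the junction table, a small DELETE is all it takes:

DELETE FROM reservations
WHERE people_id = 2 AND room_id = 1;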

Fifth mistake: trying to enforce constraints on JSON

So far, our data model offers no protection against overlapping reservations, which would be good to enforce in the database.

With JSON, we are pretty much out of luck here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.

With the junction table, the exercise is simple; all we have to do is to add an exclusion constraint that checks for overlaps with the && operator:

CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (
   reserved WITH &&,
   room_id WITH =
);

The btree_gist extension is required because it provides the GiST operator classes for plain data types like bigint; without it, the room_id WITH = part of the constraint would not work.
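
With the constraint in place, an overlapping reservation (this conflicting row is made up) is rejected:

INSERT INTO reservations VALUES
   (4, 1, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');

This fails with an error along the lines of “conflicting key value violates exclusion constraint”, because the range overlaps person 1’s 09:00–10:00 slot in room 1.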

Sixth mistake: complicated searches in JSON

Simple searches for equality can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index. But complicated searches are a pain.
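
The simple case can look like this (a sketch – the GIN index is not part of the schema above):

CREATE INDEX ON rooms USING gin (data);

/* find rooms where person 2 has a reservation */
SELECT id
FROM rooms
WHERE data @> '{"reservations": [{"who": 2}]}';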

Imagine we want to search for all rooms that are occupied at 2021-06-01 15:30:00. With JSON, that would look something like

SELECT id
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS elem(j)
WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'
  AND CAST(elem.j ->> 'to'   AS timestamp) >  TIMESTAMP '2021-06-01 15:30:00';

With our junction table, the query becomes

SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';

That query can use the GiST index from the exclusion constraint we created above.

If all the above is wrong, should we use JSON in PostgreSQL at all?

Don’t get me wrong: JSON support in PostgreSQL is a wonderful thing. It is just that many people don’t understand how to use it right. For example, the majority of questions about PostgreSQL and JSON asked on Stack Overflow are about problems that arise from using JSON where it would have been better avoided.

Follow these guidelines when you consider using JSON in PostgreSQL:

  • Don’t use JSON for data that can easily be stored in database tables.
  • Avoid large JSON objects if you want to modify individual attributes.
  • Don’t use JSON if you want to use attributes in complicated WHERE conditions.
  • Avoid JSON if you want to join on some of the attributes.
  • Don’t use JSON if you want constraints on the data.
  • Don’t store references to other tables in JSON attributes.

Often it is a good idea to store some attributes as regular table columns and others in a JSON column. The less you need to process the attributes inside the database, the better candidates they are for storage as JSON.

A good example

To show how JSON in the database can be put to good use, let us consider a shop that sells all kinds of mixed goods. There are some properties that all or most of the goods will have, like price, weight, manufacturer, number available or package size. Other attributes may be rare, like the type of power plug, or ambiguous, like the pitch of screws or tuning forks.

Rather than defining a table with hundreds of columns for all the possible attributes, most of which will be NULL, we model the most frequent attributes as normal table columns and use a jsonb column for the rest:

CREATE TABLE wares (
   id           bigint           PRIMARY KEY,
   name         text             UNIQUE NOT NULL,
   price        numeric(15,2)    NOT NULL,
   weight_kg    double precision NOT NULL,
   available    integer          NOT NULL
      CHECK (available >= 0),
   manufacturer text,
   package_size integer          NOT NULL
      CHECK (package_size >= 0),
   attributes   jsonb
);

/* to search in "attributes" */
CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on "name" */
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);
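
To make the query below concrete, here are two made-up sample rows:

INSERT INTO wares VALUES
   (1, 'tuning fork 440 Hz', 9.99, 0.05, 23, NULL, 1,
    '{"pitch": "a"}'),
   (2, 'tea cup', 4.50, 0.23, 150, NULL, 6, NULL);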

This will allow efficient queries like

SELECT name, price, available
FROM wares
WHERE name LIKE '%tuning fork%'
  AND attributes @> '{"pitch": "a"}';

Conclusion

There is a lot you can do wrong when using JSON in PostgreSQL, especially if you are not acquainted with relational databases. However, it can be a powerful tool – when used properly.