Autogenerated primary keys violate human rights!
© Laurenz Albe 2021

Sometimes customers ask me about the best choice for auto-generated primary keys. In this article, I’ll explore the options and give recommendations.

Why auto-generated primary keys?

Every table needs a primary key. In a relational database, it is important to be able to identify an individual table row. If you wonder why, search the internet for the thousands of questions asking for help with removing duplicate entries from a table.

You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.

Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens. But typically, there is no such attribute, and you have to generate an artificial primary key. Some people even argue that you should use an artificial primary key even if there is a natural one, but I won’t go into that “holy war”.

Techniques for auto-generated primary keys in PostgreSQL

There are two basic techniques:

Generating keys with a sequence

A sequence is a database object whose sole purpose in life is to generate unique numbers. It does this using an internal counter that it increments.

Sequences are highly optimized for concurrent access, and they will never issue the same number twice. Still, accessing a sequence from many concurrent SQL statements could become a bottleneck, so there is the CACHE option that makes the sequence hand out several values at once to database sessions.

Sequences don’t follow the normal transactional rules: if a transaction rolls back, the sequence does not reset its counter. This is required for good performance, and it does not constitute a problem. If you are looking for a way to generate a gapless sequence of numbers, a sequence is not the right choice, and you will have to resort to less efficient and more complicated techniques.

To fetch the next value from a sequence you use the nextval function like this:

SELECT nextval('sequence_name');

See the documentation for other functions to manipulate sequences.

Generating UUIDs

A UUID (universally unique identifier) is a 128-bit number that is generated with an algorithm that effectively guarantees uniqueness. There are several standardized algorithms for that. In PostgreSQL, there are a number of functions that generate UUIDs:

  • The uuid-ossp extension offers functions to generate UUIDs. Note that because of the hyphen in the name, you have to quote the name of the extension (CREATE EXTENSION "uuid-ossp";).
  • From PostgreSQL v13 on, you can use the core function gen_random_uuid() to generate version-4 (random) UUIDs.

Note that you should always use the PostgreSQL data type uuid for UUIDs. Don’t try to convert them to strings or numeric — you will waste space and lose performance.

Defining auto-generated primary keys

There are four ways to define a column with automatically generated values:

Using the DEFAULT clause

You can use this method with sequences and UUIDs. Here are some examples:

CREATE TABLE has_integer_pkey (
   id bigint DEFAULT nextval('integer_id_seq') PRIMARY KEY,
   ...
);

CREATE TABLE has_uuid_pkey (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
   ...
);

PostgreSQL uses the DEFAULT value whenever the INSERT statement doesn’t explicitly insert that column.

Using the serial and bigserial pseudo-types

This method is a shortcut for defining a sequence and setting a DEFAULT clause as above. With this method, you define a table as follows:

CREATE TABLE uses_serial (
   id bigserial PRIMARY KEY,
   ...
);

That is equivalent to the following:

CREATE TABLE uses_serial (
   id bigint PRIMARY KEY,
   ...
);

CREATE SEQUENCE uses_serial_id_seq
   OWNED BY uses_serial.id;

ALTER TABLE uses_serial ALTER id
   SET DEFAULT nextval('uses_serial_id_seq');

The “OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.

Using serial will create an integer column, while bigserial will create a bigint column.

Using identity columns

This is another way to use a sequence, because PostgreSQL uses sequences “behind the scenes” to implement identity columns.

CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY
             PRIMARY KEY,
   ...
);

There is also “GENERATED BY DEFAULT AS IDENTITY”, which is the same, except that you won’t get an error message if you try to explicitly insert a value for the column (much like with a DEFAULT clause). See below for more!

You can specify sequence options for identity columns:

CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY
             (MINVALUE 0 START WITH 0 CACHE 20)
             PRIMARY KEY,
   ...
);

Using BEFORE INSERT triggers

This is similar to DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.

Should I use integer(serial) or bigint(bigserial) for my auto-generated primary key?

You should always use bigint.

True, an integer occupies four bytes, while a bigint needs eight. But:

  • If you have a small table, where integer would suffice, the four wasted bytes won’t matter much. Also, not every table that you designed to be small will remain small!
  • If you have a big table, you might exceed the maximum for integer, which is 2147483647. Note that that could also happen if your table contains fewer rows than that: you might delete rows, and some sequence values can get “lost” by transactions that are rolled back.
    Now it is quite complicated to change the primary key column from integer to bigint in a big table inside an active database without causing excessive down time, so you should save yourself that pain.

With bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.

Should I use bigserial or an identity column for my auto-generated primary key?

You should use an identity column, unless you have to support old PostgreSQL versions.

Identity columns were introduced in PostgreSQL v11, and they have two advantages over bigserial:

  • They comply with the SQL standard, while bigserial is proprietary PostgreSQL syntax. This will make your code more portable.
  • If you use GENERATED ALWAYS AS IDENTITY, you will get an error message if you try to override the generated value by explicitly inserting a number. This avoids the common problem that manually entered values will conflict with generated values later on, causing surprising application errors.

So unless you have to support PostgreSQL v10 or below, there is no reason to use bigserial.

Should I use bigint or uuid for an auto-generated primary key?

My advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a “decentralized” fashion (outside a single database).

Real differences

The advantages of bigint are clear:

  • bigint uses only eight bytes, while uuid uses 16
  • fetching a value from a sequence is cheaper than calculating a UUID

One disadvantage of using a sequence is that it is a single object in a single database. So if you use sharding, where you distribute your data across several databases, you cannot use a sequence. In such a case, UUIDs are an obvious choice. (You could use sequences defined with an INCREMENT greater than 1 and different START values, but that might lead to problems when you add additional shards.)

Of course, if your primary key is not auto-generated by the database, but created in an application distributed across several application servers, you will also prefer UUIDs.

Imaginary differences

There are people that argue that UUIDs are better, because they spread the writes across different pages of the primary key index. That is supposed to reduce contention and lead to a more balanced or less fragmented index. The first is true, but that may actually be a disadvantage, because it requires the whole index to be cached for good performance. The second is definitely wrong, since B-tree indexes are always balanced. Also, a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could (but subsequent deletes will of course cause fragmentation). In short, any such advantages are either marginal or non-existent, and they are more than balanced by the fact that uuid uses twice as much storage, which will make the index bigger, causing more writes and occupying more of your cache.

Benchmarking bigint versus uuid

My co-worker Kaarel ran a small performance test a while ago and found that uuid was slower than bigint when it came to bigger joins.

I decided to run a small insert-only benchmark with these two tables:

CREATE UNLOGGED TABLE test_bigint (
   id bigint GENERATED ALWAYS AS IDENTITY (CACHE 200) PRIMARY KEY
);

CREATE UNLOGGED TABLE test_uuid (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY
);

I performed the benchmark on my laptop (SSD, 8 cores) with a pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared INSERT statements for five minutes:

INSERT INTO test_bigint /* or test_uuid */ DEFAULT VALUES;
Performance comparison bigint versus uuid autogenerated primary keys
bigintuuid
inserts per second10709074947
index growth per row30.5 bytes41.7 bytes

Using bigint clearly wins, but the difference is not spectacular.

Conclusion

Numbers generated by a sequence and UUIDs are both useful as auto-generated primary keys.

Use identity columns unless you need to generate primary keys outside a single database, and make sure all your primary key columns are of type bigint.