© Laurenz Albe 2021
Sometimes customers ask me about the best choice for auto-generated primary keys. In this article, I’ll explore the options and give recommendations.
Why auto-generated primary keys?
Every table needs a primary key. In a relational database, it is important to be able to identify an individual table row. If you wonder why, search the internet for the thousands of questions asking for help with removing duplicate entries from a table.
You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.
Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens. But typically, there is no such attribute, and you have to generate an artificial primary key. Some people even argue that you should use an artificial primary key even if there is a natural one, but I won’t go into that “holy war”.
Techniques for auto-generated primary keys in PostgreSQL
There are two basic techniques:
Generating keys with a sequence
A sequence is a database object whose sole purpose in life is to generate unique numbers. It does this using an internal counter that it increments.
Sequences are highly optimized for concurrent access, and they will never issue the same number twice. Still, accessing a sequence from many concurrent SQL statements could become a bottleneck, so there is the
CACHE option that makes the sequence hand out several values at once to database sessions.
Sequences don’t follow the normal transactional rules: if a transaction rolls back, the sequence does not reset its counter. This is required for good performance, and it does not constitute a problem. If you are looking for a way to generate a gapless sequence of numbers, a sequence is not the right choice, and you will have to resort to less efficient and more complicated techniques.
To fetch the next value from a sequence you use the
nextval function like this:
See the documentation for other functions to manipulate sequences.
A UUID (universally unique identifier) is a 128-bit number that is generated with an algorithm that effectively guarantees uniqueness. There are several standardized algorithms for that. In PostgreSQL, there are a number of functions that generate UUIDs:
uuid-osspextension offers functions to generate UUIDs. Note that because of the hyphen in the name, you have to quote the name of the extension (
CREATE EXTENSION "uuid-ossp";).
- From PostgreSQL v13 on, you can use the core function
gen_random_uuid()to generate version-4 (random) UUIDs.
Note that you should always use the PostgreSQL data type
uuid for UUIDs. Don’t try to convert them to strings or
numeric — you will waste space and lose performance.
Defining auto-generated primary keys
There are four ways to define a column with automatically generated values:
You can use this method with sequences and UUIDs. Here are some examples:
CREATE TABLE has_integer_pkey ( id bigint DEFAULT nextval('integer_id_seq') PRIMARY KEY, ... ); CREATE TABLE has_uuid_pkey ( id uuid DEFAULT gen_random_uuid() PRIMARY KEY, ... );
PostgreSQL uses the
DEFAULT value whenever the
INSERT statement doesn’t explicitly insert that column.
This method is a shortcut for defining a sequence and setting a
DEFAULT clause as above. With this method, you define a table as follows:
CREATE TABLE uses_serial ( id bigserial PRIMARY KEY, ... );
That is equivalent to the following:
CREATE TABLE uses_serial ( id bigint PRIMARY KEY, ... ); CREATE SEQUENCE uses_serial_id_seq OWNED BY uses_serial.id; ALTER TABLE uses_serial ALTER id SET DEFAULT nextval('uses_serial_id_seq');
OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.
serial will create an
integer column, while
bigserial will create a
Using identity columns
This is another way to use a sequence, because PostgreSQL uses sequences “behind the scenes” to implement identity columns.
CREATE TABLE uses_identity ( id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, ... );
There is also “
GENERATED BY DEFAULT AS IDENTITY”, which is the same, except that you won’t get an error message if you try to explicitly insert a value for the column (much like with a
DEFAULT clause). See below for more!
You can specify sequence options for identity columns:
CREATE TABLE uses_identity ( id bigint GENERATED ALWAYS AS IDENTITY (MINVALUE 0 START WITH 0 CACHE 20) PRIMARY KEY, ... );
BEFORE INSERT triggers
This is similar to
DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.
Should I use
bigserial) for my auto-generated primary key?
You should always use
integer occupies four bytes, while a
bigint needs eight. But:
- If you have a small table, where
integerwould suffice, the four wasted bytes won’t matter much. Also, not every table that you designed to be small will remain small!
- If you have a big table, you might exceed the maximum for
integer, which is 2147483647. Note that that could also happen if your table contains fewer rows than that: you might delete rows, and some sequence values can get “lost” by transactions that are rolled back.
Now it is quite complicated to change the primary key column from
bigintin a big table inside an active database without causing excessive down time, so you should save yourself that pain.
bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.
Should I use
bigserial or an identity column for my auto-generated primary key?
You should use an identity column, unless you have to support old PostgreSQL versions.
Identity columns were introduced in PostgreSQL v11, and they have two advantages over
- They comply with the SQL standard, while
bigserialis proprietary PostgreSQL syntax. This will make your code more portable.
- If you use
GENERATED ALWAYS AS IDENTITY, you will get an error message if you try to override the generated value by explicitly inserting a number. This avoids the common problem that manually entered values will conflict with generated values later on, causing surprising application errors.
So unless you have to support PostgreSQL v10 or below, there is no reason to use
Should I use
uuid for an auto-generated primary key?
My advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a “decentralized” fashion (outside a single database).
The advantages of
bigint are clear:
bigintuses only eight bytes, while
- fetching a value from a sequence is cheaper than calculating a UUID
One disadvantage of using a sequence is that it is a single object in a single database. So if you use sharding, where you distribute your data across several databases, you cannot use a sequence. In such a case, UUIDs are an obvious choice. (You could use sequences defined with an
INCREMENT greater than 1 and different
START values, but that might lead to problems when you add additional shards.)
Of course, if your primary key is not auto-generated by the database, but created in an application distributed across several application servers, you will also prefer UUIDs.
There are people that argue that UUIDs are better, because they spread the writes across different pages of the primary key index. That is supposed to reduce contention and lead to a more balanced or less fragmented index. The first is true, but that may actually be a disadvantage, because it requires the whole index to be cached for good performance. The second is definitely wrong, since B-tree indexes are always balanced. Also, a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could (but subsequent deletes will of course cause fragmentation). In short, any such advantages are either marginal or non-existent, and they are more than balanced by the fact that
uuid uses twice as much storage, which will make the index bigger, causing more writes and occupying more of your cache.
My co-worker Kaarel ran a small performance test a while ago and found that
uuid was slower than
bigint when it came to bigger joins.
I decided to run a small insert-only benchmark with these two tables:
CREATE UNLOGGED TABLE test_bigint ( id bigint GENERATED ALWAYS AS IDENTITY (CACHE 200) PRIMARY KEY ); CREATE UNLOGGED TABLE test_uuid ( id uuid DEFAULT gen_random_uuid() PRIMARY KEY );
I performed the benchmark on my laptop (SSD, 8 cores) with a
pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared
INSERT statements for five minutes:
INSERT INTO test_bigint /* or test_uuid */ DEFAULT VALUES;
|inserts per second||107090||74947|
|index growth per row||30.5 bytes||41.7 bytes|
bigint clearly wins, but the difference is not spectacular.
Numbers generated by a sequence and UUIDs are both useful as auto-generated primary keys.
Use identity columns unless you need to generate primary keys outside a single database, and make sure all your primary key columns are of type