UPDATED 14.05.2022: Sometimes customers ask me about the best choice for auto-generated primary keys. In this article, I'll explore the options and give recommendations.
Every table needs a primary key. In a relational database, it is important to be able to identify an individual table row. If you wonder why, search the internet for the thousands of questions asking for help with removing duplicate entries from a table.
You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.
Now, sometimes a table has a natural primary key, for example the social security number of a country's citizens. But typically, there is no such attribute, and you have to generate an artificial primary key. Some people even argue that you should use an artificial primary key even if there is a natural one, but I won't go into that “holy war”.
There are two basic techniques: sequences and UUIDs.
Sequences

A sequence is a database object whose sole purpose in life is to generate unique numbers. It does this using an internal counter that it increments.
Sequences are highly optimized for concurrent access, and they will never issue the same number twice. Still, accessing a sequence from many concurrent SQL statements could become a bottleneck, so there is the CACHE option that makes the sequence hand out several values at once to database sessions.
Sequences don't follow the normal transactional rules: if a transaction rolls back, the sequence does not reset its counter. This is required for good performance, and it does not constitute a problem. If you are looking for a way to generate a gapless sequence of numbers, a sequence is not the right choice, and you will have to resort to less efficient and more complicated techniques.
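A quick demonstration of this behavior (demo_seq is a throwaway name for illustration; nextval fetches the next value from a sequence, as described below):

CREATE SEQUENCE demo_seq;

BEGIN;
SELECT nextval('demo_seq');  -- returns 1
ROLLBACK;

SELECT nextval('demo_seq');  -- returns 2; the rolled-back value 1 leaves a gap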
To fetch the next value from a sequence, you use the nextval function like this:

SELECT nextval('sequence_name');
See the documentation for other functions to manipulate sequences.
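For example, besides nextval there are currval and setval (a quick sketch, not an exhaustive list):

-- the value that nextval last returned in this session:
SELECT currval('sequence_name');

-- reset the counter so that the next call to nextval returns 1001:
SELECT setval('sequence_name', 1000);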
UUIDs

A UUID (universally unique identifier) is a 128-bit number that is generated with an algorithm that effectively guarantees uniqueness. There are several standardized algorithms for that. In PostgreSQL, there are a number of functions that generate UUIDs:
- The uuid-ossp extension offers functions to generate UUIDs. Note that because of the hyphen in the name, you have to quote the name of the extension (CREATE EXTENSION "uuid-ossp";).
- From PostgreSQL v13 on, you can use the core function gen_random_uuid() to generate version-4 (random) UUIDs.

Note that you should always use the PostgreSQL data type uuid for UUIDs. Don't try to convert them to strings or numeric; you will waste space and lose performance.
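To illustrate both options (uuid_generate_v4() is one of several generator functions that uuid-ossp provides):

-- core function, available from PostgreSQL v13 on:
SELECT gen_random_uuid();

-- with the uuid-ossp extension:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
SELECT uuid_generate_v4();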
There are four ways to define a column with automatically generated values:
The DEFAULT clause

You can use this method with sequences and UUIDs. Here are some examples:
CREATE TABLE has_integer_pkey (
   id bigint DEFAULT nextval('integer_id_seq') PRIMARY KEY,
   ...
);

CREATE TABLE has_uuid_pkey (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
   ...
);
PostgreSQL uses the DEFAULT value whenever the INSERT statement doesn't explicitly insert that column.
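For example, assuming has_uuid_pkey also has a text column payload (the other columns are elided in the definition above):

-- "id" is not mentioned, so PostgreSQL fills it with gen_random_uuid():
INSERT INTO has_uuid_pkey (payload) VALUES ('hello');

-- an explicitly inserted value silently overrides the DEFAULT:
INSERT INTO has_uuid_pkey (id, payload)
VALUES ('00000000-0000-0000-0000-000000000001', 'explicit');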
The serial and bigserial pseudo-types

This method is a shortcut for defining a sequence and setting a DEFAULT clause as above. With this method, you define a table as follows:
CREATE TABLE uses_serial (
   id bigserial PRIMARY KEY,
   ...
);
That is equivalent to the following:
CREATE TABLE uses_serial (
   id bigint PRIMARY KEY,
   ...
);

CREATE SEQUENCE uses_serial_id_seq OWNED BY uses_serial.id;

ALTER TABLE uses_serial
   ALTER id SET DEFAULT nextval('uses_serial_id_seq');
The “OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.
Using serial will create an integer column, while bigserial will create a bigint column.
Identity columns

This is another way to use a sequence, because PostgreSQL uses sequences “behind the scenes” to implement identity columns.
CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   ...
);
There is also “GENERATED BY DEFAULT AS IDENTITY”, which is the same, except that you won't get an error message if you try to explicitly insert a value for the column (much like with a DEFAULT clause). See below for more!
You can specify sequence options for identity columns:
CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY (MINVALUE 0 START WITH 0 CACHE 20) PRIMARY KEY,
   ...
);
BEFORE INSERT triggers

This is similar to DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.
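A minimal sketch of the trigger approach (the article shows no trigger code; all names here are illustrative):

CREATE SEQUENCE uses_trigger_id_seq;

CREATE TABLE uses_trigger (
   id bigint PRIMARY KEY,
   payload text
);

CREATE FUNCTION set_pkey() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   -- unconditionally override whatever the user tried to insert
   NEW.id := nextval('uses_trigger_id_seq');
   RETURN NEW;
END;$$;

CREATE TRIGGER set_pkey BEFORE INSERT ON uses_trigger
   FOR EACH ROW EXECUTE FUNCTION set_pkey();

(EXECUTE FUNCTION requires v11 or better; older versions spell it EXECUTE PROCEDURE.)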
Should I use integer (serial) or bigint (bigserial) for my auto-generated primary key?

You should always use bigint.
True, an integer occupies four bytes, while a bigint needs eight. But:
- Even if an integer would suffice, the four wasted bytes won't matter much. Also, not every table that you designed to be small will remain small!
- In a big table, you might exceed the maximum value for integer, which is 2147483647. Note that that could also happen if your table contains fewer rows than that: you might delete rows, and some sequence values can get “lost” by transactions that are rolled back.
- It is painful to change a column from integer to bigint in a big table inside an active database without causing excessive down time, so you should save yourself that pain.

With bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.
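If you want to verify that calculation, here it is as a query:

SELECT 9223372036854775807::numeric / 10000 / (60 * 60 * 24 * 365.25) AS years;
-- ≈ 29.2 million years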
Should I use bigserial or an identity column for my auto-generated primary key?

You should use an identity column, unless you have to support old PostgreSQL versions.
Identity columns were introduced in PostgreSQL v10, and they have two advantages over bigserial:
- Identity columns conform to the SQL standard, while bigserial is proprietary PostgreSQL syntax. This will make your code more portable.
- With GENERATED ALWAYS AS IDENTITY, you will get an error message if you try to override the generated value by explicitly inserting a number. This avoids the common problem that manually entered values will conflict with generated values later on, causing surprising application errors (see the example below).

So unless you have to support PostgreSQL v9.6 or below, there is no reason to use bigserial.
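To see that difference in practice (the table name is mine):

CREATE TABLE demo_identity (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   payload text
);

-- this fails: PostgreSQL rejects explicit values for GENERATED ALWAYS columns
INSERT INTO demo_identity (id, payload) VALUES (42, 'boom');

-- if you really mean it, you have to say so explicitly:
INSERT INTO demo_identity (id, payload)
   OVERRIDING SYSTEM VALUE VALUES (42, 'boom');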
Should I use bigint or uuid for an auto-generated primary key?

My advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a “decentralized” fashion (outside a single database).
The advantages of bigint are clear: it uses only eight bytes, while uuid uses 16, and (as the benchmark below shows) it is faster to generate and index.

One disadvantage of using a sequence is that it is a single object in a single database. So if you use sharding, where you distribute your data across several databases, you cannot use a sequence. In such a case, UUIDs are an obvious choice. (You could use sequences defined with an INCREMENT greater than 1 and different START values, as sketched below, but that might lead to problems when you add additional shards.)
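Here is a sketch of that workaround for two shards (the sequence name is illustrative):

-- on shard 1:
CREATE SEQUENCE pkey_seq INCREMENT 2 START 1;  -- yields 1, 3, 5, ...

-- on shard 2:
CREATE SEQUENCE pkey_seq INCREMENT 2 START 2;  -- yields 2, 4, 6, ...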
Of course, if your primary key is not auto-generated by the database, but created in an application distributed across several application servers, you will also prefer UUIDs.
There are people who argue that UUIDs are better, because they spread the writes across different pages of the primary key index. That is supposed to reduce contention and lead to a more balanced or less fragmented index. The first is true, but it may actually be a disadvantage, because it requires the whole index to be cached for good performance. The second is definitely wrong, since B-tree indexes are always balanced. Also, a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could (but subsequent deletes will of course cause fragmentation). In short, any such advantages are either marginal or non-existent, and they are more than balanced by the fact that uuid uses twice as much storage, which will make the index bigger, causing more writes and occupying more of your cache.
People have argued (see for example the comments below) that sequence-generated primary keys can leak information, because they allow you to deduce the approximate order in which rows were inserted into a table. That is true, even though I personally find it hard to imagine a case where this is a real security problem. However, if that worries you, use UUIDs and stop worrying!
bigint
versus uuid
My co-worker Kaarel ran a small performance test a while ago and found that uuid
was slower than bigint
when it came to bigger joins.
I decided to run a small insert-only benchmark with these two tables:
CREATE UNLOGGED TABLE test_bigint (
   id bigint GENERATED ALWAYS AS IDENTITY (CACHE 200) PRIMARY KEY
);

CREATE UNLOGGED TABLE test_uuid (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY
);
I performed the benchmark on my laptop (SSD, 8 cores) with a pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared INSERT statements for five minutes:
INSERT INTO test_bigint /* or test_uuid */ DEFAULT VALUES;
                        bigint       uuid
inserts per second      107090       74947
index growth per row    30.5 bytes   41.7 bytes
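How “index growth per row” was measured is not shown above; one plausible way (assuming the default primary key index name test_bigint_pkey) would be:

SELECT pg_relation_size('test_bigint_pkey') / count(*)::numeric AS bytes_per_row
FROM test_bigint;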
Using bigint clearly wins, but the difference is not spectacular.
Conclusion

Numbers generated by a sequence and UUIDs are both useful as auto-generated primary keys.
Use identity columns unless you need to generate primary keys outside a single database, and make sure all your primary key columns are of type bigint.
If you are interested in reading more about primary keys, see also Hans' post on Primary Keys vs. Unique Constraints.
Comments

Why not include some more “well behaved” UUID generation methods rather than random? The Sequential UUIDs extension is quite useful and helps with index bloat / WAL generation: https://pgxn.org/dist/sequential_uuids
> Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.
1) Children are citizens at birth, but do not have SS #s at birth.
2) Back in the 1970s, a few duplicate SSNs were accidentally issued.
3) A person can request a new SSN under a few circumstances.
So social security numbers fail as natural keys on ALL requirements.
I've never really seen anyone argue that natural keys shouldn't be used when available, but rather that natural keys are nowhere near as prevalent as their proponents claim.
I won't argue that good natural primary keys are common, and I grant your points (except that I live in a country with a functional bureaucracy that doesn't issue duplicate social security numbers).
I would like to add one possible advantage of uuid, which might be a special case of the already stated advantage for distributed applications. Say you have a client mobile or web app and an API server. The app lets the user create a content item and then POSTs it to the server as JSON over REST. With uuid as the primary key, the key can be generated on the client side (in the app), so the client logic is simpler: there is no waiting for the item to come back in the response, and no replacing the client-generated item with the new one whose primary key (serial id) was filled in by the server. The benefit might add up when you consider the offline scenario for a mobile app, where items created offline have to be posted to the API server all at once when the app comes back online.
Good article, but the comics look very stupid.
My expertise is databases. I am not trying to make a career as an artist.
Laurenz,

Thanks for the great article. I appreciated the clear advice, comparisons and benchmarks.

Is there a recommended technique available in Postgres for when you want to use bigint as the primary key, but still have a “public” identifier that is not guessable (something that looks like a UUID or hash)?

I have created a module, goldflake, which can be used to generate random bigint values in Postgres, as well as in Golang if you need to generate the same in the application. It is based on Twitter snowflake.

https://github.com/AmreeshTyagi/goldflake
https://github.com/AmreeshTyagi/goldflake-pg
A goldflake id is not guessable (purely random), with no conflict for 174 years, even if you generate them on 8k different machines at the same time.

Feel free to check the code and raise a PR if you find any issue. Thanks.
Look at Sqids.
Good article Laurenz, convinced me to go with bigint for my database.
I have not yet understood the advantage of giving all tables UUID keys. However, this seems to be common practice by now. One table with a UUID may not be a problem yet. Maybe you should compare several tables with UUIDs and several JOINs against a solution with a “normal” ID (bigint), so that people who generate UUIDs understand what that does to the database and where the performance goes...
I have a customer using UUIDs running a batch job that deletes thousands of rows, does some calculations, and inserts thousands more rows. Usually the deletes are sub-second, but occasionally they balloon out to over 20 seconds, while a count(*) on the same table forced to use a full table scan takes around 16 seconds.

I wonder if there are any implications for the statistics collected by ANALYZE on a UUID key vs. a (big)int?

The WHERE clause for the deletion contains a date, a status, and an array of UUIDs.
UUIDs are a bit slower, but that won't explain what you observe. I'd turn on log_lock_waits and see if locks are involved.

I agree!
Security reviewers who approve proposals based on “security by obscurity” are shallow thinkers. I once, under pressure, had to obfuscate my Java classes (of an encryption program that uses an AES-like algorithm internally) just to satisfy the ego of a security reviewer who argued that the byte code could easily be decompiled to get the source code. Just 3 months down the line, a smarter reviewer got to my custom classloader class and found out what my obfuscation logic was. Busted! In fact, the real strength of my program lay NOT in the AES-like algorithm BUT in the length and randomness of the shared secret. Hence the fear that anyone could decompile the bytecode and get to the source code was baseless in this case.