CYBERTEC PostgreSQL Logo

UUID, serial or identity columns for PostgreSQL auto-generated primary keys?

05.2021 / Category: / Tags: |
Autogenerated primary keys violate human rights!
© Laurenz Albe 2021

 

UPDATED 14.05.2022: Sometimes customers ask me about the best choice for auto-generated primary keys. In this article, I'll explore the options and give recommendations.

Why auto-generated primary keys?

Every table needs a primary key. In a relational database, it is important to be able to identify an individual table row. If you wonder why, search the internet for the thousands of questions asking for help with removing duplicate entries from a table.

You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.

Now, sometimes a table has a natural primary key, for example the social security number of a country's citizens. But typically, there is no such attribute, and you have to generate an artificial primary key. Some people even argue that you should use an artificial primary key even if there is a natural one, but I won't go into that “holy war”.

Techniques for auto-generated primary keys in PostgreSQL

There are two basic techniques:

Generating keys with a sequence

A sequence is a database object whose sole purpose in life is to generate unique numbers. It does this using an internal counter that it increments.

Sequences are highly optimized for concurrent access, and they will never issue the same number twice. Still, accessing a sequence from many concurrent SQL statements could become a bottleneck, so there is the CACHE option that makes the sequence hand out several values at once to database sessions.

Sequences don't follow the normal transactional rules: if a transaction rolls back, the sequence does not reset its counter. This is required for good performance, and it does not constitute a problem. If you are looking for a way to generate a gapless sequence of numbers, a sequence is not the right choice, and you will have to resort to less efficient and more complicated techniques.

To fetch the next value from a sequence you use the nextval function like this:

See the documentation for other functions to manipulate sequences.

Generating UUIDs

A UUID (universally unique identifier) is a 128-bit number that is generated with an algorithm that effectively guarantees uniqueness. There are several standardized algorithms for that. In PostgreSQL, there are a number of functions that generate UUIDs:

  • The uuid-ossp extension offers functions to generate UUIDs. Note that because of the hyphen in the name, you have to quote the name of the extension (CREATE EXTENSION "uuid-ossp";).
  • From PostgreSQL v13 on, you can use the core function gen_random_uuid() to generate version-4 (random) UUIDs.

Note that you should always use the PostgreSQL data type uuid for UUIDs. Don't try to convert them to strings or numeric — you will waste space and lose performance.

Defining auto-generated primary keys

There are four ways to define a column with automatically generated values:

Using the DEFAULT clause

You can use this method with sequences and UUIDs. Here are some examples:

PostgreSQL uses the DEFAULT value whenever the INSERT statement doesn't explicitly insert that column.

Using the serial and bigserial pseudo-types

This method is a shortcut for defining a sequence and setting a DEFAULT clause as above. With this method, you define a table as follows:

That is equivalent to the following:

The “OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.

Using serial will create an integer column, while bigserial will create a bigint column.

Using identity columns

This is another way to use a sequence, because PostgreSQL uses sequences “behind the scenes” to implement identity columns.

There is also “GENERATED BY DEFAULT AS IDENTITY”, which is the same, except that you won't get an error message if you try to explicitly insert a value for the column (much like with a DEFAULT clause). See below for more!

You can specify sequence options for identity columns:

Using BEFORE INSERT triggers

This is similar to DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.

Should I use integer(serial) or bigint(bigserial) for my auto-generated primary key?

You should always use bigint.

True, an integer occupies four bytes, while a bigint needs eight. But:

  • If you have a small table, where integer would suffice, the four wasted bytes won't matter much. Also, not every table that you designed to be small will remain small!
  • If you have a big table, you might exceed the maximum for integer, which is 2147483647. Note that that could also happen if your table contains fewer rows than that: you might delete rows, and some sequence values can get “lost” by transactions that are rolled back.
    Now it is quite complicated to change the primary key column from integer to bigint in a big table inside an active database without causing excessive down time, so you should save yourself that pain.

With bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.

Should I use bigserial or an identity column for my auto-generated primary key?

You should use an identity column, unless you have to support old PostgreSQL versions.

Identity columns were introduced in PostgreSQL v11, and they have two advantages over bigserial:

  • They comply with the SQL standard, while bigserial is proprietary PostgreSQL syntax. This will make your code more portable.
  • If you use GENERATED ALWAYS AS IDENTITY, you will get an error message if you try to override the generated value by explicitly inserting a number. This avoids the common problem that manually entered values will conflict with generated values later on, causing surprising application errors.

So unless you have to support PostgreSQL v10 or below, there is no reason to use bigserial.

Should I use bigint or uuid for an auto-generated primary key?

My advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a "decentralized" fashion (outside a single database).

Real differences

The advantages of bigint are clear:

  • bigint uses only eight bytes, while uuid uses 16
  • fetching a value from a sequence is cheaper than calculating a UUID

One disadvantage of using a sequence is that it is a single object in a single database. So if you use sharding, where you distribute your data across several databases, you cannot use a sequence. In such a case, UUIDs are an obvious choice. (You could use sequences defined with an INCREMENT greater than 1 and different START values, but that might lead to problems when you add additional shards.)

Of course, if your primary key is not auto-generated by the database, but created in an application distributed across several application servers, you will also prefer UUIDs.

Imaginary differences

There are people that argue that UUIDs are better, because they spread the writes across different pages of the primary key index. That is supposed to reduce contention and lead to a more balanced or less fragmented index. The first is true, but that may actually be a disadvantage, because it requires the whole index to be cached for good performance. The second is definitely wrong, since B-tree indexes are always balanced. Also, a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could (but subsequent deletes will of course cause fragmentation). In short, any such advantages are either marginal or non-existent, and they are more than balanced by the fact that uuid uses twice as much storage, which will make the index bigger, causing more writes and occupying more of your cache.

Security considerations

People have argued (see for example the comments below) that sequence-generated primary keys can leak information, because they allow you to deduce the approximate order on which rows were inserted into a table. That is true, even though I personally find it hard to imagine a case where this is a real security problem. However, if that worries you, use UUIDs and stop worrying!

Benchmarking bigint versus uuid

My co-worker Kaarel ran a small performance test a while ago and found that uuid was slower than bigint when it came to bigger joins.

I decided to run a small insert-only benchmark with these two tables:

I performed the benchmark on my laptop (SSD, 8 cores) with a pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared INSERT statements for five minutes:

Performance comparison bigint versus uuid autogenerated primary keys
bigint uuid
inserts per second 107090 74947
index growth per row 30.5 bytes 41.7 bytes

Using bigint clearly wins, but the difference is not spectacular.

Conclusion

Numbers generated by a sequence and UUIDs are both useful as auto-generated primary keys.

Use identity columns unless you need to generate primary keys outside a single database, and make sure all your primary key columns are of type bigint.

If you are interested in reading more about primary keys, see also Hans' post on Primary Keys vs. Unique Constraints.

In other news, in PostgreSQL 15 default permissions for the public schema were modified, which can cause errors.
Find out more about this important change here.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on TwitterFacebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
15 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Georg Klimm
1 year ago

I have not yet understood the advantage of loading all tables with UUIDs. However, this seems to be common practice by now.
One table with UUID may not be the problem yet. Maybe you should compare several with UUID and several JOINS against a solution with a "normal" ID (bigint), so that the UUID generation understands what it is doing to the database and where the performance is ...

Joseph
Joseph
2 years ago

Good article Laurenz, convinced me to go with bigint for my database.

Jimbo
Jimbo
2 years ago

Laurenz,
Thanks for the great article. I appreciated the clear advice, comparisons and benchmarks.

Is there a recommended technique available in Postgres for when you want to use bigint as the primary key but still have a 'public' identifier that is not guessable (something that looks like a UUID or hash)?

Amreesh
Amreesh
2 years ago
Reply to  Jimbo

I have created a module goldflake which can be used to generate random bigint in postgres as well as in golang if you need to generate the same in application. It is based on Twitter snowflake.
https://github.com/AmreeshTyagi/goldflake-pg

Amreesh
Amreesh
2 years ago
Reply to  Jimbo

I have created a module goldflake which can be used to generate random bigint in postgres as well as in golang if you need to generate the same in application. It is based on Twitter snowflake.
https://github.com/AmreeshTyagi/goldflake https://github.com/AmreeshTyagi/goldflake-pg

And goldflake id is not guessable pure random, no conflict for 174 years even if you generate them on 8k different machines at same time.

Feel free to check code & raise PR if you find any issue. Thanks.

fr0bar
fr0bar
1 month ago
Reply to  Jimbo

Look at squids.

madovsky
madovsky
2 years ago

good article, but the comics looks very stupid.

laurenz
laurenz
2 years ago
Reply to  madovsky

My expertise is databases.
I am not trying to make a career as an artist..

Michael Kariv
Michael Kariv
2 years ago

I would like to add one possible advantage of uuid, which might be a private case of already stated advantage of a distributed application. Say you have a client mobile or web app and an API server. The app would let user generate a content item. Then it would REST JSON POST it to the server. With uuid as a primary key, it could have been generated on the client side (the app), so that the client logic is simpler - no waiting for the item back in response, and replacing the one that was client-generated already with the new one that has primary key (serial id) filled in by the server. The performance might add up, taking into consideration the offline scenario for a mobile app, where items created in an offline mode will have to be posted to the API server all at once once the app is back online.

sribe
sribe
3 years ago

> Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.

1) Children are citizens at birth, but do not have SS #s at birth.

2) Back in the 1970s, a few duplicate SSNs were accidentally issued.

3) A person can request a new SSN under a few circumstances.

So social security numbers fail as natural keys on ALL requirements.

I've never really seen anyone argue that natural keys shouldn't be used when available, but rather that natural keys are nowhere near as prevalent as their proponents claim.

laurenz
laurenz
3 years ago
Reply to  sribe

I won't argue that good natural primary keys are common, and I grant your points (except that I live in a country with a functional bureaucracy that doesn't issue duplicate social security numbers).

Adam Brusselback
Adam Brusselback
3 years ago

Why not include some more "well behaved" UUID generation methods rather than random? The Sequential UUIDs extension is quite useful and helps with index bloat / WAL generation: https://pgxn.org/dist/sequential_uuids

Colin 't Hart
1 year ago

I have a customer using UUIDs running a batch job that deletes thousands of rows, does some calculations, and inserts thousands more rows. Usually the deletes are sub-second, but occasionally they balloon out over 20 seconds -- while a count(*) on the same table forced to use a full table scan takes around 16 seconds.

I wonder if there any implications for the statistics collected by analyze on a UUID key vs a (big)int?

The where clause for the deletion contains a date, status, and an array of UUIDs.

laurenz
laurenz
1 year ago
Reply to  Colin 't Hart

UUIDs are a bit slower, but that won't explain what you observe. I'd turn on log_lock_waits and see if locks are involved.

Prasad Nair
Prasad Nair
10 months ago

I agree!

Security reviewers who approve proposals based on "security by obscurity" are shallow thinkers. I once under pressure had to obfuscate my java classes(of an encryption program that uses AES like algo internally) just to satisfy the ego of a security reviewer who argued the byte code can be easily decompiled to get the source code. Just 3 months down the line a smarter reviewer got to my custom classloader class to find what my obfuscation logic was.. Busted! In fact, the real strength of my program lied NOT in the AES like algo BUT in the length and randomness of the shared secret. Hence the fear that anyone would decompile the bytecode and get to the source code in this case was baseless.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram