Some weeks ago at pgDay Paris, during the evening social event, we got into a small “guestimation” with another Postgres enthusiast, about the precepted speed of using integer based ID-s vs UUID based ones for bigger amounts of data. In short he reasoned that one should generally avoid the UUID-s for performance reasons as they take 16 bytes instead of 4/8 bytes for normal integer and thus are slower. While it is true that they do occupy some more bytes on the disk (and are a bit rare in the wild as simpler applications just don’t need them) I was still sceptical about the direct translation of those bytes to speed difference. More specifically we talked about bigger joins (full table scans), as on indexed single row operations there shouldn’t be any difference anyways. In the end it was hard to say anything meaningful as I was just guessing – but now I finally got to testing it out with a simple test setup and am now laying out the numbers for you to judge also. In addition to UUID I also included int8 and numeric for completeness.
I created 4 pairs of tables with 5 million rows with same ID-s in both of the tables, that would result in a “complete” join. Then I had to make a difficult choice – as the discussion rose from real life situations, it would be pretty meaningless to have only one table consisting of the ID column. Thus, to make the test a bit more real, I decided to add 6 other columns (2 ints, 2 timestamps, 2 short texts) so that the extra 4 bytes wouldn’t have such a huge effect on the outcome. Then I launched a new test cluster with the latest Postgresql version 9.6.2, configured the shared_buffers=2GB, work_mem=1GB and maintenance_work_mem=1GB so that the join would be performed in the memory and also enabled the pg_stat_statements extension. Then I left the test script to run over night on my laptop (64-bit Ubuntu 14.04, 16 GB RAM, i5-6200U @ 2.30 GHz, SSD), doing
1) 1000 join iterations for every data type and
2) 100 index building iterations.
The scripts can be found here if you want to try it out yourself. Note that I’m using UNLOGGED tables, a feature introduced with Postgres 9.5, to reduce IO dependence of our test (and speed up inserts).
Tables ended up being around 364..403 MB in size and looked like that in the end:
krl@postgres=# \d int4_a Unlogged table "public.int4_a" Column │ Type │ Modifiers ────────┼──────────────────────────┼─────────── id │ integer │ col1 │ integer │ col2 │ integer │ col3 │ text │ col4 │ text │ col5 │ timestamp with time zone │ col6 │ timestamp with time zone │
And the test queries looked like that:
SELECT count(*) FROM datatypeX_a JOIN datatypeX_b USING(id)
|Datatype||JOIN mean time 5m rows (s)||% diff. to smallest||Index Size (MB)||% diff. to smallest|
|uuid||3.07||+13 %||150||+40 %|
Looking at the results table, we see that indeed there’s a penalty when starting to use any of the bigger data types. It makes kind of sense. But is this +13% growth, int4 vs UUID, little or much? I’d rather think it is small enough to be blissfully dismissed still, given that they benefit application architecture when using shards for example and also they provide “some” security as guessing or looping over them is not so easily possible. The thing that surprised me the most though was the performance of numeric-s. +34%! One should definitely watch out there and avoid them if possible – i.e. a very bad choice for normally incrementing integer ID-s.
As it was a pure syntethical test, in real life scenarios one has more processes fighting for resources, locking, also more bloat, most probably more columns in the tables, thus making waiting for disk access more relevant so that the real performance loss from processing those extra bytes spent on the ID column should be actually smaller.
By the way, by googling I also found a couple of articles on data type speed differences when doing a lot of INSERTs. There the penalty was similarly modest in most cases, so in general – nothing to be afraid of. So in the end I’d still recommend to choose ID column data types based on utility, not performance. And as always – one should always generate a lot of test data carefully within an application specific scenario before betting on a certain approach.