I decided to start out this year by looking into my notes from last year and extracting from them a small set of Postgres tips for you. This might not be, strictly speaking, fresh news... but I guess the average RAM of tech workers is about 64KB or so nowadays, so some repetition might not be such a bad thing.
Table of Contents
This is a huge one – and you're really missing out if you've not heard of this feature or are not using it enough... as too often seems to be the case in my daily consulting work! Thus this was the first item on my list from last year. For larger amounts of data, partial indexes can literally be a life-saving trick when your back is already against the wall due to some unexpected increase in workload or disk usage. One of the reasons this feature is relatively unknown is that most other popular DBMS engines don't have it at all, or they call it by another name.
The crux of the thing is super easy to remember – stop indexing the most common values! Since Postgres knows what your data looks like, it does not use the index whenever you search for a value that’s too common! As always, when you reduce indexes, you're not just winning on the storage - you can also avoid the penalties incurred when you're inserting or modifying the data. So: with partial indexes, when a new column contains the most common value, we don't have to go and touch the index at all!
When does a value become “too common”? Sadly, there is no simple rule for this, as it depends on a couple of other config/data layout variables as well as the actual queries themselves - but roughly, starting at about 25-50% of the total rows, Postgres will tend to ignore the indexes.
Here is a small test table with 100 million rows to help you visualize the possible size benefits.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
CREATE UNLOGGED TABLE t_event ( id int GENERATED BY DEFAULT AS IDENTITY, state text NOT NULL, data jsonb NOT NULL ); /* 1% NEW, 4% IN_PROCESSING, 95% FINISHED */ INSERT INTO t_event (state, data) SELECT CASE WHEN random() < 0.95 THEN 'FINISHED' ELSE CASE WHEN random() < 0.8 THEN 'IN_PROCESSING' ELSE 'NEW' END END, '{}' FROM generate_series(1, 5 * 1e7); postgres=# dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+----------+---------+------------- public | t_event | table | postgres | 2489 MB | (1 row) CREATE INDEX ON t_event (state); CREATE INDEX ON t_event (state) WHERE state <> 'FINISHED'; postgres=# di+ t_event_state_* List of relations Schema | Name | Type | Owner | Table | Size | Description --------+--------------------+-------+----------+---------+---------+------------- public | t_event_state_idx | index | postgres | t_event | 1502 MB | public | t_event_state_idx1 | index | postgres | t_event | 71 MB | (2 rows) |
Another tip on partial indexes (to finish off the topic) – usually, another perfect set of candidates are any columns that are left mostly empty, i.e., "NULL". The thing is that unlike Oracle, PostgreSQL has default all-encompassing indexes, so that all the NULL-s are physically stored in the index!
Quite often, you will need to evaluate the approximate size of a number of rows of data for capacity planning reasons. In these cases, there are a couple of approaches. The simplest ones consist of looking at the table size and row count and then doing some elementary math. However, the simplest approaches can be too rough and inaccurate – most often due to bloat; another reason could be that some historical data might not have some recently added columns (for example, due to some peculiarities of when Postgres built-in compression kicks in, etc.) The fastest way to get an estimate is to use the EXPLAIN command. EXPLAIN has the information embedded on the "cost" side - but do note that the estimate may use stale statistics, and it's generally pretty inaccurate on "toasted" columns.
1 2 3 4 5 |
explain select * from pgbench_accounts a; QUERY PLAN --------------------------------------------------------------------------- Seq Scan on pgbench_accounts a (cost=0.00..2650.66 rows=100366 width=97) (1 row) |
So what do you do if you want to estimate the storage size of some recent data accurately?
The function listed below, which answers the question raised above, is a welcome discovery for most people when they see it in action. Probably, not many people know about this due to the somewhat "hidden" function name (naming things is one of the two most difficult problems in computer science 🙂 ), and due to the fact that we're abusing the function - which is expecting a column - by feeding in a whole row! Remember: in Postgres all tables automatically get a virtual table type with all columns that is kind of a "scalar" - if they are not, for example, unpacked with "t.*".
1 2 3 4 5 |
select avg(pg_column_size(a)) from pgbench_accounts a; avg ---------------------- 120.9999700000000000 (1 row) |
The advantage of this function is that we can exactly specify and inspect the data that we're interested in. It also handles the disk storage side of things like compression and row headers!
Keep in mind that the size determined still does not usually map 1-to-1 to real-life disk usage for larger amounts of data due to good old companions like "bloat" and normal "defragmentation" or "air bubbles". So if you want super-sharp estimates, there's no way around generating some real-life test data. But do use a TEMP or UNLOGGED table if you do so; no need to create a huge WAL spike...especially if you happen to run a bunch of replicas.
A fun (or maybe sad) fact from the Postgres consulting trenches - most Postgres users are running some pretty darn old versions! One of many reasons behind that is commonly voiced as, “We cannot stop our business for so and so many minutes.” I say - fair enough, minutes are bad...but with Logical Replication (LR) we're talking about seconds!!! It’s no magic, far from it - the built-in LR introduced in v10 couldn't be any simpler! We’ve performed many such migrations. In the vast majority of cases, it worked smoothly - as intended! In some cases, there might be problems if the verification/switchover period is overly prolonged...
In short - with LR, there is really no reason to run some ancient (well, in "internet years" at least 🙂 ) versions. If my article got you curious and you want to learn more about this topic, I'd suggest starting here.
By the way, if you're on some pre v10 instance, but higher or equal to v9.4, then logical upgrades are also possible via a 3rd party plugin called "pglogical", which often worked quite well when I used it some years ago.
What if you’re unhappy with your indexing performance? Maybe there are some other more cool index types available for your column? Postgres has a bunch of those, actually...
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
postgres=# SELECT DISTINCT a.amname FROM pg_type t JOIN pg_opclass o ON o.opcintype = t.oid JOIN pg_am a ON a.oid = o.opcmethod WHERE t.typname = 'int4'; amname -------- btree hash brin (3 rows) |
And that's not all - after declaring some "magic" extensions:
1 2 |
CREATE EXTENSION btree_gin; CREATE EXTENSION btree_gist; |
the picture changes to something like that below. Quite a lot of stuff to choose from for our good old integer!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
postgres=# SELECT DISTINCT a.amname FROM pg_type t JOIN pg_opclass o ON o.opcintype = t.oid JOIN pg_am a ON a.oid = o.opcmethod WHERE t.typname = 'int4'; amname -------- bloom brin btree gin gist hash (6 rows) |
By the way, if you're thinking, “What the hell are all those weird index types? Why should I give them a chance?”, then I would recommend starting here and here.
Since we are talking about data types...how do we determine what is actually delivered to us by the Postgres engine in tabular format? What type of data is in column XYZ for example?
The thing is that sometimes you get some above-average funky query sent to you where developers are having difficulties with "you might need to add explicit type casts" complaints from the server, converting some end results of a dozen sub-selects and transformations, so that the original column and its data type have already fallen into an abyss. Or, maybe the ORM needs an exact data type specified for your query, but runtime metadata introspection might be a painful task in the programming language at hand.
So here’s a quick tip on how to employ a PostgreSQL DBA's best friend "psql" (sorry, doggos) for that purpose:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
postgres=# SELECT 'jambo' mambo, 1 a, random(), array[1,2]; mambo | a | random | array -------+---+---------------------+------- jambo | 1 | 0.28512335791514687 | {1,2} (1 row) postgres=# gdesc Column | Type --------+------------------ mambo | text a | integer random | double precision array | integer[] (4 rows) |
This only works starting from v11 of Postgres.
Thanks for reading, and hopefully it got you back on track to start learning some new Postgres stuff again in 2021! Check out last years blogpost for more tips.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Hi Kareel,
I want to thank you for your blog, there is always good information.
Partial index are nice but I am concerned about index bloat and fragmentation on columns that contains dynamic data (updated many times like a status).
Would an index on such kind of column may experience bloat and fragmentation issues?
I would like to know your advice.
Hey, thanks! Yes your thinking is valid - avoiding indexing on frequently updated columns is a good idea if possible. Normal, non-HOT updates, always contribute to index bloat in Postgres. To make it a bit better though and reduce page splits the default fillfactor for indexes is 90% by default.
Hi Kareel,
really nice article.
We have a PostgreSQL server but we are experiencing really slow performance only on reading operations.
What may be the problem?
That is way too general to answer (and it doesn't seem to be connected with the article).
We offer consulting though.