I decided to start out this year by looking into my notes from last year and extracting from them a small set of Postgres tips for you. This might not be, strictly speaking, fresh news… but I guess the average RAM of tech workers is about 64KB or so nowadays, so some repetition might not be such a bad thing.

Partial indexes

This is a huge one – and you’re really missing out if you’ve not heard of this feature or are not using it enough… as too often seems to be the case in my daily consulting work! Thus this was the first item on my list from last year. For larger amounts of data, partial indexes can literally be a life-saving trick when your back is already against the wall due to some unexpected increase in workload or disk usage. One of the reasons this feature is relatively unknown is that most other popular DBMS engines don’t have it at all, or they call it by another name.

The crux of the thing is super easy to remember – stop indexing the most common values! Since Postgres knows what your data looks like, it won’t use the index anyway whenever you search for a value that’s too common. As always, when you reduce indexes, you’re not just winning on storage – you also avoid the penalties incurred when inserting or modifying the data. So: with partial indexes, when a newly inserted or updated row contains the most common value, we don’t have to go and touch the index at all!

When does a value become “too common”? Sadly, there is no simple rule for this, as it depends on a couple of other config/data layout variables as well as the actual queries themselves – but roughly, starting at about 25-50% of the total rows, Postgres will tend to ignore the indexes.

Here is a small test table with 50 million rows to help you visualize the possible size benefits.


CREATE UNLOGGED TABLE t_event (
    id int GENERATED BY DEFAULT AS IDENTITY,
    state text NOT NULL,
    data jsonb NOT NULL
);


/* 1% NEW, 4% IN_PROCESSING, 95% FINISHED */
INSERT INTO t_event (state, data)
SELECT
    CASE WHEN random() < 0.95 THEN
        'FINISHED'
    ELSE
        CASE WHEN random() < 0.8 THEN
            'IN_PROCESSING'
        ELSE
            'NEW'
        END
    END,
    '{}'
FROM
    generate_series(1, 5 * 1e7);


postgres=# \dt+
                      List of relations
 Schema |  Name   | Type  |  Owner   |  Size   | Description 
--------+---------+-------+----------+---------+-------------
 public | t_event | table | postgres | 2489 MB | 
(1 row)

CREATE INDEX ON t_event (state);

CREATE INDEX ON t_event (state)
WHERE
    state <> 'FINISHED';
postgres=# \di+ t_event_state_*
                                List of relations
 Schema |        Name        | Type  |  Owner   |  Table  |  Size   | Description 
--------+--------------------+-------+----------+---------+---------+-------------
 public | t_event_state_idx  | index | postgres | t_event | 1502 MB | 
 public | t_event_state_idx1 | index | postgres | t_event | 71 MB   | 
(2 rows)
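
To convince yourself that the planner really does skip the index for the dominant value, a quick sanity check with EXPLAIN (a sketch only – the exact plans and cost numbers will of course depend on your system and settings):

/* the dominant value – the planner will typically pick a sequential scan here */
EXPLAIN SELECT * FROM t_event WHERE state = 'FINISHED';

/* a rare value – this should be able to use the small partial index */
EXPLAIN SELECT * FROM t_event WHERE state = 'NEW';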

Another tip on partial indexes (to finish off the topic) – another perfect set of candidates are usually any columns that are left mostly empty, i.e., NULL. The thing is that, unlike Oracle, PostgreSQL’s default B-tree indexes are all-encompassing, so all the NULLs are physically stored in the index!
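
A minimal sketch of that pattern, using a hypothetical “cancelled_on” column added to the table above (most rows would never get a value there):

/* hypothetical column for illustration – mostly NULL in practice */
ALTER TABLE t_event ADD COLUMN cancelled_on timestamptz;

/* index only the few rows that actually carry a value */
CREATE INDEX ON t_event (cancelled_on)
WHERE
    cancelled_on IS NOT NULL;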

Estimating average row/record size in bytes

Quite often, you will need to evaluate the approximate size of a number of rows of data for capacity planning reasons. There are a couple of approaches here. The simplest consists of looking at the table size and row count and then doing some elementary math. However, that can be too rough and inaccurate – most often due to bloat; another reason could be that some historical rows don’t have the recently added columns, or that Postgres’ built-in compression kicks in differently for different rows. The fastest way to get an estimate is the EXPLAIN command, which has the information embedded on the “cost” side – but do note that the estimate may rely on stale statistics, and it’s generally pretty inaccurate for “toasted” columns.
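
For reference, the “elementary math” variant mentioned above boils down to something like this (a rough sketch – note that pg_total_relation_size also counts indexes and TOAST, so pick the size function that matches what you actually want to measure):

/* total on-disk size divided by the row count gives a crude bytes-per-row figure
   (an exact count(*) can be slow – pg_class.reltuples is a cheaper estimate) */
SELECT pg_total_relation_size('pgbench_accounts') / count(*) AS approx_bytes_per_row
FROM pgbench_accounts;

The EXPLAIN variant then looks like this: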

explain select * from pgbench_accounts a;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Seq Scan on pgbench_accounts a  (cost=0.00..2650.66 rows=100366 width=97)
(1 row)

So what do you do if you want to estimate the storage size of some recent data accurately?

The function listed below, which answers the question raised above, is a welcome discovery for most people when they see it in action. It’s probably not widely known due to the somewhat “hidden” function name (naming things is one of the two most difficult problems in computer science 🙂 ), and due to the fact that we’re slightly abusing the function – it expects a column value, but we’re feeding it a whole row! Remember: in Postgres, every table automatically gets a matching composite (row) type covering all its columns, so a table alias can be treated as a single “scalar” value – unless it is, for example, expanded with “t.*”.


select avg(pg_column_size(a)) from pgbench_accounts a;
         avg          
----------------------
 120.9999700000000000
(1 row)

The advantage of this function is that we can exactly specify and inspect the data that we’re interested in. It also handles the disk storage side of things like compression and row headers!
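
And since it’s just SQL, narrowing the calculation down to exactly the slice of data you care about is simply a matter of adding a WHERE clause – the predicate below is of course made up for illustration:

/* hypothetical filter – only look at the most recently added account rows */
SELECT avg(pg_column_size(a))
FROM pgbench_accounts a
WHERE a.aid > 90000;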

Keep in mind that the size determined still does not usually map 1-to-1 to real-life disk usage for larger amounts of data due to good old companions like “bloat” and normal “defragmentation” or “air bubbles”. So if you want super-sharp estimates, there’s no way around generating some real-life test data. But do use a TEMP or UNLOGGED table if you do so; no need to create a huge WAL spike…especially if you happen to run a bunch of replicas.
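
If you do go down that road, a throwaway test could look roughly like the sketch below (table name, columns, and row count are made up – the point is UNLOGGED plus a reasonably representative payload):

/* UNLOGGED means no WAL traffic, so no surprise spike on the replicas */
CREATE UNLOGGED TABLE t_size_test (
    id int GENERATED BY DEFAULT AS IDENTITY,
    created_on timestamptz NOT NULL DEFAULT now(),
    payload jsonb NOT NULL
);

INSERT INTO t_size_test (payload)
SELECT '{"some": "representative", "test": "payload"}'
FROM generate_series(1, 1000000);

SELECT pg_size_pretty(pg_total_relation_size('t_size_test'));

DROP TABLE t_size_test;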

Avoiding risk on upgrading older Postgres instances to new major versions

A fun (or maybe sad) fact from the Postgres consulting trenches – most Postgres users are running some pretty darn old versions! One of many reasons behind that is commonly voiced as, “We cannot stop our business for so and so many minutes.” I say – fair enough, minutes are bad…but with Logical Replication (LR) we’re talking about seconds!!! It’s no magic, far from it – the built-in LR introduced in v10 couldn’t be any simpler! We’ve performed many such migrations. In the vast majority of cases, it worked smoothly – as intended! In some cases, there might be problems if the verification/switchover period is overly prolonged…
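
For context, the LR plumbing itself really is only a couple of statements – a bare-bones sketch with made-up names and connection details (the real procedure of course also includes copying the schema, syncing sequences, and planning the actual switchover):

/* on the old (v10+) primary, with wal_level = 'logical' */
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

/* on the freshly installed new major version instance,
   after the schema has been copied over (e.g. with pg_dump --schema-only) */
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary.example.com dbname=appdb user=replicator'
    PUBLICATION upgrade_pub;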

In short – with LR, there is really no reason to run some ancient (well, in “internet years” at least 🙂 ) versions. If my article got you curious and you want to learn more about this topic, I’d suggest starting here.

By the way, if you’re on some pre-v10 instance that is still at least v9.4, then logical upgrades are also possible via a 3rd-party extension called “pglogical”, which often worked quite well when I used it some years ago.

Detecting which index types are available for my column type

What if you’re unhappy with your indexing performance? Maybe there are some other more cool index types available for your column? Postgres has a bunch of those, actually…


postgres=# SELECT DISTINCT
    a.amname
FROM
    pg_type t
    JOIN pg_opclass o ON o.opcintype = t.oid
    JOIN pg_am a ON a.oid = o.opcmethod
WHERE
    t.typname = 'int4';

 amname 
--------
 btree
 hash
 brin
(3 rows)

And that’s not all – after declaring some “magic” extensions:

CREATE EXTENSION btree_gin;
CREATE EXTENSION btree_gist;

the picture changes to something like the one below. Quite a lot of stuff to choose from for our good old integer!

postgres=# SELECT DISTINCT
    a.amname
FROM
    pg_type t
    JOIN pg_opclass o ON o.opcintype = t.oid
    JOIN pg_am a ON a.oid = o.opcmethod
WHERE
    t.typname = 'int4';

 amname 
--------
 bloom
 brin
 btree
 gin
 gist
 hash
(6 rows)
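
Just to connect the dots – once the extensions are in place, actually trying out one of those access methods on an integer column is a one-liner. Using the t_event table from the first tip (whether any of these beat a plain B-tree is, of course, entirely workload-dependent):

/* BRIN works out of the box and stays tiny for naturally ordered columns like an identity */
CREATE INDEX ON t_event USING brin (id);

/* with btree_gin installed, GIN happily accepts a plain integer column, too */
CREATE INDEX ON t_event USING gin (id);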

By the way, if you’re thinking, “What the hell are all those weird index types? Why should I give them a chance?”, then I would recommend starting here and here.

Inspecting metadata at the speed of light with psql’s “\gdesc”

Since we are talking about data types…how do we determine what is actually delivered to us by the Postgres engine in tabular format? What type of data is in column XYZ for example?

The thing is that sometimes you get sent some above-average funky query, where the developers are struggling with “you might need to add explicit type casts” complaints from the server while converting the end results of a dozen sub-selects and transformations – so the original column and its data type have long since fallen into an abyss. Or maybe the ORM needs an exact data type specified for your query, but runtime metadata introspection would be a painful task in the programming language at hand.

So here’s a quick tip on how to employ a PostgreSQL DBA’s best friend “psql” (sorry, doggos) for that purpose:


postgres=# SELECT 'jambo' mambo, 1 a, random(), array[1,2];
 mambo | a |       random        | array 
-------+---+---------------------+-------
 jambo | 1 | 0.28512335791514687 | {1,2}
(1 row)

postgres=# \gdesc
 Column |       Type       
--------+------------------
 mambo  | text
 a      | integer
 random | double precision
 array  | integer[]
(4 rows)

This only works starting from version 11 of psql, though.

 

Thanks for reading, and hopefully this got you back on track to start learning some new Postgres stuff again in 2021! Check out last year’s blog post for more tips.