When designing tables, people usually don't give much thought to the order of columns. Why should they? Columns can be reordered inside a SELECT at will. Well, you should care, because the order of columns can have a significant impact on the size of your table. That's right: the size of the table can depend on the order of columns, even if the same data is stored.
Why is that so? Internally, there is something called CPU alignment which governs the internal size of a data structure. By cleverly aligning and ordering columns, there is some potential to optimize things. You don't believe it? Let us give it a try:
test=# CREATE TABLE t_test (
    i1 int,
    i2 int,
    i3 int,
    v1 varchar(100),
    v2 varchar(100),
    v3 varchar(100)
);
CREATE TABLE
In this example there are six columns: three integer columns grouped together, followed by three varchar columns grouped together. Let us add 10 million rows to the table:
test=# INSERT INTO t_test
           SELECT 10, 20, 30, 'abcd', 'abcd', 'abcd'
           FROM generate_series(1, 10000000);
INSERT 0 10000000
The overall size of the table is 574 MB:
test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 574 MB
(1 row)
Let us try to change the layout of those columns. In the next example there is a varchar field followed by an integer column. This layout is repeated three times:
test=# CREATE TABLE t_test (
    v1 varchar(100),
    i1 int,
    v2 varchar(100),
    i2 int,
    v3 varchar(100),
    i3 int
);
CREATE TABLE
If 10 million rows are added now …
test=# INSERT INTO t_test
           SELECT 'abcd', 10, 'abcd', 20, 'abcd', 30
           FROM generate_series(1, 10000000);
INSERT 0 10000000

… the table will be a lot larger:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 651 MB
(1 row)
The data in the table is identical; it was merely chosen in a way that demonstrates the effect. Had I used "abc" instead of "abcd", there would have been no difference in size: a short varlena value holding "abc" occupies 4 bytes (1-byte header plus 3 characters), so the following integer column starts on a 4-byte boundary without padding, while the 5 bytes needed for "abcd" force 3 padding bytes before each integer.
The important conclusion here is that it definitely makes sense to pack similar data types together. I have also found it useful to put fixed-width columns such as integers at the beginning of a table. In many cases this can yield some extra percentage points of speed, simply because rows are a bit smaller than they would be without this little rule.
Get some great hints and tips about indexes in my post, "Find and Fix a Missing PostgreSQL Index".
UPDATE September 2023: This tool is now called pg_resetwal; see the PostgreSQL documentation about it here. The blog below will be preserved for reference. See this blog by Laurenz Albe about corrupting databases to read more about pg_resetwal.
PostgreSQL has proven to be one of the most reliable Open Source databases around. In the past we have dealt with customers who have achieved superior uptimes and who are really happy with PostgreSQL. But even if PostgreSQL does its job day after day after day, there are components which can simply fail and cause trouble. In short: hardware may crash, filesystems can fail, and so on. In 99% of all cases there is a way around total disaster: either you have a replica which protects you from downtime, or you have a reasonably good backup to help out. But what if there is really no option left? You have to make this database work again, no matter what (even if it is just a partial or unreliable restore). pg_resetxlog may be your last helping hand in this case.
Killing an instance
As you might know the purpose of the transaction log (= xlog or WAL) is to restore order and consistency in case of a crash. If the xlog is broken, you might face disaster. This is exactly what we want 😉
So, to cause some disaster I sent a “kill -9” to a database system under load to stop it. In case of a restart this would lead to a recovery process and PostgreSQL would fix things on startup by replaying the transaction log. Let us disturb the replay process a little ...
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007C bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.0280045 s, 599 MB/s
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007D bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.0304789 s, 550 MB/s
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007F bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.028223 s, 594 MB/s
What I have just done is really nasty. I have overwritten some transaction log files found in the pg_xlog directory with zeros. This should instantly make sure that the recovery process ends up in hell. Let us check whether my attempt to really kill things has succeeded:
[hs@jacqueline test_db]$ pg_ctl -D /data start
server starting
LOG:  database system was shut down at 2014-08-08 12:17:00 CEST
LOG:  invalid magic number 0000 in log segment 00000001000000050000007B, offset 0
LOG:  invalid primary checkpoint record
LOG:  invalid magic number 0000 in log segment 00000001000000050000007B, offset 0
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 18807) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
Wow, this is exactly what I have intended to do. Nothing works anymore so it is time to fix things and demonstrate the power of pg_resetxlog.
pg_resetxlog at work
pg_resetxlog is a C program shipped with PostgreSQL. So, there is no need to install it on top of PostgreSQL. It is simply there and ready for action in case it is needed. Actually it is very powerful as the following help page shows:
[hs@jacqueline ~]$ pg_resetxlog --help
pg_resetxlog resets the PostgreSQL transaction log.

Usage:
  pg_resetxlog [OPTION]... DATADIR

Options:
  -e XIDEPOCH      set next transaction ID epoch
  -f               force update to be done
  -l XLOGFILE      force minimum WAL starting location for new transaction log
  -m MXID,MXID     set next and oldest multitransaction ID
  -n               no update, just show what would be done (for testing)
  -o OID           set next OID
  -O OFFSET        set next multitransaction offset
  -V, --version    output version information, then exit
  -x XID           set next transaction ID
  -?, --help       show this help, then exit
It is possible to nicely configure pg_resetxlog to make it do precisely what is needed (maybe by just setting xlog back a little). However, in this example we don't care too much - we just want to reset the xlog completely:
[hs@jacqueline test_db]$ pg_resetxlog /data
Transaction log reset
One line is enough to reset the xlog of PostgreSQL.
Let us see if this has worked:
[hs@jacqueline test_db]$ pg_ctl -D . start
server starting
LOG:  database system was shut down at 2014-08-08 12:18:51 CEST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
Voila, the database was able to start up again. The autovacuum daemon has been launched nicely and the system is able to accept connections. A simple test reveals that all databases seem to be around:
[hs@jacqueline test_db]$ psql -l
                             List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/hs            +
           |       |          |             |             | hs=CTc/hs
 template1 | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/hs            +
           |       |          |             |             | hs=CTc/hs
 test      | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
(4 rows)
So, where is the problem?
Be cautious
pg_resetxlog is really the last line of defence. If you have to call it, you should expect some data to be lost. Maybe you will face slightly broken tables or corrupted indexes. You should not really trust this database anymore; however, you will still have the chance to extract some data and restore at least something.
In any case: if you are forced to use pg_resetxlog, I strongly advise you not to continue working with the instance in question. Take a backup, check the data PROPERLY, and start over with a new database instance.
With PostgreSQL 9.4 just around the corner and 8.4 just retired, I thought it was time to review those years of development to see how the size of the PostgreSQL code base has evolved over time. In the past couple of years (I can even say "decades" by now) a lot of development has happened, and all those changes had a significant impact on the sheer volume of the code.
Out of plain curiosity I decided to start a little analysis regarding the code size of the PostgreSQL project over time.
To start the analysis I decided to count the lines of plain C code, all header files as well as grammar files (.y). Here is the shell command used on all versions:
find . -type f -name "*.[chy]" | xargs -I {} cat {} | wc -l
The results are pretty interesting:
9.3.5: 1190501
9.2.9: 1146762
9.1.14: 1091400
9.0.18: 1018774
8.4.22: 954521
The interesting thing to see is that the size of the code has grown fairly consistently over time. From one major release to the next, the source code grew by roughly 4% to 7% (for example, about 6.7% from 8.4 to 9.0 and about 3.8% from 9.2 to 9.3), a rate which has been more or less stable over the years.
Historically ...
Just out of curiosity: On 29th of January 1997 PostgreSQL 6.0 was released. A long long time ago. Guess what the size of the code was back then.
The answer: 185437 lines.
PostgreSQL has come a long way since then.
UPDATED JULY 19, 2023 - When I do training here at CYBERTEC, people often ask how to load data fast and efficiently. There are many ways to achieve this. One way not too many people know about is COPY (FREEZE). It makes sure that PostgreSQL can organize data nicely straight away by instantly writing hint bits (as explained in my previous post).
To demonstrate COPY (FREEZE), first generate some data:
$ cat numbers.pl
#!/usr/bin/perl

for ($count = 1; $count <= 500000000; $count++)
{
    print "$count\t$count\n";
}
We will load the data into this table:
CREATE TABLE t_test (
    a integer,
    b integer
);
COPY
To import the data on my test system (single SATA disk in a Linux box) I use a normal COPY:
test=# COPY t_test FROM '/data/numbers.txt';
COPY 500000000
Time: 627509.061 ms
As you can see, it takes around 10 minutes to do so.
COPY (FREEZE)
To speed things up, let's try COPY (FREEZE):
test=# COPY t_test FROM '/data/numbers.txt' (FREEZE);
ERROR:  cannot perform COPY FREEZE because the table was not created or truncated in the current subtransaction
The important thing here is that this command can only operate inside the transaction that created or truncated the table. Otherwise, it doesn't work (I'll explain the reason later). So, let's try the following:
test=# BEGIN;
BEGIN
test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 2/A7D6310
(1 row)

test=*# TRUNCATE t_test;
TRUNCATE
test=*# \timing
Timing is on.
test=*# COPY t_test FROM '/data/numbers.txt' FREEZE;
COPY 500000000
Time: 304082.907 ms
test=*# \timing
Timing is off.
test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 2/A7DC000
(1 row)

test=*# COMMIT;
COMMIT
The execution time goes down dramatically!
COPY (FREEZE)
The main reason it's now more efficient is that PostgreSQL did not have to write transaction log data. That's because I configured wal_level = minimal. With this setting, you can't use WAL archiving or replication, and PostgreSQL only logs enough data for crash recovery. Because of that, PostgreSQL does not need to write WAL for data modifications that take place in the transaction that created or truncated the table. Skipping WAL writes speeds up data loading considerably! Changing wal_level requires a restart of the database, but it may be worth the pain if you need to load data quickly.
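For reference, this is roughly what the relevant settings look like in postgresql.conf; treat it as a sketch and check the values against your PostgreSQL version:

```
# postgresql.conf - log only what is needed for crash recovery
wal_level = minimal
archive_mode = off      # WAL archiving is incompatible with wal_level = minimal
max_wal_senders = 0     # must be 0 when wal_level = minimal
```

Remember that these settings rule out WAL archiving and streaming replication, so they only make sense on systems where fast bulk loading matters more than a standby.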
But there's also another gain: The second major benefit will be visible when the data is read for the first time after the import. As already shown in my previous posts, PostgreSQL has to set hint bits during the first reading operation. This translates to a large number of writes later on down the road.
COPY (FREEZE) has been made to fix exactly this kind of problem - the data are already loaded in the frozen state, and there is no need to write hint bits any more. Frozen rows are unconditionally visible, and that is the reason why the table had to be created or truncated in the same transaction: otherwise, concurrent readers could see the rows before the transaction commits, which would violate transaction isolation.
Analytical applications can benefit greatly from the COPY (FREEZE) feature!
I’ve recently seen some really broad tables (hundreds of columns) in a somewhat inefficiently structured database. Our PostgreSQL support customer complained about strange runtime behavior which could not be easily explained. To help other PostgreSQL users in this same situation, I decided to reveal the secrets of a fairly common performance problem many people don’t understand: Column order and column access.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
Post UPDATED June 2023: PostgreSQL is a highly sophisticated relational database system capable of performing wonderful operations. But sophistication also means that there is a bit of complexity under the surface, which is not always well understood by users. One thing people usually don't know about is hint bits.
Actually, they are an internal optimization which makes sure that visibility checks are sped up considerably. As you might know, PostgreSQL traditionally had to decide for each row whether it was visible to a user or not. The best way to speed up the process is to avoid that - and that is why hint bits were invented.
When PostgreSQL finds a row that is visible to all transactions, it sets a bit in the row header indicating this fact. Next time someone finds this row, they can avoid comparing the row visibility information to their snapshot, because the hint bit already tells them that it must be visible.
To demonstrate how things work, let's create some demo data:
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
A simple Perl script can create the data nicely. In this example, half a billion rows are used (big enough to demonstrate the point and yet small enough to do the test in reasonable time):
[hs@jacqueline blog]$ cat numbers.pl
#!/usr/bin/perl

for ($count = 1; $count <= 500000000; $count++)
{
    print "$count\t$count\n";
}
Autovacuum is disabled on the table so that it cannot interfere with the measurements:

test=# ALTER TABLE t_test SET (autovacuum_enabled = off);
ALTER TABLE
Then data is loaded into PostgreSQL. The easiest way to do that is to use “COPY … FROM PROGRAM” and to simply read things in through a pipe. It works like this:
test=# \timing
Timing is on.
test=# COPY t_test FROM PROGRAM '/path/numbers.pl';
COPY 500000000
Time: 612716.438 ms
On my test hardware (some Intel i5 and a 750 GB SATA disk) it takes around 10 minutes to load the data. Note that default PostgreSQL parameters were used (no checkpoint optimizations). "top" leaves us with output as shown in the next listing:
  PID USER   PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
27570 hs     20   0  257m  24m   23m R 48.6  0.2  0:10.33  postgres
27574 hs     20   0  119m 1584  1252 R 31.6  0.0  0:06.66  numbers.pl
27356 hs     20   0  256m  16m   16m S  1.3  0.1  0:00.18  postgres
You can clearly see that generating numbers actually causes a decent share of the CPU load. Actually, this is good - because it means that PostgreSQL is cheap, relative to the data creation part.
Up until now, nothing special could be observed. Let's try to run a SELECT:
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 359669.393 ms
The interesting part here is not the result of the query. The really interesting part is hidden nicely below the surface.
[hs@jacqueline blog]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff    cache  si  so     bi     bo    in    cs us sy id wa st
 1  0    164 4968092  46412 10534028   0   0   3486   1903     3    13  1  0 91  7  0
 0  0    164 4967960  46412 10534144   0   0      0      0    78   225  0  0 100 0  0
 0  0    164 4967960  46420 10534144   0   0      0      8    72   203  0  0 99  1  0
 0  0    164 4967960  46420 10534144   0   0      0      0   103   301  0  0 100 0  0
 0  1    164 4946260  46420 10555520   0   0  10586     20   246   437  1  0 97  2  0
 1  1    164 4732608  46420 10763488   0   0 104000      0  1403  2046 10  2 76 12  0
 0  1    164 4517592  46420 10972988   0   0 104704      0  1415  2042 10  3 76 11  0
 1  0    164 4301832  46420 11183036   0   0 105024      0  1410  2038 10  3 75 12  0
 1  0    164 4082228  46420 11396740   0   0 106880      4  1449  2048 11  3 75 11  0
 0  1    164 3868576  46420 11604996   0   0 104064      0  1414  2050 10  3 75 12  0
 1  0    164 3650460  46420 11817164   0   0 106176      0  1415  2058 11  2 75 12  0
 0  1    164 3433460  46420 12028640   0   0 105664      0  1419  2078 10  2 75 12  0
 1  1    164 3262588  46420 12194508   0   0  82944  45356  1199  2209  8  3 66 23  0
 1  1    164 3143176  46420 12310644   0   0  58112  43818   949  1738  6  2 75 18  0
 1  1    164 3018680  46420 12432100   0   0  60736  36900   940  1760  6  2 75 17  0
 1  2    164 2897284  46420 12550404   0   0  59136  29726   917  1727  6  2 75 16  0
 0  2    164 2775020  46420 12669348   0   0  59392  36642   912  1746  6  2 76 17  0
 0  2    164 2656848  46420 12784480   0   0  57600  36388   883  1687  6  2 75 17  0
 1  1    164 2537560  46420 12900636   0   0  58048  36644   937  1755  5  2 73 19  0
“bi” (= blocks in) and “bo” (= blocks out) are the two columns of interest here.
We start to read at roughly 100 MB / second, a reasonable value for a desktop PC featuring a single SATA drive. But then things start to go south: at some point, PostgreSQL starts to write data out to disk. Why is that? When a row is loaded, PostgreSQL has no idea whether it is visible to other transactions without checking its visibility all over again. Therefore PostgreSQL marks a row as "visible to everyone" when it is read for the first time. This is exactly where the I/O comes from: the hint bits being written as rows are marked. The advantage is that all subsequent visibility checks are sped up tremendously.
Running the query a second time shows a different picture:

[hs@jacqueline blog]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd   free  buff     cache  si  so     bi  bo    in    cs us sy id wa st
 0  1   164 191076  1688  15325392   0   0 104256   0  1401  2077  8  2 75 15  0
 0  1   164 183140  1688  15334056   0   0 105792   0  1394  2113  8  2 75 15  0
 0  1   164 187728  1688  15329596   0   0 106880   0  1429  2121  9  2 75 15  0
 0  1   164 190828  1688  15327804   0   0 105472   0  1413  2124  8  2 75 15  0
 2  0   164 184008  1688  15335792   0   0 104640   0  1389  2108  8  2 75 15  0
 1  0   164 188348  1688  15336880   0   0 106240   0  1466  2107  9  2 75 14  0
 1  0   164 191944  1692  15337824   0   0 100226   4  1372  2060  8  2 75 15  0
As you can see, there is no more writing going on - just a clean 100+ MB / second. Exactly what we want to see.
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 110263.714 ms
Wow, the time needed to perform this one has gone down dramatically! And more importantly, the execution time is pretty stable:
Time: 110382.362 ms
The theory behind it is that reads happen more frequently than writes, and therefore investing in hint bits makes an awful lot of sense, because it seriously improves run-times.
But there is a bit more. What if we VACUUM the table?
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd   free  buff     cache  si  so    bi     bo    in    cs us sy id wa st
 1  1   164 161936   616  15427528   0   0 50718    356   845  1262  4  2 79 15  0
 0  1   164 168136   616  15421384   0   0 87680    636  1267  2517  6  4 75 15  0
 0  1   164 169624   616  15419204   0   0 88704    624  1254  2529  6  3 75 15  0
 0  1   164 179296   616  15409532   0   0 90304    560  1279  2435  6  4 75 15  0
 0  1   164 185868   616  15402548   0   0 89216    628  1319  1898  7  4 74 15  0
 1  1   164 178924   616  15409580   0   0 88832    628  1295  1960  7  3 75 15  0
 0  1   164 183636   616  15404972   0   0 80166    624  1196  1802  6  3 75 16  0
 0  1   164 186240   616  15402652   0   0 91648    568  1354  1951  7  4 75 15  0
 0  2   164 180660   624  15408296   0   0 89472    670  1307  1982  6  3 75 16  0
 1  1   164 170120   624  15418500   0   0 72260  49908  1080  2035  5  3 70 22  0
 0  2   164 164788   624  15423656   0   0 67072  22782   995  1955  4  2 70 23  0
 0  2   164 168012   624  15420556   0   0 61888  22492   984  1928  4  2 73 20  0
 0  2   164 164044   624  15424496   0   0 60672  29582   959  1887  4  2 73 20  0
 0  2   164 172352   624  15416452   0   0 67712  22524  1053  2097  5  3 72 20  0
Again, some writing is going on. We are seriously bound by the disk here, as the following couple of lines taken from “top” indicate:
  PID USER   PR  NI   VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27570 hs     20   0   257m  30m  29m R 26.3  0.2  9:34.80  postgres
   59 root   20   0      0    0    0 S  1.7  0.0 18:00.13  kswapd0
   36 root   20   0      0    0    0 S  1.0  0.0 10:42.29  kblockd/2
27245 root   20   0      0    0    0 D  0.3  0.0  0:03.53  flush-8:16
30861 root   20   0  15036 1196  836 R  0.3  0.0  0:03.75  top
test=# VACUUM t_test;
VACUUM
Time: 363773.532 ms
The beauty now is that we not only know whether a single row is visible: VACUUM can also record, in the visibility map, whether an entire block is visible.
Again, we see an improvement in speed - but this time, it's not very big (just two seconds):
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 108715.332 ms
The main benefit really comes from those bits attached to a single line - and this benefit is massive.
If you liked this article, check out the next article in this series: Speed up PostgreSQL data loading with COPY (FREEZE)
For further information on how to speed up performance in PostgreSQL, see this post about CLUSTER and HOT updates, or see Laurenz Albe's first article on HOT updates.