When designing tables, people usually don't give much thought to the order of columns. Why should they? Columns can be reordered inside a SELECT at will. Well, you should care, because the order of columns can have a significant impact on the size of your table. That's right: the size of the table can depend on the order of columns, even if the same data is stored.
Why is that so? Internally, there is something called CPU alignment which governs the internal size of a data structure. By cleverly aligning and ordering columns, there is some potential to optimize things. You don't believe it? Let us give it a try:
test=# CREATE TABLE t_test (
    i1 int,
    i2 int,
    i3 int,
    v1 varchar(100),
    v2 varchar(100),
    v3 varchar(100)
);
CREATE TABLE
In this example there are six columns: three integer columns grouped together, followed by three varchar columns grouped together. Let us add 10 million rows to the table:
test=# INSERT INTO t_test
           SELECT 10, 20, 30, 'abcd', 'abcd', 'abcd'
           FROM generate_series(1, 10000000);
INSERT 0 10000000
The overall size of the table is 574 MB:
test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 574 MB
(1 row)
Let us try to change the layout of those columns. In the next example there is a varchar field followed by an integer column. This layout is repeated three times:
test=# CREATE TABLE t_test (
    v1 varchar(100),
    i1 int,
    v2 varchar(100),
    i2 int,
    v3 varchar(100),
    i3 int
);
CREATE TABLE
If 10 million rows are added now …
test=# INSERT INTO t_test
           SELECT 'abcd', 10, 'abcd', 20, 'abcd', 30
           FROM generate_series(1, 10000000);
INSERT 0 10000000

… the table will be a lot larger:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 651 MB
(1 row)
The data in the table is identical; it was merely chosen in a way that demonstrates the effect. Had I used "abc" instead of "abcd", there would have been no difference in size: a short varlena value holding "abc" occupies 4 bytes (1-byte header plus 3 characters), so the following integer column starts on a 4-byte boundary without padding, while the 5 bytes needed for "abcd" force 3 padding bytes before each integer.
The important conclusion here is that it definitely makes sense to pack similar data types together. I have also found it useful to put fixed-width columns such as integers at the beginning of a table. In many cases this can yield some extra percentage points of speed, simply because rows are a bit smaller than they would be without this little rule.
Get some great hints and tips about indexes in my post, "Find and Fix a Missing PostgreSQL Index".
UPDATE September 2023: This tool is now called pg_resetwal; see the PostgreSQL documentation about it here. The blog below will be preserved for reference. See this blog by Laurenz Albe about corrupting databases to read more about pg_resetwal.
PostgreSQL has proven to be one of the most reliable Open Source databases around. In the past we have dealt with customers who have achieved superior uptimes and who are really happy with PostgreSQL. But even if PostgreSQL does its job day after day after day, there are components which can simply fail and cause trouble. In short: hardware may crash, filesystems can fail, and so on. In 99% of all cases there is a way around total disaster: either you have a replica which protects you from downtime, or you have a reasonably good backup to help out. But what if there is really no option left? You have to make this database work again, no matter what (even if it is just a partial or unreliable restore). pg_resetxlog may be your last helping hand in this case.
Killing an instance
As you might know the purpose of the transaction log (= xlog or WAL) is to restore order and consistency in case of a crash. If the xlog is broken, you might face disaster. This is exactly what we want 😉
So, to cause some disaster I sent a “kill -9” to a database system under load to stop it. In case of a restart this would lead to a recovery process and PostgreSQL would fix things on startup by replaying the transaction log. Let us disturb the replay process a little ...
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007C bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.0280045 s, 599 MB/s
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007D bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.0304789 s, 550 MB/s
[hs@jacqueline pg_xlog]$ dd if=/dev/zero of=00000001000000050000007F bs=1024 count=16384
16384+0 records in
16384+0 records out
16777216 bytes (17 MB) copied, 0.028223 s, 594 MB/s
What I have just done is really nasty. I have overwritten some transaction log files found in the pg_xlog directory with zeros. This should instantly make sure that the recovery process ends up in hell. Let us check whether my attempt to really kill things has succeeded:
[hs@jacqueline test_db]$ pg_ctl -D /data start
server starting
LOG:  database system was shut down at 2014-08-08 12:17:00 CEST
LOG:  invalid magic number 0000 in log segment 00000001000000050000007B, offset 0
LOG:  invalid primary checkpoint record
LOG:  invalid magic number 0000 in log segment 00000001000000050000007B, offset 0
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 18807) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
Wow, this is exactly what I have intended to do. Nothing works anymore so it is time to fix things and demonstrate the power of pg_resetxlog.
pg_resetxlog at work
pg_resetxlog is a C program shipped with PostgreSQL. So, there is no need to install it on top of PostgreSQL. It is simply there and ready for action in case it is needed. Actually it is very powerful as the following help page shows:
[hs@jacqueline ~]$ pg_resetxlog --help
pg_resetxlog resets the PostgreSQL transaction log.

Usage:
  pg_resetxlog [OPTION]... DATADIR

Options:
  -e XIDEPOCH      set next transaction ID epoch
  -f               force update to be done
  -l XLOGFILE      force minimum WAL starting location for new transaction log
  -m MXID,MXID     set next and oldest multitransaction ID
  -n               no update, just show what would be done (for testing)
  -o OID           set next OID
  -O OFFSET        set next multitransaction offset
  -V, --version    output version information, then exit
  -x XID           set next transaction ID
  -?, --help       show this help, then exit
It is possible to nicely configure pg_resetxlog to make it do precisely what is needed (maybe by just setting xlog back a little). However, in this example we don't care too much - we just want to reset the xlog completely:
[hs@jacqueline test_db]$ pg_resetxlog /data
Transaction log reset
One line is enough to reset the xlog of PostgreSQL.
Let us see if this has worked:
[hs@jacqueline test_db]$ pg_ctl -D . start
server starting
LOG:  database system was shut down at 2014-08-08 12:18:51 CEST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
Voila, the database was able to start up again. The autovacuum daemon has been launched nicely and the system is able to accept connections. A simple test reveals that all databases seem to be around:
[hs@jacqueline test_db]$ psql -l
                             List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
 postgres  | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/hs            +
           |       |          |             |             | hs=CTc/hs
 template1 | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/hs            +
           |       |          |             |             | hs=CTc/hs
 test      | hs    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
(4 rows)
So, where is the problem?
Be cautious
pg_resetxlog is really the last line of defence. If you have to call it, you should expect some data to be lost. Maybe you will face slightly broken tables or corrupted indexes. You should not really trust this database anymore; however, you will still have the chance to extract some data and restore at least something.
In any case: if you are forced to use pg_resetxlog, I strongly advise you not to continue working with the instance in question. Take a backup, check the data PROPERLY, and start over with a new database instance.
With PostgreSQL 9.4 just around the corner and 8.4 just retired, I thought it was time to review those years of development to see how the size of the PostgreSQL code base has evolved over time. In the past couple of years (I can even say "decades" by now) a lot of development has happened, and all those changes had a significant impact on the sheer volume of the code.
Out of plain curiosity I decided to start a little analysis regarding the code size of the PostgreSQL project over time.
To start the analysis I decided to count the lines of plain C code, all header files as well as grammar files (.y). Here is the shell command used on all versions:
find . -type f -name "*.[chy]" | xargs -I {} cat {} | wc -l
The results are pretty interesting:
9.3.5: 1190501
9.2.9: 1146762
9.1.14: 1091400
9.0.18: 1018774
8.4.22: 954521
The interesting thing to see is that the size of the code has grown fairly consistently over time. From one major release to the next, the source code grew by roughly 4% to 7% (for example, about 6.7% from 8.4 to 9.0 and about 3.8% from 9.2 to 9.3), a rate which has been more or less stable over the years.
Historically ...
Just out of curiosity: On 29th of January 1997 PostgreSQL 6.0 was released. A long long time ago. Guess what the size of the code was back then.
The answer: 185437 lines.
PostgreSQL has come a long way since then.
UPDATED JULY 19, 2023 - When I do training here at CYBERTEC, people often ask how to load data fast and efficiently. There are many ways to achieve this. One way not too many people know about is COPY (FREEZE). It makes sure that PostgreSQL can organize data nicely straight away by instantly writing hint bits (as explained in my previous post).
To demonstrate COPY (FREEZE), first generate some data:
$ cat numbers.pl
#!/usr/bin/perl

for ($count = 1; $count <= 500000000; $count++)
{
    print "$count\t$count\n";
}
We will load the data into this table:
CREATE TABLE t_test (
    a integer,
    b integer
);
COPY
To import the data on my test system (single SATA disk in a Linux box) I use a normal COPY:
test=# COPY t_test FROM '/data/numbers.txt';
COPY 500000000
Time: 627509.061 ms
As you can see, it takes around 10 minutes to do so.
COPY (FREEZE)
To speed things up, let's try COPY (FREEZE):
test=# COPY t_test FROM '/data/numbers.txt' (FREEZE);
ERROR:  cannot perform COPY FREEZE because the table was not created or truncated in the current subtransaction
The important thing here is that this command can only operate inside the transaction that created or truncated the table. Otherwise, it doesn't work (I'll explain the reason later). So, let's try the following:
test=# BEGIN;
BEGIN
test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 2/A7D6310
(1 row)

test=*# TRUNCATE t_test;
TRUNCATE
test=*# \timing
Timing is on.
test=*# COPY t_test FROM '/data/numbers.txt' FREEZE;
COPY 500000000
Time: 304082.907 ms
test=*# \timing
Timing is off.
test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 2/A7DC000
(1 row)

test=*# COMMIT;
COMMIT
The execution time goes down dramatically!
COPY (FREEZE)
The main reason it's now more efficient is that PostgreSQL did not have to write transaction log data. That's because I configured wal_level = minimal. With this setting, you can't use WAL archiving or replication, and PostgreSQL only logs enough data for crash recovery. Because of that, PostgreSQL does not need to write WAL for data modifications that take place in the transaction that created or truncated the table. Skipping WAL writes speeds up data loading considerably! Changing wal_level requires a restart of the database, but it may be worth the pain if you need to load data quickly.
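For reference, this is roughly what the relevant settings look like in postgresql.conf; treat it as a sketch and check the values against your PostgreSQL version:

```
# postgresql.conf - log only what is needed for crash recovery
wal_level = minimal
archive_mode = off      # WAL archiving is incompatible with wal_level = minimal
max_wal_senders = 0     # must be 0 when wal_level = minimal
```

Remember that these settings rule out WAL archiving and streaming replication, so they only make sense on systems where fast bulk loading matters more than a standby.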
But there's also another gain: The second major benefit will be visible when the data is read for the first time after the import. As already shown in my previous posts, PostgreSQL has to set hint bits during the first reading operation. This translates to a large number of writes later on down the road.
COPY (FREEZE) has been made to fix exactly this kind of problem - the data are already loaded in the frozen state, and there is no need to write hint bits any more. Frozen rows are unconditionally visible, and that is the reason why the table had to be created or truncated in the same transaction: otherwise, concurrent readers could see the rows before the transaction commits, which would violate transaction isolation.
Analytical applications can benefit greatly from the COPY (FREEZE) feature!
I’ve recently seen some really broad tables (hundreds of columns) in a somewhat inefficiently structured database. Our PostgreSQL support customer complained about strange runtime behavior which could not be easily explained. To help other PostgreSQL users in this same situation, I decided to reveal the secrets of a fairly common performance problem many people don’t understand: Column order and column access.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
Post UPDATED June 2023: PostgreSQL is a highly sophisticated relational database system capable of performing wonderful operations. But sophistication also means that there is a bit of complexity under the surface, which is not always well understood by users. One thing people usually don't know about is hint bits.
Actually, they are an internal optimization which makes sure that visibility checks are sped up considerably. As you might know, PostgreSQL traditionally had to decide for each row whether it was visible to a user or not. The best way to speed up the process is to avoid that - and that is why hint bits were invented.
When PostgreSQL finds a row that is visible to all transactions, it sets a bit in the row header indicating this fact. Next time someone finds this row, they can avoid comparing the row visibility information to their snapshot, because the hint bit already tells them that it must be visible.
To demonstrate how things work, let's create some demo data:
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
A simple Perl script can create the data nicely. In this example, half a billion rows are used (big enough to demonstrate the point and yet small enough to do the test in reasonable time):
[hs@jacqueline blog]$ cat numbers.pl
#!/usr/bin/perl

for ($count = 1; $count <= 500000000; $count++)
{
    print "$count\t$count\n";
}
Autovacuum is disabled on the table so that it cannot interfere with the measurements:

test=# ALTER TABLE t_test SET (autovacuum_enabled = off);
ALTER TABLE
Then data is loaded into PostgreSQL. The easiest way to do that is to use “COPY … FROM PROGRAM” and to simply read things in through a pipe. It works like this:
test=# \timing
Timing is on.
test=# COPY t_test FROM PROGRAM '/path/numbers.pl';
COPY 500000000
Time: 612716.438 ms
On my test hardware (some Intel i5 and a 750 GB SATA disk) it takes around 10 minutes to load the data. Note that default PostgreSQL parameters were used (no checkpoint optimizations). "top" leaves us with output as shown in the next listing:
  PID USER   PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
27570 hs     20   0  257m  24m   23m R 48.6  0.2  0:10.33  postgres
27574 hs     20   0  119m 1584  1252 R 31.6  0.0  0:06.66  numbers.pl
27356 hs     20   0  256m  16m   16m S  1.3  0.1  0:00.18  postgres
You can clearly see that generating numbers actually causes a decent share of the CPU load. Actually, this is good - because it means that PostgreSQL is cheap, relative to the data creation part.
Up until now, nothing special could be observed. Let's try to run a SELECT:
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 359669.393 ms
The interesting part here is not the result of the query. The really interesting part is hidden nicely below the surface.
[hs@jacqueline blog]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff    cache  si  so     bi     bo    in    cs us sy id wa st
 1  0    164 4968092  46412 10534028   0   0   3486   1903     3    13  1  0 91  7  0
 0  0    164 4967960  46412 10534144   0   0      0      0    78   225  0  0 100 0  0
 0  0    164 4967960  46420 10534144   0   0      0      8    72   203  0  0 99  1  0
 0  0    164 4967960  46420 10534144   0   0      0      0   103   301  0  0 100 0  0
 0  1    164 4946260  46420 10555520   0   0  10586     20   246   437  1  0 97  2  0
 1  1    164 4732608  46420 10763488   0   0 104000      0  1403  2046 10  2 76 12  0
 0  1    164 4517592  46420 10972988   0   0 104704      0  1415  2042 10  3 76 11  0
 1  0    164 4301832  46420 11183036   0   0 105024      0  1410  2038 10  3 75 12  0
 1  0    164 4082228  46420 11396740   0   0 106880      4  1449  2048 11  3 75 11  0
 0  1    164 3868576  46420 11604996   0   0 104064      0  1414  2050 10  3 75 12  0
 1  0    164 3650460  46420 11817164   0   0 106176      0  1415  2058 11  2 75 12  0
 0  1    164 3433460  46420 12028640   0   0 105664      0  1419  2078 10  2 75 12  0
 1  1    164 3262588  46420 12194508   0   0  82944  45356  1199  2209  8  3 66 23  0
 1  1    164 3143176  46420 12310644   0   0  58112  43818   949  1738  6  2 75 18  0
 1  1    164 3018680  46420 12432100   0   0  60736  36900   940  1760  6  2 75 17  0
 1  2    164 2897284  46420 12550404   0   0  59136  29726   917  1727  6  2 75 16  0
 0  2    164 2775020  46420 12669348   0   0  59392  36642   912  1746  6  2 76 17  0
 0  2    164 2656848  46420 12784480   0   0  57600  36388   883  1687  6  2 75 17  0
 1  1    164 2537560  46420 12900636   0   0  58048  36644   937  1755  5  2 73 19  0
“bi” (= blocks in) and “bo” (= blocks out) are the two columns of interest here.
We start to read at roughly 100 MB / second, a reasonable value for a desktop PC featuring a single SATA drive. But then things start to go south: at some point, PostgreSQL starts to write data out to disk. Why is that? When a row is loaded, PostgreSQL has no idea whether it is visible to other transactions without checking its visibility all over again. Therefore PostgreSQL marks a row as "visible to everyone" when it is read for the first time. This is exactly where the I/O comes from: the hint bits being written as rows are marked. The advantage is that all subsequent visibility checks are sped up tremendously.
Running the query a second time shows a different picture:

[hs@jacqueline blog]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd   free  buff     cache  si  so     bi  bo    in    cs us sy id wa st
 0  1   164 191076  1688  15325392   0   0 104256   0  1401  2077  8  2 75 15  0
 0  1   164 183140  1688  15334056   0   0 105792   0  1394  2113  8  2 75 15  0
 0  1   164 187728  1688  15329596   0   0 106880   0  1429  2121  9  2 75 15  0
 0  1   164 190828  1688  15327804   0   0 105472   0  1413  2124  8  2 75 15  0
 2  0   164 184008  1688  15335792   0   0 104640   0  1389  2108  8  2 75 15  0
 1  0   164 188348  1688  15336880   0   0 106240   0  1466  2107  9  2 75 14  0
 1  0   164 191944  1692  15337824   0   0 100226   4  1372  2060  8  2 75 15  0
As you can see, there is no more writing going on - just a clean 100+ MB / second. Exactly what we want to see.
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 110263.714 ms
Wow, the time needed to perform this one has gone down dramatically! And more importantly, the execution time is pretty stable:
Time: 110382.362 ms
The theory behind it is that reads happen more frequently than writes, and therefore investing in hint bits makes an awful lot of sense, because it seriously improves run-times.
But there is a bit more. What if we VACUUM the table?
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd   free  buff     cache  si  so    bi     bo    in    cs us sy id wa st
 1  1   164 161936   616  15427528   0   0 50718    356   845  1262  4  2 79 15  0
 0  1   164 168136   616  15421384   0   0 87680    636  1267  2517  6  4 75 15  0
 0  1   164 169624   616  15419204   0   0 88704    624  1254  2529  6  3 75 15  0
 0  1   164 179296   616  15409532   0   0 90304    560  1279  2435  6  4 75 15  0
 0  1   164 185868   616  15402548   0   0 89216    628  1319  1898  7  4 74 15  0
 1  1   164 178924   616  15409580   0   0 88832    628  1295  1960  7  3 75 15  0
 0  1   164 183636   616  15404972   0   0 80166    624  1196  1802  6  3 75 16  0
 0  1   164 186240   616  15402652   0   0 91648    568  1354  1951  7  4 75 15  0
 0  2   164 180660   624  15408296   0   0 89472    670  1307  1982  6  3 75 16  0
 1  1   164 170120   624  15418500   0   0 72260  49908  1080  2035  5  3 70 22  0
 0  2   164 164788   624  15423656   0   0 67072  22782   995  1955  4  2 70 23  0
 0  2   164 168012   624  15420556   0   0 61888  22492   984  1928  4  2 73 20  0
 0  2   164 164044   624  15424496   0   0 60672  29582   959  1887  4  2 73 20  0
 0  2   164 172352   624  15416452   0   0 67712  22524  1053  2097  5  3 72 20  0
Again, some writing is going on. We are seriously bound by the disk here, as the following couple of lines taken from “top” indicate:
  PID USER   PR  NI   VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27570 hs     20   0   257m  30m  29m R 26.3  0.2  9:34.80  postgres
   59 root   20   0      0    0    0 S  1.7  0.0 18:00.13  kswapd0
   36 root   20   0      0    0    0 S  1.0  0.0 10:42.29  kblockd/2
27245 root   20   0      0    0    0 D  0.3  0.0  0:03.53  flush-8:16
30861 root   20   0  15036 1196  836 R  0.3  0.0  0:03.75  top
test=# VACUUM t_test;
VACUUM
Time: 363773.532 ms
The beauty now is that we not only know whether a single row is visible: VACUUM can also record, in the visibility map, whether an entire block is visible.
Again, we see an improvement in speed - but this time, it's not very big (just two seconds):
test=# SELECT count(*) FROM t_test;
   count
-----------
 500000000
(1 row)

Time: 108715.332 ms
The main benefit really comes from those bits attached to a single line - and this benefit is massive.
If you liked this article, check out the next article in this series: Speed up PostgreSQL data loading with COPY (FREEZE)
For further information on how to speed up performance in PostgreSQL, see this post about CLUSTER and HOT updates, or see Laurenz Albe's first article on HOT updates.