PostgreSQL: Bulk loading huge amounts of data

05.2021 | Category: Performance | Tags: bulk loading, zheap

05.2021

Bulk loading is the quickest way to import large amounts of data into a PostgreSQL database. There are various ways to facilitate large-scale imports, and many different ways to scale are also available. This post will show you how to use some of these tricks, and explain how fast importing works. You can use this knowledge to optimize data warehousing or any other data-intensive workload.

Table of Contents

There are several things to take into consideration in order to speed up bulk loading of massive amounts of data using PostgreSQL:

INSERT vs. COPY
Optimizing checkpoints
Logged vs. unlogged tables
Recreating indexes
Enabled and disabled triggers
Improving column order and space consumption

Let us take a look at these things in greater detail.

INSERT vs. COPY

The first thing to consider is that COPY is usually a LOT better than plain inserts. The reason is that INSERT has a lot of overhead. People often ask: What kind of overhead is there? What makes COPY so much faster than INSERT? There are a variety of reasons: In the case of INSERT, every statement has to check for locks, check for the existence of the table and the columns in the table, check for permissions, look up data types and so on. In the case of COPY, this is only done once, which is a lot faster. Whenever you want to write large amounts of data, data COPY is usually the way to go.

To show what kind of impact this change has in terms of performance, I have compiled a short example.

Let’s create a table as well as some sample data:

test=# CREATE TABLE t_sample
(
     a              varchar(50),
     b              int,
     c              varchar(50),
     d              int
);
CREATE TABLE

test=# CREATE TABLE t_sample

(

a varchar(50),

b int,

c varchar(50),

d int

);

CREATE TABLE

The sample table consists of 4 columns which is pretty simple.

In the next step, we will compile a script containing 1 million INSERT statements in a single transaction:

BEGIN;
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
…
COMMIT;

BEGIN;

INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);

…

COMMIT;

Running the script can be done using psql:

iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real 1m20.883s
user 0m11.515s
sys 0m10.070s

iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real 1m20.883s

user 0m11.515s

sys 0m10.070s

We need around 81 seconds to run this simple test, which is A LOT of time. Single INSERT statements are therefore obviously not the solution to perform quick imports and efficient bulk loading.

As I have already mentioned, COPY is a lot more efficient than INSERT, so let’s use the same data, but feed it to COPY instead;

COPY t_sample FROM stdin;
abcd 1 abcd 1
abcd 1 abcd 1
abcd 1 abcd 1
abcd 1 abcd 1
…
.

COPY t_sample FROM stdin;

abcd 1 abcd 1

…

Running the script is again an easy thing to do:

iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real 0m2.646s
user 0m0.110s
sys 0m0.043s

iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real 0m2.646s

user 0m0.110s

sys 0m0.043s

The speed has improved from 81 to just 2.6 seconds:

Keep in mind that I have executed this test on a fairly old machine, on a totally untuned database. On modern hardware and on a more modern operating system, a lot more can be achieved than on my local iMac desktop machine. Loading 1 million lines or more is not uncommon in the real world. Of course, this data depends on the length of a “record” and so on. However, it is important to get a feeling for what is possible and what is not.

NOTE: Runtimes might vary. This has many reasons. One of them is certainly related to the hardware in use here. We have seen that many SSDs provide us with quite unstable response times.

Adjusting checkpoints for faster bulk loading

The PostgreSQL configuration does have an impact on bulk loading performance. There are many configuration parameters which are vital to database performance, and loading in particular. However, I explicitly want to focus your attention on checkpoint and I/O performance. If you want to load billions of rows, I/O is king. There are various angles to approach the topic:

Reduce the amount of data written
- Make tables smaller if possible (column order)
- Reduce the amount of WAL written
Write data more efficiently
- Longer checkpoint distances
- Better I/O scheduling

The following settings are important:

max_wal_size: Maximum amount of WAL to be created (soft limit)
checkpoint_completion_target: Control checkpoint behavior

In general it is a REALLY good idea to stretch checkpoints by a significant amount.

Setting this value to 100 or 200 GB in case of bulk-load intense workloads is definitely not out of scope.

Keep in mind that increased checkpoint distances DO NOT put your server at risk. It merely affects the way PostgreSQL writes data. Also keep in mind that more disk space will be consumed and recovery might take longer, in case of a crash.

If you want to learn more about checkpointing, check out this article about reducing the amount of WAL written.

CREATE TABLE vs. CREATE UNLOGGED TABLE

However, what if there were a way to get rid of WAL altogether? Well, there is one. It is called an “unlogged table”. What is the general idea? Often we got the following sequence of events:

Load large amount of data into a PostgreSQL (“staging area”)
Perform some aggregations
Drop the initial import

This is the ideal scenario to use the WAL bypass provided by unlogged tables:

test=# DROP TABLE t_sample ;
DROP TABLE
test=# CREATE UNLOGGED TABLE t_sample
(
     a            varchar(50),
     b            int,
     c            varchar(50),
     d            int
);
CREATE TABLE
test=# d t_sample
Unlogged table 'public.t_sample'
Column  | Type                  | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
a       | character varying(50) |           |          |
b       | integer               |           |          |
c       | character varying(50) |           |          |
d       | integer               |           |          |

test=# DROP TABLE t_sample ;

DROP TABLE

test=# CREATE UNLOGGED TABLE t_sample

(

a varchar(50),

b int,

c varchar(50),

d int

);

CREATE TABLE

test=# d t_sample

Unlogged table 'public.t_sample'

Column | Type | Collation | Nullable | Default

--------+-----------------------+-----------+----------+---------

a | character varying(50) | | |

b | integer | | |

c | character varying(50) | | |

d | integer | | |

Let’s load the same data again:

iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real 0m59.296s
user 0m10.597s
sys 0m9.417s

iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real 0m0.618s
user 0m0.107s
sys 0m0.038s

iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real 0m59.296s

user 0m10.597s

sys 0m9.417s

iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real 0m0.618s

user 0m0.107s

sys 0m0.038s

As you can see, the entire thing is a lot faster. 81 seconds vs. 59 seconds and 2.6 vs. 0.6 seconds. The difference is massive.

The reason is that an unlogged table does not have to write the data twice (no WAL needed). However, this comes with a price tag attached to it:

In case of a normal shutdown, an unlogged table is just like a normal table
In case of a crash, an unlogged table is guaranteed to be empty
The content of an unlogged table is not replicated

These restrictions imply that an unlogged table is not suitable for storing “normal” data. However, it is ideal for staging areas and bulk loading.

Tables can be made logged and unlogged. Many people expect these to be cheap operations but this is not true.

Let’s take a look and see what happens:

test=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
        5/3AC7CCD0
(1 row)

test=# ALTER TABLE t_sample SET LOGGED;
ALTER TABLE

test=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
         5/3F9048A8
(1 row)

test=# SELECT pg_current_wal_lsn();

pg_current_wal_lsn

--------------------

5/3AC7CCD0

(1 row)

test=# ALTER TABLE t_sample SET LOGGED;

ALTER TABLE

test=# SELECT pg_current_wal_lsn();

pg_current_wal_lsn

--------------------

5/3F9048A8

(1 row)

In addition to setting the table from UNLOGGED to LOGGED, I have measured the current WAL position.

What we can see is that a lot of data has been written:

test=# SELECT '5/3F9048A8'::pg_lsn - '5/3AC7CCD0'::pg_lsn;
  ?column?
-----------
  80247768
(1 row)

Time: 11.298 ms

test=# SELECT '5/3F9048A8'::pg_lsn - '5/3AC7CCD0'::pg_lsn;

?column?

-----------

80247768

(1 row)

Time: 11.298 ms

Wow, we have produced 80 MB of WAL (if you do exactly one COPY on an empty table - the amount will grow if you run more imports). In case of COPY + INSERT, the volume will be a lot higher.

From this, we draw the conclusion that if we want to do efficient bulk loading setting a table from LOGGED to UNLOGGED, importing the data and setting it back to LOGGED might not be the best of all ideas - because as soon as a table is set back to LOGGED, the entire content of the table has to be sent to the WAL, to make sure that the replicas can receive the content of the table.

Direct import vs. recreating indexes

test=# SELECT count(*) FROM t_sample;
  count
---------
1000001
(1 row)

test=# CREATE TABLE t_index (LIKE t_sample);
CREATE TABLE

test=# CREATE INDEX idx_a ON t_index (a);
CREATE INDEX
test=# CREATE INDEX idx_b ON t_index (b);
CREATE INDEX

test=# timing
Timing is on.
test=# INSERT INTO t_index SELECT * FROM t_sample;
INSERT 0 1000001
Time: 8396.210 ms (00:08.396)

test=# SELECT count(*) FROM t_sample;

count

---------

1000001

(1 row)

test=# CREATE TABLE t_index (LIKE t_sample);

CREATE TABLE

test=# CREATE INDEX idx_a ON t_index (a);

CREATE INDEX

test=# CREATE INDEX idx_b ON t_index (b);

CREATE INDEX

test=# timing

Timing is on.

test=# INSERT INTO t_index SELECT * FROM t_sample;

INSERT 0 1000001

Time: 8396.210 ms (00:08.396)

It takes around 8 seconds to copy the data over. Let’s try the same thing by creating the indexes later:

test=# CREATE TABLE t_noindex (LIKE t_sample);
CREATE TABLE
test=# INSERT INTO t_noindex SELECT * FROM t_sample;
INSERT 0 1000001
Time: 4789.017 ms (00:04.789)
test=# SET maintenance_work_mem TO '1 GB';
SET
Time: 13.059 ms
test=# CREATE INDEX idx_aa ON t_noindex (a);
CREATE INDEX
Time: 1151.521 ms (00:01.152)
test=# CREATE INDEX idx_bb ON t_noindex (b);
CREATE INDEX
Time: 1086.972 ms (00:01.087)

test=# CREATE TABLE t_noindex (LIKE t_sample);

CREATE TABLE

test=# INSERT INTO t_noindex SELECT * FROM t_sample;

INSERT 0 1000001

Time: 4789.017 ms (00:04.789)

test=# SET maintenance_work_mem TO '1 GB';

SET

Time: 13.059 ms

test=# CREATE INDEX idx_aa ON t_noindex (a);

CREATE INDEX

Time: 1151.521 ms (00:01.152)

test=# CREATE INDEX idx_bb ON t_noindex (b);

CREATE INDEX

Time: 1086.972 ms (00:01.087)

We can see that the copy process (= INSERT) is a lot faster than before. In total, it is quicker to produce the index later. Also keep in mind that I am using synthetic data on Mac OSX (not too efficient) here. If you repeat the test with a lot more real data, the difference is a lot higher.

Create indexes after importing data if possible.

Enabled triggers vs. disabled triggers

Triggers are also an important factor. One could say that triggers are “the natural enemy” of bulk loading performance. Let’s take a look at the following example:

iMac:~ hs$ head -n 20 /tmp/bulk.sql
BEGIN;

CREATE FUNCTION dummy()
RETURNS trigger AS
$
    BEGIN
       NEW.b := NEW.b + 1;
       NEW.d := NEW.d + 1;
       RETURN NEW;
    END;
$ LANGUAGE 'plpgsql';

CREATE TRIGGER mytrig BEFORE INSERT ON t_sample
FOR EACH ROW EXECUTE PROCEDURE dummy();

COPY t_sample FROM stdin;
abcd 1 abcd 1
abcd 1 abcd 1
abcd 1 abcd 1
abcd 1 abcd 1

iMac:~ hs$ head -n 20 /tmp/bulk.sql

BEGIN;

CREATE FUNCTION dummy()

RETURNS trigger AS

BEGIN

NEW.b := NEW.b + 1;

NEW.d := NEW.d + 1;

RETURN NEW;

END;

$ LANGUAGE 'plpgsql';

CREATE TRIGGER mytrig BEFORE INSERT ON t_sample

FOR EACH ROW EXECUTE PROCEDURE dummy();

COPY t_sample FROM stdin;

abcd 1 abcd 1

Our trigger is really simple. All it does is to modify two entries in our data. However, the trigger will add an extra function call to every row, which really adds up.
In our case, we have got the following data: The variation with trigger is around 3 times slower. However, the real difference highly depends on the complexity of the trigger, the size of a row and a lot more. There is no way to state that “a trigger slows things down by a factor of X”. One has to see, case-by-case, what happens.

Optimizing column order for bulk loading

There is more to importing large amounts of data into PostgreSQL than meets the eye. So far, we have optimized checkpoints, touched indexes, triggers and so on. But what about the column order? Let’s try to find out.

In PostgreSQL, column order does make a real difference. It is generally a good idea to put “fixed length” columns in front. In other words: int8, int4, timestamptz and so on should be at the beginning of the table. Variable length data types such as varchar, text and so on should be at the end of the table. The reason for this is that CPU alignment is an issue on disk. This is true for normal heap tables (not for zheap).

Shrinking the size of a table without changing the content can speed things up, because it helps to avoid or reduce one of the key bottlenecks when bulk loading data: I/O. Check out this article to find out more.

Tooling for bulk loading

If what you have seen so far is still not enough, we can recommend some tools to improve bulk loading even more. The following tools can be recommended:

Both tools are very well known and widely used. You can use them safely.

If you have further questions regarding these tools, please don’t hesitate to ask in the comment section, or send us an email.

Finally …

If you want to know more about PostgreSQL performance, we also recommend checking out our consulting services. We help you to tune your database and make sure that your servers are operating perfectly.

4 responses to “PostgreSQL: Bulk loading huge amounts of data”

peroje says:

July 2, 2021 at 2:35 pm

wow, thank you so much for amazing! guide, learned so much from it!

Reply
marcus says:

July 8, 2021 at 1:16 pm

Amazing article! Congrats!!

Reply
HeatZync says:

August 23, 2021 at 7:32 am

Thanks for a good article Hans-Jürgen. Say, do you know whether PostgreSQL does some optimization when inserting multiple records in "batch", e.g. using multiple threads when processing multiple INSERT statements in a single transaction?

Reply
Touring Tim says:

February 22, 2023 at 2:55 am

The first test copy vs insert isn't a like for like. Either have 3 copy statements or 1 insert statement with multiple values

Reply

PostgreSQL: Bulk loading huge amounts of data

INSERT vs. COPY

The speed has improved from 81 to just 2.6 seconds:

Adjusting checkpoints for faster bulk loading

In general it is a REALLY good idea to stretch checkpoints by a significant amount.

CREATE TABLE vs. CREATE UNLOGGED TABLE

Let’s take a look and see what happens:

Direct import vs. recreating indexes

Enabled triggers vs. disabled triggers

Optimizing column order for bulk loading

Tooling for bulk loading

Finally …

4 responses to “PostgreSQL: Bulk loading huge amounts of data”

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts