Bulk loading is the quickest way to import large amounts of data into a PostgreSQL database. There are various ways to facilitate large-scale imports, and many different ways to scale are also available. This post will show you how to use some of these tricks, and explain how fast importing works. You can use this knowledge to optimize data warehousing or any other data-intensive workload.
There are several things to take into consideration in order to speed up bulk loading of massive amounts of data using PostgreSQL:
- INSERT vs. COPY
- Optimizing checkpoints
- Logged vs. unlogged tables
- Recreating indexes
- Enabled and disabled triggers
- Improving column order and space consumption
Let us take a look at these things in greater detail.
INSERT vs. COPY
The first thing to consider is that COPY is usually a LOT better than plain inserts. The reason is that INSERT has a lot of overhead. People often ask: What kind of overhead is there? What makes COPY so much faster than INSERT? There are a variety of reasons: In the case of INSERT, every statement has to check for locks, check for the existence of the table and the columns in the table, check for permissions, look up data types and so on. In the case of COPY, this is only done once, which is a lot faster. Whenever you want to write large amounts of data, data COPY is usually the way to go.
To show what kind of impact this change has in terms of performance, I have compiled a short example. Let’s create a table as well as some sample data:
test=# CREATE TABLE t_sample ( a varchar(50), b int, c varchar(50), d int ); CREATE TABLE
The sample table consists of 4 columns which is pretty simple. In the next step, we will compile a script containing 1 million INSERT statements in a single transaction:
BEGIN; INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1); INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1); INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1); … COMMIT;
Running the script can be done using psql:
iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null real 1m20.883s user 0m11.515s sys 0m10.070s
We need around 81 seconds to run this simple test, which is A LOT of time. Single INSERT statements are therefore obviously not the solution to perform quick imports and efficient bulk loading.
As I have already mentioned, COPY is a lot more efficient than INSERT, so let’s use the same data, but feed it to COPY instead;
COPY t_sample FROM stdin; abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 … \.
Running the script is again an easy thing to do:
iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null real 0m2.646s user 0m0.110s sys 0m0.043s
Wow, the speed has improved from 81 to just 2.6 seconds:
Keep in mind that I have executed this test on a fairly old machine, on a totally untuned database. On modern hardware and on a more modern operating system, a lot more can be achieved than on my local iMac desktop machine. Loading 1 million lines or more is not uncommon in the real world. Of course, this data depends on the length of a “record” and so on. However, it is important to get a feeling for what is possible and what is not.
NOTE: Runtimes might vary. This has many reasons. One of them is certainly related to the hardware in use here. We have seen that many SSDs provide us with quite unstable response times.
Adjusting checkpoints for faster bulk loading
The PostgreSQL configuration does have an impact on bulk loading performance. There are many configuration parameters which are vital to database performance, and loading in particular. However, I explicitly want to focus your attention on checkpoint and I/O performance. If you want to load billions of rows, I/O is king. There are various angles to approach the topic:
- Reduce the amount of data written
- Make tables smaller if possible (column order)
- Reduce the amount of WAL written
- Write data more efficiently
- Longer checkpoint distances
- Better I/O scheduling
The following settings are important:
- max_wal_size: Maximum amount of WAL to be created (soft limit)
- checkpoint_completion_target: Control checkpoint behavior
In general it is a REALLY good idea to stretch checkpoints by a significant amount. Setting this value to 100 or 200 GB in case of bulk-load intense workloads is definitely not out of scope.
Keep in mind that increased checkpoint distances DO NOT put your server at risk. It merely affects the way PostgreSQL writes data. Also keep in mind that more disk space will be consumed and recovery might take longer, in case of a crash.
If you want to learn more about checkpointing, check out this article about reducing the amount of WAL written.
CREATE TABLE vs. CREATE UNLOGGED TABLE
However, what if there were a way to get rid of WAL altogether? Well, there is one. It is called an “unlogged table”. What is the general idea? Often we got the following sequence of events:
- Load large amount of data into a PostgreSQL (“staging area”)
- Perform some aggregations
- Drop the initial import
This is the ideal scenario to use the WAL bypass provided by unlogged tables:
test=# DROP TABLE t_sample ; DROP TABLE test=# CREATE UNLOGGED TABLE t_sample ( a varchar(50), b int, c varchar(50), d int ); CREATE TABLE test=# \d t_sample Unlogged table "public.t_sample" Column | Type | Collation | Nullable | Default --------+-----------------------+-----------+----------+--------- a | character varying(50) | | | b | integer | | | c | character varying(50) | | | d | integer | | |
Let’s load the same data again:
iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null real 0m59.296s user 0m10.597s sys 0m9.417s iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null real 0m0.618s user 0m0.107s sys 0m0.038s
As you can see, the entire thing is a lot faster. 81 seconds vs. 59 seconds and 2.6 vs. 0.6 seconds. The difference is massive.
The reason is that an unlogged table does not have to write the data twice (no WAL needed). However, this comes with a price tag attached to it:
- In case of a normal shutdown, an unlogged table is just like a normal table
- In case of a crash, an unlogged table is guaranteed to be empty
- The content of an unlogged table is not replicated
These restrictions imply that an unlogged table is not suitable for storing “normal” data. However, it is ideal for staging areas and bulk loading.
Tables can be made logged and unlogged. Many people expect these to be cheap operations but this is not true. Let’s take a look and see what happens:
test=# SELECT pg_current_wal_lsn(); pg_current_wal_lsn -------------------- 5/3AC7CCD0 (1 row) test=# ALTER TABLE t_sample SET LOGGED; ALTER TABLE test=# SELECT pg_current_wal_lsn(); pg_current_wal_lsn -------------------- 5/3F9048A8 (1 row)
In addition to setting the table from UNLOGGED to LOGGED, I have measured the current WAL position. What we can see is that a lot of data has been written:
test=# SELECT '5/3F9048A8'::pg_lsn - '5/3AC7CCD0'::pg_lsn; ?column? ----------- 80247768 (1 row) Time: 11.298 ms
Wow, we have produced 80 MB of WAL (if you do exactly one COPY on an empty table – the amount will grow if you run more imports). In case of COPY + INSERT, the volume will be a lot higher.
From this, we draw the conclusion that if we want to do efficient bulk loading setting a table from LOGGED to UNLOGGED, importing the data and setting it back to LOGGED might not be the best of all ideas – because as soon as a table is set back to LOGGED, the entire content of the table has to be sent to the WAL, to make sure that the replicas can receive the content of the table.
Direct import vs. recreating indexes
test=# SELECT count(*) FROM t_sample; count --------- 1000001 (1 row) test=# CREATE TABLE t_index (LIKE t_sample); CREATE TABLE test=# CREATE INDEX idx_a ON t_index (a); CREATE INDEX test=# CREATE INDEX idx_b ON t_index (b); CREATE INDEX test=# \timing Timing is on. test=# INSERT INTO t_index SELECT * FROM t_sample; INSERT 0 1000001 Time: 8396.210 ms (00:08.396)
It takes around 8 seconds to copy the data over. Let’s try the same thing by creating the indexes later:
test=# CREATE TABLE t_noindex (LIKE t_sample); CREATE TABLE test=# INSERT INTO t_noindex SELECT * FROM t_sample; INSERT 0 1000001 Time: 4789.017 ms (00:04.789) test=# SET maintenance_work_mem TO '1 GB'; SET Time: 13.059 ms test=# CREATE INDEX idx_aa ON t_noindex (a); CREATE INDEX Time: 1151.521 ms (00:01.152) test=# CREATE INDEX idx_bb ON t_noindex (b); CREATE INDEX Time: 1086.972 ms (00:01.087)
We can see that the copy process (= INSERT) is a lot faster than before. In total, it is quicker to produce the index later. Also keep in mind that I am using synthetic data on Mac OSX (not too efficient) here. If you repeat the test with a lot more real data, the difference is a lot higher.
The bottom line is:
Create indexes after importing data if possible.
Enabled triggers vs. disabled triggers
Triggers are also an important factor. One could say that triggers are “the natural enemy” of bulk loading performance. Let’s take a look at the following example:
iMac:~ hs$ head -n 20 /tmp/bulk.sql BEGIN; CREATE FUNCTION dummy() RETURNS trigger AS $$ BEGIN NEW.b := NEW.b + 1; NEW.d := NEW.d + 1; RETURN NEW; END; $$ LANGUAGE 'plpgsql'; CREATE TRIGGER mytrig BEFORE INSERT ON t_sample FOR EACH ROW EXECUTE PROCEDURE dummy(); COPY t_sample FROM stdin; abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1 abcd 1
Our trigger is really simple. All it does is to modify two entries in our data. However, the trigger will add an extra function call to every row, which really adds up.
In our case, we have got the following data: The variation with trigger is around 3 times slower. However, the real difference highly depends on the complexity of the trigger, the size of a row and a lot more. There is no way to state that “a trigger slows things down by a factor of X”. One has to see, case-by-case, what happens.
Optimizing column order for bulk loading
There is more to importing large amounts of data into PostgreSQL than meets the eye. So far, we have optimized checkpoints, touched indexes, triggers and so on. But what about the column order? Let’s try to find out.
In PostgreSQL, column order does make a real difference. It is generally a good idea to put “fixed length” columns in front. In other words: int8, int4, timestamptz and so on should be at the beginning of the table. Variable length data types such as varchar, text and so on should be at the end of the table. The reason for this is that CPU alignment is an issue on disk. This is true for normal heap tables (not for zheap).
Shrinking the size of a table without changing the content can speed things up, because it helps to avoid or reduce one of the key bottlenecks when bulk loading data: I/O. Check out this article to find out more.
Tooling for bulk loading
If what you have seen so far is still not enough, we can recommend some tools to improve bulk loading even more. The following tools can be recommended:
Both tools are very well known and widely used. You can use them safely.
If you have further questions regarding these tools, please don’t hesitate to ask in the comment section, or send us an email.
If you want to know more about PostgreSQL performance, we also recommend checking out our consulting services. We help you to tune your database and make sure that your servers are operating perfectly.