ALTER TABLE: High-Availability taken care of

01.2018 | Category: Performance | Tags: high availability, table

01.2018

High availability in PostgreSQL: When running an application in production, it might happen that the data structure has to be changed once in a while. Adding columns, dropping columns, etc. might simply be necessary once in a while. However, changing data structures should not be done mindlessly – there are some things you have to take care of. The main issue with DDLs is that in some cases locks are held for quite a long time, which can have serious consequences if you are running PostgreSQL on critical production systems. I hope that this blog can help people to run DDLs more wisely with ALTER TABLE.

Table of Contents

Creating some demo data

As usual some demo data is needed to show, how things work. In case the following script simply creates 10 million simple integer values:

test=# CREATE TABLE data (id int);
CREATE TABLE

test=# INSERT INTO data
       SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000

1

2

3

4

5

6

test=# CREATE TABLE data (id int);

CREATE TABLE

test=# INSERT INTO data

SELECT * FROM generate_series(1, 10000000);

INSERT 0 10000000

Be aware of locking: Indexing

One of the most common issues related to DDLs is the creation of indexes. If you create an index in PostgreSQL, “normally” writes will be blocked while the index is in creation. In case of very large tables this can be quite counterproductive. A large index simply takes time to build and concurrent writes might suffer as shown in the next example:

Connection 1	Connection 2
BEGIN;
CREATE UNIQUE INDEX idx_id ON data (id);	BEGIN;
-- running	INSERT INTO data VALUES (0);
-- running	-- waiting
COMMIT;	-- will proceed
	COMMIT;

If data is really large, the INSERT might have to wait for an unacceptable amount of time. This is especially critical, if the number of concurrent requests is high. Connection pools might get into trouble or your webserver might run out of connections. In short: A solution is needed.

Note that reading is still possible – PostgreSQL will only block writes while the index is built. This is extremely important to point out.

If you cannot afford to lock a table while building an index, CREATE INDEX CONCURRENTLY is the tool of choice. CREATE INDEX CONCURRENTLY takes longer than a “normal” CREATE INDEX, but it will allow for concurrent writes and it helps to avoid excessive table locking. Note that CREATE INDEX CONCURRENTLY is not guaranteed to succeed and might leave you with an invalid index, which has to be dropped manually in certain cases. However, if you are indexing a 1+TB table or so there is no way around concurrent indexing if you are under constant load, which is often the case.

Here is an example: As you can see no conflicts happen and nobody has to wait for an unacceptable amount of time.

Connection 1	Connection 2
CREATE UNIQUE INDEX CONCURRENTLY idx_id2 ON data (id);
-- running	INSERT INTO data VALUES (-1);
-- running	INSERT INTO data VALUES (-2);
-- done	INSERT INTO data VALUES (-3);

Adding columns to tables in PostgreSQL

Adding a new column to a table in PostgreSQL is most likely the most common problem reported by people around the globe. The important thing here is that there are actually two cases:

Adding a column without a default value
Adding a column with a default value

Let us see, what those two cases have in stock for us:

Connection 1	Connection 2
BEGIN;
ALTER TABLE data ADD COLUMN x int;
Time: 1.241 ms	INSERT INTO data VALUES (-4);
COMMIT;	-- waits a VERY short time

BEGIN;
ALTER TABLE data ADD COLUMN y int DEFAULT 0;
-- we need time	INSERT INTO data VALUES (-5);
Time: 11888.547 ms	-- waiting for a long time
COMMIT;	-- finally proceeds

As you can see the ALTER TABLE … ADD COLUMN without the default value is basically done in around 1 millisecond, which is totally fine. The concurrent write does not have to wait for long. As long as the DDL can get the table lock it needs for a short moment, everything will be fine. In PostgreSQL adding a column with no default value is merely a metadata change in the catalog – there is no need to rewrite the data files. Keep in mind that this is not true for all other database engines – especially not for some expensive commercial ones.

The story is a bit different if you need a default value. In this case it has to be written to disk. Our general recommendation is to avoid this kind of operation if possible. If your tables is large, rewriting things is not too attractive.

Making use of tablespaces

Another important thing to consider is the use of tablespaces. While tablespaces are a nice way to scale I/O in general, moving tables around does not come for free. Keep in mind that ALTER TABLE … SET TABLESPACE locks the table and blocks writing operation.

Here is an example:

Connection 1	Connection 2
BEGIN;
ALTER TABLE data SET TABLESPACE some_storage;
-- usually time consuming	INSERT INTO data VALUES (-6);
-- usually time consuming	-- waiting for a long time
COMMIT;	-- finally done

The good thing here is that there is actually a way to get around the table. pg_squeeze (https://github.com/cybertec-postgresql/pg_squeeze) is a tool designed to shrink a bloated table. However, it can do a lot more: You can “abuse” it to move a table from one tablespace to some other storage location. Using Cybertec pg_squeeze you can not only shrink your storage footprint – you can also elegantly avoid heavy locking. Go and check it out.

Adding foreign keys

Foreign keys and constraints are also an important factor. If you want to add foreign keys or CHECK constraints PostgreSQL will verify the content of the table to ensure the correctness of the key. Those checks don’t come for free – PostgreSQL will again lock the table:

test=# ALTER TABLE data
          ADD CONSTRAINT constname
          FOREIGN KEY (id)
          REFERENCES data(id);

1

2

3

4

test=# ALTER TABLE data

ADD CONSTRAINT constname

FOREIGN KEY (id)

REFERENCES data(id);

Adding keys ex-post without locking the table is impossible. Therefore, it is important to think ahead and plan your moves.

Dropping columns in PostgreSQL

Let us come to a more relaxing thing: DROP COLUMN. In PostgreSQL DROP TABLE does not rewrite the table. It simply makes it invisible. VACUUM will take care of cleanup asynchronously. DROP COLUMN is therefore really fast. However, keep in mind that a DROP COLUMN can only start if there are no concurrent reads:

Connection 1	Connection 2
	BEGIN;
	SELECT count(*) FROM data
ALTER TABLE data DROP COLUMN y;	-- takes a while
-- has to wait	-- takes a while
-- has to wait	-- takes a while
-- finally proceeds	COMMIT;

Otherwise DROP COLUMN will simply return in no time.

There is always more …

There is always more to say. Still, the examples outlined in this blog cover some of the most common issues people are facing around the world. Feel free to add your comments and send some feedback.

Read our latest blogs about high availability to find out more.

ALTER TABLE: High-Availability taken care of

Creating some demo data

Be aware of locking: Indexing

Adding columns to tables in PostgreSQL

Making use of tablespaces

Adding foreign keys

Dropping columns in PostgreSQL

There is always more …

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts