bad encoding in PostgreSQL is often annoying to fix -- cover image for blog post by Laurenz Albe about how to fix bad encoding
(Inscription seen on a T-shirt)

It’s annoying to get error messages caused by encoding problems. But it is more annoying not to get error messages and end up with bad data in the database. I’ll show you how to fix bad encoding.

Some basic facts about encoding in PostgreSQL

Each PostgreSQL database has a server encoding. You can inspect (but not change) the PostgreSQL parameter server_encoding to see the encoding of the database you are connected to. You can choose the database encoding when you create a database, but you cannot change the server encoding after creating the database. This restriction is necessary because changing the server encoding will render string data corrupt.

Each database session has a client encoding, which you can view and change using the PostgreSQL parameter client_encoding. If your client uses the C client library libpq, you can determine the client encoding with the environment variable PGCLIENTENCODING. In addition, psql has the command \encoding to view and change the client encoding. psql tries to guess a good client encoding from the current locale. If the client encoding isn’t specified anywhere, PostgreSQL will set it to the server encoding.

PostgreSQL expects that character data sent by a client are encoded in the client encoding and converts them to the server encoding. Conversely, PostgreSQL converts query results to the client encoding.

PostgreSQL supports many encodings. All supported server encodings must be a superset of ASCII.

Preparing test data

The server encoding of my database is UTF8. We’ll use this simple table:

   id bigint PRIMARY KEY,
   t  text

Using Linux’ echo command, I create an SQL script in UTF-8 encoding:

echo -e "INSERT INTO texts VALUES (1, 'sch\xc3\xb6n');" \
    > insert_utf8.sql

Then I create a second script, with the same string in LATIN-1 (= ISO 8859-1) encoding:

echo -e "INSERT INTO texts VALUES (2, 'sch\xf6n');" \
    > insert_latin1.sql

Both strings contain the German letter “ö”. LATIN-1 encodes it as a single byte, while it is two bytes in UTF-8.

Encoding errors in PostgreSQL

Many people have seen error messages like this one:

psql -c '\encoding UTF8' -f insert_latin1.sql 
psql:insert_latin1.sql:1: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x6e 0x27 0x29

PostgreSQL is very strict about data integrity and won’t let you store strings that are not properly encoded. This may be annoying, but it saves you from worse trouble. Other database systems are more relaxed about data integrity (Oracle lets you store anything as long as client and server encoding are identical).

How corrupt strings can occur in PostgreSQL

Given the above, you’d think that it is impossible to get badly encoded data in PostgreSQL. But that is not the case:

psql -c '\encoding LATIN1' -f insert_utf8.sql 

There was no error, because both bytes that make up “ö” in UTF-8 also encode valid characters in LATIN-1. This becomes obvious if we query the table with a correctly set client encoding:

psql -c '\encoding UTF8' -c 'TABLE texts'
 id │   t    
  1 │ schön
(1 row)

However, such data curruption can remain undetected for a long while. As long as you consistently set the client encoding to the wrong value, the data will appear to be correct:

psql -c '\encoding LATIN1' -c 'TABLE texts'
 id │   t    
  1 │ schön
(1 row)

How to see what is really stored in the database

When diagnosing encoding problems, it is very helpful to know what data the database actually stores. After all, there are other reasons for improperly displayed characters:

  • your client encoding can be wrong
  • your terminal can be mis-configured

To see the actual bytes that the database stores, make sure that you set the parameter bytea_output to its default value hex, then convert the offending string to the data type bytea:

SELECT convert_to(t, 'SQL_ASCII') FROM texts WHERE id = 1;

(1 row)

The first three characters are ASCII characters and are encoded as single bytes. The “ö” now has become four bytes (c3 83 c2 b6), where the first two bytes encode “Ô and the last two bytes encode “¶”.

Fixing bad encoding that is consistent

Since you cannot change the server encoding, the only way to do that is to export and import the database. How exactly you do that depends on how bad the mess is.

If all your non-ASCII characters have the same problem (the client encoding was always set to the same wrong value), the case is comparatively simple. You use plain format pg_dump to export the database while explicitly setting the bad client encoding. In our case, that would look like

pg_dump -E LATIN1 -t texts -f dump.sql dbname

Then you edit dump.sql with a text editor and change the line that sets the client encoding to the value that actually matches the data:

SET client_encoding = 'UTF8';

Now import the modified dump.sql using psql, and you are done.

Fixing inconsistently encoded data

Adding a row with a different encoding to the test table

We import the second SQL script to get a second, correctly encoded row in our test table:

psql -c '\encoding LATIN1' -f insert_latin1.sql 

With the client encoding that is correct for my terminal, the data now look like this:

psql -c '\encoding UTF8' -c 'TABLE texts'
 id │   t    
  1 │ schön
  2 │ schön
(2 rows)

Clearly, this case is more difficult than the previous one. There is no absolutely safe way to determine which data are correctly encoded and which ones are not. However, we can repair most of the damage as follows.

Exporting the data in parts

First, we create four dump files:

pg_dump --section=pre-data -t texts -f predata.sql dbname

pg_dump --section=data -E LATIN1 --inserts \
    -t texts -f data_latin1.sql dbname

pg_dump --section=data -E UTF8 --inserts --on-conflict-do-nothing \
    -t texts -f data_utf8.sql dbname

pg_dump --section=post-data -t texts -f postdata.sql dbname

The option --on-conflict-do-nothing is new in PostgreSQL v12. You can omit that option if you have to work with an older version of pg_dump, but in that case you will have to ignore errors during the import of data_utf8.sql.

The next step is to edit data_latin1.sql and change the line that sets client_encoding to the value UTF8.

Importing the data into a new database

Then simply import the four files in a slightly different order into a new database using psql:

psql -d newdb -f predata.sql

psql -d newdb -f data_latin1.sql 2>/dev/null

psql -d newdb -f postdata.sql

psql -d newdb -f data_utf8.sql

During the import of data_latin1.sql, we get an error for the rows with correctly encoded data, but psql imports the rows with incorrectly encoded data correctly. Importing postdata.sql creates the primary key constraint for the table. After that, importing data_utf8.sql will skip the rows that we already managed to import, but insert the rows that are missing after the import of data_latin1.sql.

The result is convincing:

psql -d newdb -c 'TABLE texts'
 id │   t   
  1 │ schön
  2 │ schön
(2 rows)


Incorrectly encoded strings are an annoying problem in a database. If we know what encodings are present in the database, we can fix most of the corruption with a carefully designed series of exports and imports. The downtime incurred by such data repair is something we cannot avoid. It is possible that some incorrectly encoded strings remain; you will have to find and fix those manually.

If you are interested in data corruption, you may want to read my articles about collations and data corruption or what to watch out for if you want to avoid data corruption.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.