(Inscription seen on a T-shirt)
It’s annoying to get error messages caused by encoding problems. But it is more annoying not to get error messages and end up with bad data in the database. I’ll show you how to fix bad encoding.
Some basic facts about encoding in PostgreSQL
Each PostgreSQL database has a server encoding. You can inspect (but not change) the PostgreSQL parameter server_encoding
to see the encoding of the database you are connected to. You can choose the database encoding when you create a database, but you cannot change the server encoding after creating the database. This restriction is necessary because changing the server encoding will render string data corrupt.
Each database session has a client encoding, which you can view and change using the PostgreSQL parameter client_encoding
. If your client uses the C client library libpq, you can determine the client encoding with the environment variable PGCLIENTENCODING
. In addition, psql
has the command \encoding
to view and change the client encoding. psql
tries to guess a good client encoding from the current locale. If the client encoding isn’t specified anywhere, PostgreSQL will set it to the server encoding.
PostgreSQL expects that character data sent by a client are encoded in the client encoding and converts them to the server encoding. Conversely, PostgreSQL converts query results to the client encoding.
PostgreSQL supports many encodings. All supported server encodings must be a superset of ASCII.
Preparing test data
The server encoding of my database is UTF8
. We’ll use this simple table:
CREATE TABLE texts ( id bigint PRIMARY KEY, t text );
Using Linux’ echo
command, I create an SQL script in UTF-8 encoding:
echo -e "INSERT INTO texts VALUES (1, 'sch\xc3\xb6n');" \ > insert_utf8.sql
Then I create a second script, with the same string in LATIN-1 (= ISO 8859-1) encoding:
echo -e "INSERT INTO texts VALUES (2, 'sch\xf6n');" \ > insert_latin1.sql
Both strings contain the German letter “ö”. LATIN-1 encodes it as a single byte, while it is two bytes in UTF-8.
Encoding errors in PostgreSQL
Many people have seen error messages like this one:
psql -c '\encoding UTF8' -f insert_latin1.sql psql:insert_latin1.sql:1: ERROR: invalid byte sequence for encoding "UTF8": 0xf6 0x6e 0x27 0x29
PostgreSQL is very strict about data integrity and won’t let you store strings that are not properly encoded. This may be annoying, but it saves you from worse trouble. Other database systems are more relaxed about data integrity (Oracle lets you store anything as long as client and server encoding are identical).
How corrupt strings can occur in PostgreSQL
Given the above, you’d think that it is impossible to get badly encoded data in PostgreSQL. But that is not the case:
psql -c '\encoding LATIN1' -f insert_utf8.sql INSERT 0 1
There was no error, because both bytes that make up “ö” in UTF-8 also encode valid characters in LATIN-1. This becomes obvious if we query the table with a correctly set client encoding:
psql -c '\encoding UTF8' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön (1 row)
However, such data curruption can remain undetected for a long while. As long as you consistently set the client encoding to the wrong value, the data will appear to be correct:
psql -c '\encoding LATIN1' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön (1 row)
How to see what is really stored in the database
When diagnosing encoding problems, it is very helpful to know what data the database actually stores. After all, there are other reasons for improperly displayed characters:
- your client encoding can be wrong
- your terminal can be mis-configured
To see the actual bytes that the database stores, make sure that you set the parameter bytea_output
to its default value hex
, then convert the offending string to the data type bytea
:
SELECT convert_to(t, 'SQL_ASCII') FROM texts WHERE id = 1; t ════════════════════ \x736368c383c2b66e (1 row)
The first three characters are ASCII characters and are encoded as single bytes. The “ö” now has become four bytes (c3 83 c2 b6), where the first two bytes encode “Ô and the last two bytes encode “¶”.
Fixing bad encoding that is consistent
Since you cannot change the server encoding, the only way to do that is to export and import the database. How exactly you do that depends on how bad the mess is.
If all your non-ASCII characters have the same problem (the client encoding was always set to the same wrong value), the case is comparatively simple. You use plain format pg_dump
to export the database while explicitly setting the bad client encoding. In our case, that would look like
pg_dump -E LATIN1 -t texts -f dump.sql dbname
Then you edit dump.sql
with a text editor and change the line that sets the client encoding to the value that actually matches the data:
SET client_encoding = 'UTF8';
Now import the modified dump.sql
using psql
, and you are done.
Fixing inconsistently encoded data
Adding a row with a different encoding to the test table
We import the second SQL script to get a second, correctly encoded row in our test table:
psql -c '\encoding LATIN1' -f insert_latin1.sql INSERT 0 1
With the client encoding that is correct for my terminal, the data now look like this:
psql -c '\encoding UTF8' -c 'TABLE texts' id │ t ════╪════════ 1 │ schön 2 │ schön (2 rows)
Clearly, this case is more difficult than the previous one. There is no absolutely safe way to determine which data are correctly encoded and which ones are not. However, we can repair most of the damage as follows.
Exporting the data in parts
First, we create four dump files:
pg_dump --section=pre-data -t texts -f predata.sql dbname pg_dump --section=data -E LATIN1 --inserts \ -t texts -f data_latin1.sql dbname pg_dump --section=data -E UTF8 --inserts --on-conflict-do-nothing \ -t texts -f data_utf8.sql dbname pg_dump --section=post-data -t texts -f postdata.sql dbname
The option --on-conflict-do-nothing
is new in PostgreSQL v12. You can omit that option if you have to work with an older version of pg_dump
, but in that case you will have to ignore errors during the import of data_utf8.sql
.
The next step is to edit data_latin1.sql
and change the line that sets client_encoding
to the value UTF8
.
Importing the data into a new database
Then simply import the four files in a slightly different order into a new database using psql
:
psql -d newdb -f predata.sql psql -d newdb -f data_latin1.sql 2>/dev/null psql -d newdb -f postdata.sql psql -d newdb -f data_utf8.sql
During the import of data_latin1.sql
, we get an error for the rows with correctly encoded data, but psql
imports the rows with incorrectly encoded data correctly. Importing postdata.sql
creates the primary key constraint for the table. After that, importing data_utf8.sql
will skip the rows that we already managed to import, but insert the rows that are missing after the import of data_latin1.sql
.
The result is convincing:
psql -d newdb -c 'TABLE texts' id │ t ════╪═══════ 1 │ schön 2 │ schön (2 rows)
Conclusion
Incorrectly encoded strings are an annoying problem in a database. If we know what encodings are present in the database, we can fix most of the corruption with a carefully designed series of exports and imports. The downtime incurred by such data repair is something we cannot avoid. It is possible that some incorrectly encoded strings remain; you will have to find and fix those manually.
If you are interested in data corruption, you may want to read my articles about collations and data corruption or what to watch out for if you want to avoid data corruption.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.