ERROR: invalid byte sequence - Fix bad encoding in PostgreSQL

05.2023 | Category: How To | Tags: administration, sql help

05.2023

Category: How To

(Inscription seen on a T-shirt)

Table of Contents

It's annoying to get error messages caused by encoding problems. But it is more annoying not to get error messages and end up with bad data in the database. I'll show you how to fix bad encoding.

Some basic facts about encoding in PostgreSQL

Each PostgreSQL database has a server encoding. You can inspect (but not change) the PostgreSQL parameter server_encoding to see the encoding of the database you are connected to. You can choose the database encoding when you create a database, but you cannot change the server encoding after creating the database. This restriction is necessary because changing the server encoding will render string data corrupt.

Each database session has a client encoding, which you can view and change using the PostgreSQL parameter client_encoding. If your client uses the C client library libpq, you can determine the client encoding with the environment variable PGCLIENTENCODING. In addition, psql has the command encoding to view and change the client encoding. psql tries to guess a good client encoding from the current locale. If the client encoding isn't specified anywhere, PostgreSQL will set it to the server encoding.

PostgreSQL expects that character data sent by a client are encoded in the client encoding and converts them to the server encoding. Conversely, PostgreSQL converts query results to the client encoding.

PostgreSQL supports many encodings. All supported server encodings must be a superset of ASCII.

Preparing test data

The server encoding of my database is UTF8. We'll use this simple table:

CREATE TABLE texts (
   id bigint PRIMARY KEY,
   t  text
);

CREATE TABLE texts (

id bigint PRIMARY KEY,

t text

);

Using Linux' echo command, I create an SQL script in UTF-8 encoding:

echo -e 'INSERT INTO texts VALUES (1, 'schxc3xb6n');' 
    > insert_utf8.sql

1 2	echo -e 'INSERT INTO texts VALUES (1, 'schxc3xb6n');' > insert_utf8.sql

Then I create a second script, with the same string in LATIN-1 (= ISO 8859-1) encoding:

echo -e 'INSERT INTO texts VALUES (2, 'schxf6n');' 
    > insert_latin1.sql

1 2	echo -e 'INSERT INTO texts VALUES (2, 'schxf6n');' > insert_latin1.sql

Both strings contain the German letter “ö”. LATIN-1 encodes it as a single byte, while it is two bytes in UTF-8.

Encoding errors in PostgreSQL

Many people have seen error messages like this one:

psql -c 'encoding UTF8' -f insert_latin1.sql 
psql:insert_latin1.sql:1: ERROR:  invalid byte sequence for encoding 'UTF8': 0xf6 0x6e 0x27 0x29

1 2	psql -c 'encoding UTF8' -f insert_latin1.sql psql:insert_latin1.sql:1: ERROR: invalid byte sequence for encoding 'UTF8': 0xf6 0x6e 0x27 0x29

PostgreSQL is very strict about data integrity and won't let you store strings that are not properly encoded. This may be annoying, but it saves you from worse trouble. Other database systems are more relaxed about data integrity (Oracle lets you store anything as long as client and server encoding are identical).

How corrupt strings can occur in PostgreSQL

Given the above, you'd think that it is impossible to get badly encoded data in PostgreSQL. But that is not the case:

psql -c 'encoding LATIN1' -f insert_utf8.sql 
INSERT 0 1

1 2	psql -c 'encoding LATIN1' -f insert_utf8.sql INSERT 0 1

There was no error, because both bytes that make up “ö” in UTF-8 also encode valid characters in LATIN-1. This becomes obvious if we query the table with a correctly set client encoding:

psql -c 'encoding UTF8' -c 'TABLE texts'
 id │   t    
════╪════════
  1 │ schÃ¶n
(1 row)

psql -c 'encoding UTF8' -c 'TABLE texts'

id │ t

════╪════════

1 │ schÃ¶n

(1 row)

However, such data curruption can remain undetected for a long while. As long as you consistently set the client encoding to the wrong value, the data will appear to be correct:

psql -c 'encoding LATIN1' -c 'TABLE texts'
 id │   t    
════╪════════
  1 │ schön
(1 row)

psql -c 'encoding LATIN1' -c 'TABLE texts'

id │ t

════╪════════

1 │ schön

(1 row)

How to see what is really stored in the database

When diagnosing encoding problems, it is very helpful to know what data the database actually stores. After all, there are other reasons for improperly displayed characters:

your client encoding can be wrong
your terminal can be mis-configured

To see the actual bytes that the database stores, make sure that you set the parameter bytea_output to its default value hex, then convert the offending string to the data type bytea:

SELECT convert_to(t, 'SQL_ASCII') FROM texts WHERE id = 1;

         t          
════════════════════
 x736368c383c2b66e
(1 row)

SELECT convert_to(t, 'SQL_ASCII') FROM texts WHERE id = 1;

════════════════════

x736368c383c2b66e

(1 row)

The first three characters are ASCII characters and are encoded as single bytes. The “ö” now has become four bytes (c3 83 c2 b6), where the first two bytes encode “Ã” and the last two bytes encode “¶”.

Fixing bad encoding that is consistent

Since you cannot change the server encoding, the only way to do that is to export and import the database. How exactly you do that depends on how bad the mess is.

If all your non-ASCII characters have the same problem (the client encoding was always set to the same wrong value), the case is comparatively simple. You use plain format pg_dump to export the database while explicitly setting the bad client encoding. In our case, that would look like

pg_dump -E LATIN1 -t texts -f dump.sql dbname

1	pg_dump -E LATIN1 -t texts -f dump.sql dbname

Then you edit dump.sql with a text editor and change the line that sets the client encoding to the value that actually matches the data:

SET client_encoding = 'UTF8';

1	SET client_encoding = 'UTF8';

Now import the modified dump.sql using psql, and you are done.

Fixing inconsistently encoded data

Adding a row with a different encoding to the test table

We import the second SQL script to get a second, correctly encoded row in our test table:

psql -c 'encoding LATIN1' -f insert_latin1.sql 
INSERT 0 1

1 2	psql -c 'encoding LATIN1' -f insert_latin1.sql INSERT 0 1

With the client encoding that is correct for my terminal, the data now look like this:

psql -c 'encoding UTF8' -c 'TABLE texts'
 id │   t    
════╪════════
  1 │ schÃ¶n
  2 │ schön
(2 rows)

psql -c 'encoding UTF8' -c 'TABLE texts'

id │ t

════╪════════

1 │ schÃ¶n

2 │ schön

(2 rows)

Clearly, this case is more difficult than the previous one. There is no absolutely safe way to determine which data are correctly encoded and which ones are not. However, we can repair most of the damage as follows.

Exporting the data in parts

First, we create four dump files:

pg_dump --section=pre-data -t texts -f predata.sql dbname

pg_dump --section=data -E LATIN1 --inserts 
    -t texts -f data_latin1.sql dbname

pg_dump --section=data -E UTF8 --inserts --on-conflict-do-nothing 
    -t texts -f data_utf8.sql dbname

pg_dump --section=post-data -t texts -f postdata.sql dbname

pg_dump --section=pre-data -t texts -f predata.sql dbname

pg_dump --section=data -E LATIN1 --inserts

-t texts -f data_latin1.sql dbname

pg_dump --section=data -E UTF8 --inserts --on-conflict-do-nothing

-t texts -f data_utf8.sql dbname

pg_dump --section=post-data -t texts -f postdata.sql dbname

The option --on-conflict-do-nothing is new in PostgreSQL v12. You can omit that option if you have to work with an older version of pg_dump, but in that case you will have to ignore errors during the import of data_utf8.sql.

The next step is to edit data_latin1.sql and change the line that sets client_encoding to the value UTF8.

Importing the data into a new database

Then simply import the four files in a slightly different order into a new database using psql:

psql -d newdb -f predata.sql

psql -d newdb -f data_latin1.sql 2>/dev/null

psql -d newdb -f postdata.sql

psql -d newdb -f data_utf8.sql

psql -d newdb -f predata.sql

psql -d newdb -f data_latin1.sql 2>/dev/null

psql -d newdb -f postdata.sql

psql -d newdb -f data_utf8.sql

During the import of data_latin1.sql, we get an error for the rows with correctly encoded data, but psql imports the rows with incorrectly encoded data correctly. Importing postdata.sql creates the primary key constraint for the table. After that, importing data_utf8.sql will skip the rows that we already managed to import, but insert the rows that are missing after the import of data_latin1.sql.

The result is convincing:

psql -d newdb -c 'TABLE texts'
 id │   t   
════╪═══════
  1 │ schön
  2 │ schön
(2 rows)

psql -d newdb -c 'TABLE texts'

id │ t

════╪═══════

1 │ schön

2 │ schön

(2 rows)

Conclusion

Incorrectly encoded strings are an annoying problem in a database. If we know what encodings are present in the database, we can fix most of the corruption with a carefully designed series of exports and imports. The downtime incurred by such data repair is something we cannot avoid. It is possible that some incorrectly encoded strings remain; you will have to find and fix those manually.

If you are interested in data corruption, you may want to read my articles about collations and data corruption or what to watch out for if you want to avoid data corruption.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

2 responses to “ERROR: invalid byte sequence - Fix bad encoding in PostgreSQL”

Andrew Gierth says:

May 30, 2023 at 4:11 pm

Casting text to bytea isn't a good way to see how values are encoded; many text values will give errors on the cast. (It doesn't do what you apparently think: it expects the text value to be the input representation of a bytea value.)

Instead use convert_to(col,'SQL_ASCII') to see the bytes of the text value; SQL_ASCII in this context means "no conversion", so you get the stored bytes regardless of what the server encoding is.

There are a bunch of ways to fix certain kinds of systematic errors without needing to do wholesale dump/restore. The most common errors I've seen from people on IRC are: double-encoded UTF8, and mixed UTF8 and Latin1 or Win1252 in an SQL_ASCII or LATIN1 db. These are both fixable in an automated way with very high reliability, thanks to UTF8's properties.

Reply
- laurenz says:
  
  May 31, 2023 at 10:18 am
  
  Thanks for the tip with convert_to(), that is indeed better.
  
  It is certainly also possible to run UPDATEs with a WHERE condition that finds the problematic strings. That is likely preferable if you know which columns are affected. My method of dump and restore is a simple and dumb approach to the double encoding problem; it might be a good idea if much of the data are affected or if you don't know which columns are affected.
  
  Reply

ERROR: invalid byte sequence - Fix bad encoding in PostgreSQL

Some basic facts about encoding in PostgreSQL

Preparing test data

Encoding errors in PostgreSQL

How corrupt strings can occur in PostgreSQL

How to see what is really stored in the database

Fixing bad encoding that is consistent

Fixing inconsistently encoded data

Adding a row with a different encoding to the test table

Exporting the data in parts

Importing the data into a new database

Conclusion

2 responses to “ERROR: invalid byte sequence - Fix bad encoding in PostgreSQL”

Leave a Reply Cancel reply

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts