CYBERTEC PostgreSQL Logo

ERROR: invalid byte sequence - Fix bad encoding in PostgreSQL

05.2023 / Category: / Tags: |

bad encoding in PostgreSQL is often annoying to fix -- cover image for blog post by Laurenz Albe about how to fix bad encoding
(Inscription seen on a T-shirt)

It's annoying to get error messages caused by encoding problems. But it is more annoying not to get error messages and end up with bad data in the database. I'll show you how to fix bad encoding.

Some basic facts about encoding in PostgreSQL

Each PostgreSQL database has a server encoding. You can inspect (but not change) the PostgreSQL parameter server_encoding to see the encoding of the database you are connected to. You can choose the database encoding when you create a database, but you cannot change the server encoding after creating the database. This restriction is necessary because changing the server encoding will render string data corrupt.

Each database session has a client encoding, which you can view and change using the PostgreSQL parameter client_encoding. If your client uses the C client library libpq, you can determine the client encoding with the environment variable PGCLIENTENCODING. In addition, psql has the command encoding to view and change the client encoding. psql tries to guess a good client encoding from the current locale. If the client encoding isn't specified anywhere, PostgreSQL will set it to the server encoding.

PostgreSQL expects that character data sent by a client are encoded in the client encoding and converts them to the server encoding. Conversely, PostgreSQL converts query results to the client encoding.

PostgreSQL supports many encodings. All supported server encodings must be a superset of ASCII.

Preparing test data

The server encoding of my database is UTF8. We'll use this simple table:

Using Linux' echo command, I create an SQL script in UTF-8 encoding:

Then I create a second script, with the same string in LATIN-1 (= ISO 8859-1) encoding:

Both strings contain the German letter “ö”. LATIN-1 encodes it as a single byte, while it is two bytes in UTF-8.

Encoding errors in PostgreSQL

Many people have seen error messages like this one:

PostgreSQL is very strict about data integrity and won't let you store strings that are not properly encoded. This may be annoying, but it saves you from worse trouble. Other database systems are more relaxed about data integrity (Oracle lets you store anything as long as client and server encoding are identical).

How corrupt strings can occur in PostgreSQL

Given the above, you'd think that it is impossible to get badly encoded data in PostgreSQL. But that is not the case:

There was no error, because both bytes that make up “ö” in UTF-8 also encode valid characters in LATIN-1. This becomes obvious if we query the table with a correctly set client encoding:

However, such data curruption can remain undetected for a long while. As long as you consistently set the client encoding to the wrong value, the data will appear to be correct:

How to see what is really stored in the database

When diagnosing encoding problems, it is very helpful to know what data the database actually stores. After all, there are other reasons for improperly displayed characters:

  • your client encoding can be wrong
  • your terminal can be mis-configured

To see the actual bytes that the database stores, make sure that you set the parameter bytea_output to its default value hex, then convert the offending string to the data type bytea:

The first three characters are ASCII characters and are encoded as single bytes. The “ö” now has become four bytes (c3 83 c2 b6), where the first two bytes encode “Ô and the last two bytes encode “¶”.

Fixing bad encoding that is consistent

Since you cannot change the server encoding, the only way to do that is to export and import the database. How exactly you do that depends on how bad the mess is.

If all your non-ASCII characters have the same problem (the client encoding was always set to the same wrong value), the case is comparatively simple. You use plain format pg_dump to export the database while explicitly setting the bad client encoding. In our case, that would look like

Then you edit dump.sql with a text editor and change the line that sets the client encoding to the value that actually matches the data:

Now import the modified dump.sql using psql, and you are done.

Fixing inconsistently encoded data

Adding a row with a different encoding to the test table

We import the second SQL script to get a second, correctly encoded row in our test table:

With the client encoding that is correct for my terminal, the data now look like this:

Clearly, this case is more difficult than the previous one. There is no absolutely safe way to determine which data are correctly encoded and which ones are not. However, we can repair most of the damage as follows.

Exporting the data in parts

First, we create four dump files:

The option --on-conflict-do-nothing is new in PostgreSQL v12. You can omit that option if you have to work with an older version of pg_dump, but in that case you will have to ignore errors during the import of data_utf8.sql.

The next step is to edit data_latin1.sql and change the line that sets client_encoding to the value UTF8.

Importing the data into a new database

Then simply import the four files in a slightly different order into a new database using psql:

During the import of data_latin1.sql, we get an error for the rows with correctly encoded data, but psql imports the rows with incorrectly encoded data correctly. Importing postdata.sql creates the primary key constraint for the table. After that, importing data_utf8.sql will skip the rows that we already managed to import, but insert the rows that are missing after the import of data_latin1.sql.

The result is convincing:

Conclusion

Incorrectly encoded strings are an annoying problem in a database. If we know what encodings are present in the database, we can fix most of the corruption with a carefully designed series of exports and imports. The downtime incurred by such data repair is something we cannot avoid. It is possible that some incorrectly encoded strings remain; you will have to find and fix those manually.

If you are interested in data corruption, you may want to read my articles about collations and data corruption or what to watch out for if you want to avoid data corruption.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Andrew Gierth
Andrew Gierth
1 year ago

Casting text to bytea isn't a good way to see how values are encoded; many text values will give errors on the cast. (It doesn't do what you apparently think: it expects the text value to be the input representation of a bytea value.)

Instead use convert_to(col,'SQL_ASCII') to see the bytes of the text value; SQL_ASCII in this context means "no conversion", so you get the stored bytes regardless of what the server encoding is.

There are a bunch of ways to fix certain kinds of systematic errors without needing to do wholesale dump/restore. The most common errors I've seen from people on IRC are: double-encoded UTF8, and mixed UTF8 and Latin1 or Win1252 in an SQL_ASCII or LATIN1 db. These are both fixable in an automated way with very high reliability, thanks to UTF8's properties.

laurenz
laurenz
1 year ago
Reply to  Andrew Gierth

Thanks for the tip with convert_to(), that is indeed better.

It is certainly also possible to run UPDATEs with a WHERE condition that finds the problematic strings. That is likely preferable if you know which columns are affected. My method of dump and restore is a simple and dumb approach to the double encoding problem; it might be a good idea if much of the data are affected or if you don't know which columns are affected.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram