
invalid byte sequence for encoding "UTF8"

People also ask

What is invalid byte sequence?

Explanation: This error occurs when you send text data, but either the source encoding doesn't match the one currently set on the database, or the text stream contains binary data such as NUL bytes, which are not allowed within a string.
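For illustration only, here is a minimal way to trigger the error from a shell, assuming a UTF8-encoded database and a made-up one-column text table called tmp; the byte 0xE9 is 'é' in latin1, but on its own it is not a complete UTF-8 sequence, so the copy fails with an error much like the one above:

$ printf 'caf\xe9\n' > bad.csv
$ psql -c "\copy tmp from 'bad.csv' csv"
ERROR:  invalid byte sequence for encoding "UTF8": 0xe9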

What is utf8?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
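As a small illustration of that transformation (assuming a UTF-8 terminal), you can look at the bytes a single character turns into:

$ printf 'é' | od -An -tx1
 c3 a9

0xC3 0xA9 is the two-byte UTF-8 encoding of é; in latin1 the same character is the single byte 0xE9, which is exactly the kind of mismatch that produces the COPY error.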

Can UTF-8 contain null?

NUL (U+0000) is a valid Unicode character, and its UTF-8 encoding is a single zero byte. If specific languages and their standard libraries choose to treat it as a string terminator (C, I'm looking at you), well, then fine. But it's still valid Unicode.
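Note that PostgreSQL itself never allows a NUL byte inside a text value, whatever the encoding; you can see that from psql (the exact wording may vary by version):

psql=# select convert_from('\x00'::bytea, 'UTF8');
ERROR:  invalid byte sequence for encoding "UTF8": 0x00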


If you need to store UTF8 data in your database, you need a database that accepts UTF8. You can check the encoding of your database in pgAdmin. Just right-click the database, and select "Properties".
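If you'd rather check from psql than pgAdmin, the same information is available there (\l also lists an Encoding column for every database):

psql=# show server_encoding;
 server_encoding
-----------------
 UTF8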

But that error seems to be telling you there's some invalid UTF8 data in your source file. That means that the copy utility has detected or guessed that you're feeding it a UTF8 file.

If you're running under some variant of Unix, you can check the encoding (more or less) with the file utility.

$ file yourfilename
yourfilename: UTF-8 Unicode English text

(I think that will work on Macs in the terminal, too.) Not sure how to do that under Windows.

If you use that same utility on a file that came from a Windows system (that is, a file that's not encoded in UTF8), it will probably show something like this:

$ file yourfilename
yourfilename: ASCII text, with CRLF line terminators

If things stay weird, you might try to convert your input data to a known encoding, to change your client's encoding, or both. (We're really stretching the limits of my knowledge about encodings.)

You can use the iconv utility to change encoding of the input data.

iconv -f original_charset -t utf-8 originalfile > newfile
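For example, if the file turned out to be Windows-1251, the call might look like this; iconv -l lists the charset names your system understands:

$ iconv -f WINDOWS-1251 -t UTF-8 yourfilename > yourfilename.utf8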

You can change the psql (client) encoding by following the instructions on the Character Set Support page. On that page, search for the phrase "To enable automatic character set conversion".
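In short, that comes down to telling the client which encoding your file or terminal actually uses; WIN1251 and yourdatabase below are just placeholders:

psql=# \encoding WIN1251

Or set it in the environment before starting psql:

$ PGCLIENTENCODING=WIN1251 psql yourdatabase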


psql=# copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251';

Adding the encoding option worked in my case.
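If the file sits on your own machine rather than on the database server, the client-side \copy variant should accept the same options (just a sketch, reusing the table and path from the example above):

psql=# \copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251'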


If you are OK with discarding non-convertible characters, you can use the -c flag:

iconv -c -t utf8 filename.csv > filename.utf8.csv

and then copy the converted file into your table.
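Put together, the whole round trip might look like this (tmp is a placeholder table name):

$ iconv -c -t utf8 filename.csv > filename.utf8.csv
$ psql -c "\copy tmp from 'filename.utf8.csv' with csv header"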


Apparently I can just set the encoding on the fly:

 set client_encoding to 'latin1'

And then re-run the query. Not sure what encoding I should be using, though.
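If you're unsure, the file utility mentioned earlier can usually guess the charset; -i is the GNU spelling (macOS uses a capital -I), and iso-8859-1 is what PostgreSQL calls latin1:

$ file -i yourfilename
yourfilename: text/plain; charset=iso-8859-1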


latin1 made the characters legible, but most of the accented characters were in upper-case where they shouldn't have been. I assumed this was due to a bad encoding, but I think it's actually the data that was just bad. I ended up keeping the latin1 encoding, but pre-processing the data to fix the casing issues.


This error means that the encoding of the records in the file differs from the connection encoding. In this case iconv may still return an error, sometimes even with the //IGNORE flag:

iconv -f ASCII -t utf-8//IGNORE < b.txt > a.txt

iconv: illegal input sequence at position (some number)

The trick is to find the incorrect characters and replace them. To do that on Linux, use the vim editor:

vim (your text file), press the "Esc" key, and type ":goto (number returned by iconv)"

To find non-ASCII characters you may use the following command:

grep --color='auto' -P "[\x80-\xFF]" yourfilename

If you remove the incorrect characters, check whether you still need to convert your file: the problem may already be solved.