Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode error handling with Python 3's readlines()

I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?

UnicodeEncodeError: ‘charmap’ codec can’t decode byte 0x81 in position 7827: character maps to undefined.

like image 678
Bob Avatar asked May 07 '12 19:05

Bob


People also ask

How do I fix Unicode errors in Python 3?

Try to run your code normally. If you get any error reading a Unicode text file you need to use the encoding='utf-8' parameter while reading the file.

What is Unicode error in Python?

When we use such a string as a parameter to any function, there is a possibility of the occurrence of an error. Such error is known as Unicode error in Python. We get such an error because any character after the Unicode escape sequence (“ \u ”) produces an error which is a typical error on windows.

How do I decode a UTF 8 string in Python?

To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.


2 Answers

In Python 3, pass an appropriate errors= value (such as errors=ignore or errors=replace) on creating your file object (presuming it to be a subclass of io.TextIOWrapper -- and if it isn't, consider wrapping it in one!); also, consider passing a more likely encoding than charmap (when you aren't sure, utf-8 is always a good place to start).

For instance:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace unhandled characters, or

your_string.decode('utf-8', 'ignore')

to simply ignore them.

That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.

like image 127
Charles Duffy Avatar answered Oct 01 '22 05:10

Charles Duffy


You should open the file with a codecs to make sure that the file gets interpreted as UTF8.

import codecs
fd = codecs.open(filename,'r',encoding='utf-8')
data = fd.read()
like image 1
optixx Avatar answered Oct 01 '22 05:10

optixx