I keep getting this error while reading a text file. Is it possible to handle or ignore it and proceed?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>


Unicode error handling with Python 3's readlines()

2 Answers

In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it to be an instance of io.TextIOWrapper; if it isn't, consider wrapping it in one!). Also consider passing an explicit encoding rather than relying on the platform default, which is where the 'charmap' codec in the error message comes from (when you aren't sure, utf-8 is a good place to start).

For instance:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
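To see what the two error handlers actually do, here is a small self-contained sketch. The file name sample.txt and its contents are made up for the demonstration; only the byte 0x81 is taken from the error message in the question. The last part shows one way to wrap an existing binary stream in io.TextIOWrapper.

```python
import io
import os
import tempfile

# Hypothetical sample data: plain ASCII plus the stray byte 0x81,
# which is not valid UTF-8 on its own.
raw = b"first line\nbad \x81 byte\n"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(raw)

# errors='ignore' silently drops bytes that cannot be decoded.
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.readlines()

# errors='replace' substitutes U+FFFD (the replacement character).
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.readlines()

# A stream opened in binary mode can be wrapped in io.TextIOWrapper instead.
with io.TextIOWrapper(open(path, "rb"), encoding="utf-8", errors="replace") as wrapped:
    wrapped_lines = wrapped.readlines()

print(ignored)
print(replaced)
```

With 'ignore' the bad byte vanishes without a trace, while 'replace' leaves a visible marker, which is usually easier to debug later.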

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace undecodable bytes with the Unicode replacement character, or

your_string.decode('utf-8', 'ignore')

to simply ignore them.
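The difference between the two handlers is easy to check on a byte string. This is a sketch with invented sample bytes; the b'...' literal is accepted by both Python 2.6+ and Python 3, so the same lines should run on either.

```python
# Invented sample: valid UTF-8 for "café", then a stray 0x81 byte.
data = b"caf\xc3\xa9 \x81 end"

# 'replace' marks the undecodable byte with U+FFFD.
replaced = data.decode("utf-8", "replace")

# 'ignore' drops the undecodable byte entirely.
ignored = data.decode("utf-8", "ignore")

# The default handler ('strict') raises instead of guessing.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```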

That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
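When the real encoding is unknown, one possible approach is to try a short list of candidates in strict mode and fall back to a lossy decode only if none of them fits. The helper name decode_with_fallback and the candidate list below are assumptions for illustration, not part of the answer above, and a clean decode does not prove the guess was right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    If no candidate works, decode with the first one using errors='replace'
    and return None as the encoding to signal a lossy result.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return raw.decode(candidates[0], "replace"), None


# 0xE9 alone is not valid UTF-8, but it is "é" in cp1252.
text, used = decode_with_fallback(b"caf\xe9")

# 0x81 is undefined in Python's cp1252 table as well, so this falls through.
lossy_text, lossy_used = decode_with_fallback(b"\x81")
```

Putting a single-byte encoding such as latin-1 in the list would make every input "succeed", because it maps all 256 byte values, so it only makes sense as a deliberate last resort.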


answered Oct 01 '22 05:10

Charles Duffy

You should open the file with the codecs module to make sure that it is interpreted as UTF-8.

import codecs

# filename is the path to your text file
with codecs.open(filename, 'r', encoding='utf-8') as fd:
    data = fd.read()
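Opening the file as UTF-8 will still raise if it contains bytes that are not valid UTF-8. codecs.open also accepts an errors argument, so the two answers can be combined. A sketch, with a made-up temporary file standing in for filename:

```python
import codecs
import os
import tempfile

# Made-up file containing the undecodable byte from the question.
filename = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(filename, "wb") as f:
    f.write(b"ok \x81\n")

# errors='replace' keeps reading instead of raising UnicodeDecodeError.
with codecs.open(filename, "r", encoding="utf-8", errors="replace") as fd:
    data = fd.read()
```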

answered Oct 01 '22 05:10

optixx


Tags: python, text, python-3.x, encoding

Bob
