
Python3 UnicodeDecodeError with readlines() method

I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python 3 and tweepy in a virtualenv on my shared server space. This is the part of the code that seems to have trouble:

    #!/foo/env/bin/python3

    import re
    import tweepy, time, sys

    argfile = str(sys.argv[1])

    filename=open(argfile, 'r')
    f=filename.readlines()
    filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128) 

The traceback points specifically to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

r_e_cur asked Jan 27 '16 04:01



2 Answers

I think the best answer (in Python 3) is to use the errors= parameter:

    with open('evil_unicode.txt', 'r', errors='replace') as f:
        lines = f.readlines()

Proof:

    >>> s = b'\xe5abc\nline2\nline3'
    >>> with open('evil_unicode.txt','wb') as f:
    ...     f.write(s)
    ...
    16
    >>> with open('evil_unicode.txt', 'r') as f:
    ...     lines = f.readlines()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
    >>> with open('evil_unicode.txt', 'r', errors='replace') as f:
    ...     lines = f.readlines()
    ...
    >>> lines
    ['�abc\n', 'line2\n', 'line3']
    >>>

Note that errors= can be set to 'replace' or 'ignore'. Here's what 'ignore' looks like:

    >>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
    ...     lines = f.readlines()
    ...
    >>> lines
    ['abc\n', 'line2\n', 'line3']
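
Both handlers shown above are lossy: the offending byte is either swapped for U+FFFD or dropped. As a rough sketch of how the handlers compare, the snippet below decodes the same bytes through an in-memory text wrapper instead of a file on disk. The helper name read_lines and the pinned encoding="utf-8" are assumptions made for illustration, and 'backslashreplace' is a third standard handler (usable for decoding since Python 3.5) that keeps the bad byte visible as an escape.

```python
import io

raw = b"\xe5abc\nline2\nline3"  # same bytes as in the session above


def read_lines(data, errors):
    """Decode bytes as UTF-8 with the given error handler and return the lines.

    Hypothetical helper: TextIOWrapper is the same text layer open() uses,
    so readlines() behaves as it would on a real file.
    """
    with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", errors=errors) as text:
        return text.readlines()


replaced = read_lines(raw, "replace")           # bad byte becomes U+FFFD
ignored = read_lines(raw, "ignore")             # bad byte is silently dropped
escaped = read_lines(raw, "backslashreplace")   # bad byte kept as a literal \xe5 escape

for name, lines in [("replace", replaced), ("ignore", ignored), ("backslashreplace", escaped)]:
    print(name, lines)
```

If the data matters, an error handler is probably best treated as a fallback; supplying the correct encoding avoids the loss altogether.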

caleb answered Sep 17 '22 14:09


Your default encoding appears to be ASCII, whereas the input is in some other encoding (UTF-8 is the usual suspect, though a leading 0xfe byte is not valid UTF-8 and can indicate a UTF-16 byte order mark). When Python hits non-ASCII bytes in the input, it throws the exception. It's not so much that readlines itself is responsible for the problem; rather, it triggers the read and decode, and the decode is what fails.
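
To see this on your own machine, here is a small sketch that prints the encoding text-mode open() falls back to and then reproduces the failure. It assumes that locale.getpreferredencoding(False) reflects that default on your Python version, and it forces encoding="ascii" to imitate the asker's environment; the sample file and its contents are made up for the demonstration.

```python
import locale
import os
import tempfile

# Encoding used by open() in text mode when none is passed (assumption:
# no UTF-8 mode or other override is active). On a shared server with a
# C/POSIX locale this is often an ASCII codec.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

error_message = None
with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical sample file: UTF-8 text with one non-ASCII character.
    sample_path = os.path.join(tmpdir, "sample.txt")
    with open(sample_path, "wb") as fh:
        fh.write("caf\u00e9\nplain\n".encode("utf-8"))

    try:
        # open() succeeds; the decode only happens once data is read.
        with open(sample_path, "r", encoding="ascii") as fh:
            fh.readlines()
    except UnicodeDecodeError as exc:
        error_message = str(exc)

print(error_message)
```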

It's an easy fix, though: the built-in open in Python 3 lets you provide the known encoding of the input, replacing the default (ASCII in your case) with any other recognized encoding. Doing so lets you keep reading str (rather than the significantly different bytes objects that hold raw binary data), while Python does the work of converting raw disk bytes to true text data:

    # Using with statement closes the file for us without needing to remember to close
    # explicitly, and closes even when exceptions occur
    with open(argfile, encoding='utf-8') as inf:
        f = inf.readlines()
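
If the encoding isn't known up front, one possible approach is to peek at the first few bytes for a byte order mark (BOM) before choosing what to pass as encoding=. The sketch below is only a heuristic: the function name guess_encoding_from_bom and the UTF-8 fallback are assumptions for illustration, and a file without a BOM tells you nothing this way.

```python
import codecs
import os
import tempfile


def guess_encoding_from_bom(path, fallback="utf-8"):
    """Return a codec name based on a leading BOM, or the fallback if none is found."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    # Check UTF-32 first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    return fallback


with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical sample: the "utf-16" codec writes a BOM in front of the text.
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as fh:
        fh.write("h\u00e9llo\nworld\n".encode("utf-16"))

    detected = guess_encoding_from_bom(path)
    with open(path, encoding=detected) as inf:
        lines = inf.readlines()

print(detected, lines)
```

The "utf-16", "utf-32" and "utf-8-sig" codecs all consume the BOM while decoding, so it does not end up in the first line.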

ShadowRanger answered Sep 16 '22 14:09