Python: Unicode and "\xe2\x80\x99" driving me batty

So I have a .txt file from Google Docs containing some lines from David Foster Wallace's "Oblivion". Using:

with open("oblivion.txt", "r", 0) as bookFile:
    wordList = []
    for line in bookFile:
        wordList.append(line)

and returning & printing the wordList I get:

"surgery on the crow\xe2\x80\x99s feet around her eyes." 

(and a lot of the text is truncated). However, if instead of appending to wordList I simply do

for line in bookFile:
    print line

everything turns out fine! The same goes for .read()'ing the file - the resulting str doesn't have the crazy byte representation, but then I can't manipulate it the way I want to.

Where do I .encode() or .decode() or what? Using Python 2 because 3 was giving me some I/O buffer error. Thanks.

Luke McPuke asked Jul 01 '17 10:07


1 Answer

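Open the file with an explicit UTF-8 encoding. Python 2's built-in open() does not accept an encoding argument, so use io.open, which does (and which also works in Python 3, where it is an alias for open):

import io

with io.open("oblivion.txt", "r", encoding="utf-8") as bookFile:
    wordList = bookFile.readlines()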
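
As for why the list looked wrong while print line looked fine: "\xe2\x80\x99" is the UTF-8 byte sequence for the right single quotation mark (U+2019). Printing a list shows the repr() of each element, escapes included, whereas printing a single line writes the text itself and lets the terminal render it. So the escapes in the printed list are a display detail, not corruption, and a decoded string will still show an escape (\u2019) in a Python 2 list repr. Below is a minimal sketch of the difference between raw bytes and decoded text. It is written for Python 3, and the sample line and temporary file are hypothetical stand-ins for oblivion.txt.

```python
import io
import os
import tempfile

# Hypothetical sample line standing in for the contents of oblivion.txt.
sample = u"surgery on the crow\u2019s feet around her eyes.\n"

# Write the sample to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Binary mode returns raw bytes: the apostrophe is three bytes long.
with io.open(path, "rb") as f:
    raw_lines = f.readlines()

# Text mode with an explicit encoding returns decoded text:
# the apostrophe is a single character.
with io.open(path, "r", encoding="utf-8") as f:
    text_lines = f.readlines()

os.remove(path)

# Printing the list shows each element's repr(), escapes included.
print(raw_lines)

# The decoded line is two units shorter than the raw one (3 bytes -> 1 character).
print(len(raw_lines[0]), len(text_lines[0]))
```

If you stay with plain open() in Python 2, calling line.decode("utf-8") on each line before appending gives the same decoded text.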

Rahul answered Sep 23 '22 04:09