
UnicodeDecodeError when reading a text file

Tags:

python

I am a beginner to Python (I am using 3.4). This is the relevant part of my code.

fileObject = open("countable nouns raw.txt", "rt")
bigString = fileObject.read()
fileObject.close()

Whenever I try to read this file I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 82273: character maps to <undefined>

I have been reading around and it seems to be something to do with my default encoding not matching the text file encoding. I've read in another post that you can use this method to read a file with a specific encoding:

import codecs
f = codecs.open("file.txt", "r", "utf-8")

But you have to know it in advance. The thing is I don't know how the text file is encoded. A few posts suggested using Chardet. I've installed it but I have no idea how to get it to read a text file.

Any ideas on how to get around this??

Sid asked Oct 19 '22 09:10



1 Answer

There is no need to use codecs.open(); that's advice for Python 2.

In Python 3 open() takes an encoding argument:

fileObject = open("countable nouns raw.txt", "rt", encoding='utf8')
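
As a minimal sketch of why the argument matters: the byte 0x9d from the traceback has no mapping in cp1252 (a common Windows default, reported as 'charmap'), yet it is the last byte of a right double quotation mark in UTF-8. Whether the file in the question is really UTF-8 is an assumption; the sample file and its contents below are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file. U+201D (right double quotation mark) is encoded
# in UTF-8 as the three bytes E2 80 9D.
sample_text = "She said \u201chello\u201d\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(sample_text.encode("utf-8"))

# Decoding the same bytes as cp1252 fails on 0x9d, as in the question.
try:
    with open(path, "rt", encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# Naming the codec that was actually used to write the file works.
# The with statement also closes the file, so no explicit close() is needed.
with open(path, "rt", encoding="utf-8") as f:
    big_string = f.read()
```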

This does require that you know what codec was used for the file, of course. Generally speaking, there is no easy way for Python to figure that out; individual file formats may include codec information or have standardised on a given codec, but if all you have is a generic text file, you'll have to figure out what created it and what codec that program used to write the data.
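
If the origin of the file really cannot be traced, one fallback is to try a short list of likely codecs in order. The sketch below is only that: `read_with_fallback` and its default candidate list are hypothetical, not part of Python or any library, and a decode that succeeds does not prove the guess was right.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes the file.

    Hypothetical helper: the candidate list is an assumption about which
    codecs are likely, not something Python can work out on its own.
    """
    with open(path, "rb") as f:
        raw = f.read()  # read bytes, so nothing is decoded yet
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode " + path)


# Usage on a made-up file: 0xe9 on its own is invalid UTF-8,
# but it is "é" in cp1252, so the second candidate is used.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"caf\xe9")

text, used = read_with_fallback(demo_path)
```

Put strict codecs such as UTF-8 first. A codec like latin-1 accepts every byte, so it never raises and would hide a wrong guess. A detector such as chardet works on the same raw bytes (chardet.detect() returns a guessed encoding with a confidence value), but it is also only guessing.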

Martijn Pieters answered Nov 04 '22 22:11
