UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

Question

I'm trying to load a CSV file using pd.read_csv, but I get the following Unicode error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

bobince · Accepted Answer

Unfortunately, CSV files have no built-in method of signalling character encoding.

read_csv defaults to assuming that the bytes in the CSV file are text encoded as UTF-8. This raises UnicodeDecodeError if the file uses some other encoding and its bytes don't happen to form valid UTF-8 sequences. (If by luck they did happen to be valid UTF-8, you wouldn't get the error, but you'd still get the wrong characters for anything non-ASCII, which would be worse really.)
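To make both cases concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration (they are not from the asker's file), and try_decode is a hypothetical helper name. The first sample puts byte 0xcc at position 3, as in the error message; the second is a cp1252 byte pair that happens to be valid UTF-8 as well.

```python
# Hypothetical sample: "abcÌd" saved as Windows-1252, so the byte at index 3 is 0xcc.
raw = "abc\u00ccd".encode("cp1252")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason} at position {exc.start}"


# 0xcc announces a two-byte UTF-8 sequence, but the next byte ("d") is not
# a continuation byte, so the UTF-8 decoder gives up.
utf8_result = try_decode(raw, "utf-8")
cp1252_result = try_decode(raw, "cp1252")
print(utf8_result)
print(cp1252_result)

# The "lucky" case: the cp1252 text "Ã©" is stored as bytes 0xc3 0xa9, which
# is also a valid UTF-8 sequence. No error is raised, but the text is wrong.
lucky = "\u00c3\u00a9".encode("cp1252")
silently_wrong = lucky.decode("utf-8")
print(repr(silently_wrong))
```

The second case is the nastier one, since nothing tells you the data was misread.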

It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where the file came from. For example, if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252, and you could read it with:

import pandas as pd

df = pd.read_csv('../filename.csv', encoding='cp1252')
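If you really don't know where the file came from, one way to do the "guessing" is to try a short list of candidate encodings in order and keep the first one that decodes without error. This is only a rough sketch: first_decodable_encoding and the candidate list are assumptions for illustration, not anything pandas provides. Note that a successful decode doesn't prove the guess is right, because single-byte encodings such as cp1252 accept almost any byte sequence, so you still need to eyeball the result.

```python
from typing import Optional, Sequence


def first_decodable_encoding(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252"),  # assumed candidate list
) -> Optional[str]:
    """Return the first candidate that decodes `data` cleanly, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return encoding
    return None


# Hypothetical sample bytes with 0xcc at position 3, as in the error message.
guess = first_decodable_encoding(b"abc\xccd")
print(guess)
```

You would then read a sample of the real file in binary mode, e.g. open(path, 'rb').read(), run it through the helper, and pass the result to read_csv as encoding=guess. Put UTF-8 first in the list: it is strict, so data that decodes cleanly as UTF-8 is rarely anything else.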

Tags:

pandas

csv

unicode

python-unicode

load

Josephine M. Ho
