pandas read_csv encoding weird character

Question

I tried to read my dataset, which is a text file, using pandas. However, some characters are not decoded correctly: I get ??? in place of apostrophes.

What should I do to read my file with the correct encoding? I've tried:

encoding = "utf8", but I got UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2044: unexpected end of data.
encoding = "latin1", but this gave me a lot of ???
encoding = "ISO-8859-1" or "ISO-8859-2", but these gave the same result as specifying no encoding at all.

When I open my data in Sublime, I see the characters â€™.

UPDATED: But when I access the entry using loc I got something like \u0102\u02d8\xe2\x82\u0179\xc2\u015, \u0102\u02d8\xe2\x82\u0179\xe2\x84\u02d8

Andy Hayden · Accepted Answer

You may be able to determine the encoding with chardet:

$ pip install chardet

>>> import urllib.request  # Python 3; in Python 2 this call was urllib.urlopen
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

The basic usage section of the chardet documentation also shows how to infer the encoding of large files, e.g. files too large to read into memory: it reads the file only until it is confident enough about the encoding.
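For a file on disk, the "read until confident" approach can be sketched roughly as follows, using chardet's UniversalDetector. The function name detect_encoding, the chunk_size default, and the optional detector argument (a hook for swapping in another detector object) are my own choices for illustration, not part of chardet or pandas.

```python
def detect_encoding(path, chunk_size=64 * 1024, detector=None):
    """Guess a file's encoding by feeding it to a detector chunk by chunk.

    `detect_encoding`, `chunk_size` and `detector` are illustrative names.
    """
    if detector is None:
        # Imported lazily so that chardet is only required on this code path.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    with open(path, "rb") as f:  # binary mode: the detector needs raw bytes
        # iter(callable, sentinel) yields chunks until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # the detector is confident enough; stop early
                break
    detector.close()  # finalize the guess
    return detector.result  # for chardet: a dict with 'encoding' and 'confidence'
```

If the guess looks plausible, something like pd.read_csv(path, encoding=detect_encoding(path)["encoding"]) would be the next thing to try. Bear in mind the result is only a statistical guess, and 'encoding' can be None when chardet cannot decide.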

According to this answer you should try encoding="ISO-8859-2":

My guess is that your input is encoded as ISO-8859-2 which contains Ă as 0xC3.

Note: Sublime may not infer the encoding correctly either, so take its output with a pinch of salt. It's best to check with your vendor (wherever you're getting the file from) what the actual encoding is.
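You can also sanity-check a candidate encoding yourself by decoding raw bytes with plain Python, rather than trusting an editor. The sketch below shows the byte 0xC3 from the error message under two of the encodings mentioned above. Then, assuming the original apostrophe was the curly quote U+2019 (a guess on my part, the question doesn't confirm it), it shows how UTF-8 bytes read as Windows-1252 produce the â€™ sequence seen in Sublime.

```python
# The byte 0xC3 from the UnicodeDecodeError, under two single-byte encodings.
byte = b"\xc3"
as_latin2 = byte.decode("iso-8859-2")  # U+0102, A with breve
as_latin1 = byte.decode("latin1")      # U+00C3, A with tilde

# Assumption: the apostrophe in the source text is U+2019 (right single quote).
apostrophe = "\u2019"
raw = apostrophe.encode("utf-8")       # three bytes: e2 80 99

# Reading those UTF-8 bytes as Windows-1252 gives three separate characters.
garbled = raw.decode("cp1252")

# Reversing the mistake recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(as_latin2), repr(as_latin1))
print(raw, repr(garbled), repr(repaired))
```

If your file round-trips like this, the text was probably UTF-8 that got read with a Windows code page at some point, but that is only one possible explanation for what you're seeing.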

Tags:

python

pandas

csv

encoding

utf-8

user3362840
