I tried to read my dataset, a text file, using pandas. However, some characters are not decoded correctly: I get ??? where there should be an apostrophe.
How can I read the file with the correct encoding? I've tried
encoding = "utf8" but I got UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2044: unexpected end of data.
encoding = "latin1" but this gave me a lot of ???
encoding = "ISO-8859-1" or "ISO-8859-2" but this gave the same result as specifying no encoding at all.
When I open my data in Sublime, I see the characters ’ where the apostrophe should be.
UPDATE: When I access the entry using loc, I get something like \u0102\u02d8\xe2\x82\u0179\xc2\u015, \u0102\u02d8\xe2\x82\u0179\xe2\x84\u02d8
You may be able to determine the encoding with chardet:
$ pip install chardet
>>> import urllib.request
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
The basic usage section of the chardet documentation also shows how to infer the encoding of large files, e.g. files too large to read into memory: the detector is fed the file piece by piece and stops once it is confident enough about the encoding.
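The incremental approach can be sketched roughly as follows, using chardet's UniversalDetector. The function name detect_file_encoding, the chunk size and the file name in the usage line are illustrative choices, not something from the question, and the result is only a statistical guess.

```python
def detect_file_encoding(path, chunk_size=64 * 1024):
    """Guess a file's encoding without loading the whole file into memory."""
    # chardet is a third-party package (pip install chardet); it is imported
    # here so the dependency only matters when the function is called.
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open(path, "rb") as handle:
        # Feed the raw bytes chunk by chunk until the file ends
        # or the detector reports that it is confident enough.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()  # finalizes the guess
    return detector.result  # a dict with 'encoding' and 'confidence' keys


# Hypothetical usage; "data.txt" stands in for the real file:
# guess = detect_file_encoding("data.txt")
# df = pandas.read_csv("data.txt", encoding=guess["encoding"])
```

A low confidence value is a hint to verify the guess some other way before trusting it.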
According to this answer you should try encoding="ISO-8859-2":
My guess is that your input is encoded as ISO-8859-2, which contains Ă as 0xC3.
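To see what this guess implies, the sketch below decodes the byte 0xC3 from the error message under a few single-byte codecs, using only the standard library. The second half rests on an assumption that the question does not confirm: it supposes the file holds the three characters Sublime displays, stored as UTF-8, and checks what reading those bytes as ISO-8859-2 would produce.

```python
# Byte reported in the UnicodeDecodeError from the question.
problem_byte = b"\xc3"

# Single-byte codecs never fail on 0xC3; they map it to different letters.
decoded = {codec: problem_byte.decode(codec)
           for codec in ("iso-8859-1", "iso-8859-2", "cp1252")}
for codec, char in decoded.items():
    print(f"{codec}: {char!a}")

# Assumption: the file contains the three characters that Sublime displays,
# stored as UTF-8. Reading those bytes as ISO-8859-2 gives another string.
shown_in_sublime = "\u00e2\u20ac\u2122"  # a-circumflex, euro sign, trade mark
misread = shown_in_sublime.encode("utf-8").decode("iso-8859-2")
print(ascii(misread))

# Undoing the assumed chain: back to bytes, decode as UTF-8, then undo a
# cp1252 misreading of a UTF-8 right single quotation mark (U+2019).
repaired = (misread.encode("iso-8859-2").decode("utf-8")
                   .encode("cp1252").decode("utf-8"))
print(ascii(repaired))
```

Under this assumption the second print matches the second escape sequence quoted in the question's update, which would suggest the apostrophes were already mis-decoded once before the file was saved. Treat that as a hypothesis to check against the raw bytes rather than as a diagnosis.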
Note: Sublime may not infer the encoding correctly either, so take its output with a pinch of salt. It's best to check with your vendor (wherever you're getting the file from) what the actual encoding is.
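One way to avoid relying on an editor's guess is to look at the raw bytes directly. The helper below is a minimal sketch using only the standard library; bytes_around and the file name in the usage line are hypothetical, and the position in the error message may refer to an internal read buffer rather than an offset in the file, so the window may need adjusting.

```python
def bytes_around(path, offset, width=16):
    """Return the raw bytes within `width` bytes on either side of `offset`."""
    start = max(0, offset - width)
    with open(path, "rb") as handle:
        handle.seek(start)
        # Read from `start` up to `width` bytes past `offset`.
        return handle.read(offset - start + width)


# Hypothetical usage; "data.txt" stands in for the real file:
# print(bytes_around("data.txt", 2044))
```

If the apostrophe shows up as the bytes e2 80 99, that part of the file is plain UTF-8; a longer sequence starting with c3 a2 would point to text that was mis-decoded and saved again.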