
How to solve UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte in Python

Tags:

python

I crawled the data and had to save the dataframe as UTF-16 (Unicode), since the Latin/Spanish words were displayed incorrectly when saved as UTF-8. I used the following code to save the dataframe:

 df.to_csv("blogdata.csv", encoding = "utf-16", sep = "\t", index = False)

When I try to read the file to clean the data using the following code:

 blogdata = pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv')

it shows the following error.


UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 blogdata = pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv')

...

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I don't know how to save the original data without losing the Latin/Spanish words within English sentences, or how to read a Unicode data file. Can anybody please help me solve this issue?

Thank you very much!

Hyoungeun Moon asked Apr 07 '19 20:04


1 Answer

There is a Python library which may help when the encoding is unknown: chardet

import chardet

with open(filename, 'rb') as file:  # 'rb' reads the raw bytes without decoding
    print(chardet.detect(file.read()))

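chardet.detect returns a dict with its best guess at the encoding and a confidence score, and 'rb' opens the file in binary mode so that the raw bytes are inspected rather than decoded.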
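
The byte 0xff at position 0 in the error is consistent with a UTF-16 byte-order mark (FF FE), which Python's "utf-16" codec writes at the start of the output on little-endian machines. As a rough sketch that needs no third-party package, one could check the leading bytes for a BOM before choosing an encoding. The names `sniff_bom` and `_BOMS` below are hypothetical helpers made up for this illustration, and the sample text stands in for the real file, which is not available here.

```python
import codecs

# Byte-order marks and the Python codec names that consume them.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM also starts with FF FE.
# (_BOMS and sniff_bom are illustrative names, not part of any library.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on the leading BOM, or `default` if none is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# Reproduce the situation from the question with a small in-memory sample.
sample = "título\tautor\nEl niño\tPeña\n"
raw = sample.encode("utf-16")  # the "utf-16" codec prepends a BOM

try:
    raw.decode("utf-8")  # same mistake as reading the file with the default encoding
except UnicodeDecodeError as exc:
    print("utf-8 fails:", exc.reason)

encoding = sniff_bom(raw)
print("detected:", encoding)
print("round trip ok:", raw.decode(encoding) == sample)
```

For the file in the question, the encoding and separator are already known from the to_csv call, so passing the same values when reading should be enough: pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv', encoding='utf-16', sep='\t'). A BOM check like the one above only helps when the file starts with a BOM; for files without one, a statistical detector such as chardet is the more general option.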

Helen Batson answered Oct 28 '22 07:10