I crawled the data and had to save the dataframe as UTF-16 (Unicode) because the Latin/Spanish words were displayed incorrectly when saved as UTF-8. I used the following code to save the dataframe:
df.to_csv("blogdata.csv", encoding = "utf-16", sep = "\t", index = False)
When I try to read the file to clean the data using the following code:
blogdata = pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv')
it shows the following error.
UnicodeDecodeError                        Traceback (most recent call last)
in ()
----> 1 blogdata = pd.read_csv('C:/Users/hyoungm/Downloads/blogdata.csv')
...
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I don't know how to save the original data without losing those Latin/Spanish words within English sentences, or how to read a Unicode data file. Can anybody please help me solve this issue?
Thank you very much!
There is a Python library which may help when the encoding is unknown: chardet
import chardet
filename = "blogdata.csv"  # path to the file whose encoding is unknown
with open(filename, 'rb') as file:
    print(chardet.detect(file.read()))
chardet.detect returns its best guess at the encoding together with a confidence score, and 'rb' opens the file in binary mode so that the raw bytes are passed to it.
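In this particular case the encoding is already known, because the file was written with encoding="utf-16" and sep="\t". The 0xff at position 0 in the error is consistent with the byte-order mark that starts a UTF-16 file, which the default UTF-8 decoder rejects. A minimal sketch of the round trip, assuming pandas is installed and using made-up sample rows and a temporary path in place of the real blog data:

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for the crawled blog data.
df = pd.DataFrame(
    {"post": ["El niño comió jalapeños", "plain English"], "likes": [3, 5]}
)

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "blogdata.csv")

# Same call as in the question: UTF-16, tab-separated, no index column.
df.to_csv(path, encoding="utf-16", sep="\t", index=False)

# Peek at the first two raw bytes to check for a UTF-16 byte-order mark.
with open(path, "rb") as f:
    head = f.read(2)
has_bom = head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Read the file back with the same encoding and separator it was written with.
blogdata = pd.read_csv(path, encoding="utf-16", sep="\t")
print(has_bom)
print(blogdata)
```

Both arguments matter when reading: without encoding="utf-16" the UnicodeDecodeError from the question comes back, and without sep="\t" pandas would look for commas instead of tabs. If the accented characters only looked wrong when the UTF-8 file was opened in a spreadsheet program (an assumption, since the question does not say where they looked wrong), writing and reading with encoding="utf-8-sig" may be another option worth trying.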