
UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

I am trying to read a CSV into a DataFrame, change the values in one column, write the changed values back to the same CSV with to_csv, and then read that CSV again into another DataFrame. On that second read I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte 

My code is:

    import pandas as pd

    df = pd.read_csv("D:\ss.csv")
    df.columns  # o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
    df['True'] = df['True'] + 2  # making changes to one column of type float
    df.to_csv("D:\ss.csv")  # updating that .csv

    df1 = pd.read_csv("D:\ss.csv")  # again trying to read that csv
    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

Please suggest how I can avoid the error and read that CSV into a DataFrame again.

I know that somewhere I am missing "encode = some codec type" or "decode = some type" while reading and writing the CSV, but I don't know exactly what should be changed, so I need help.

Satya asked Nov 20 '15 05:11




2 Answers

Known encoding

If you know the encoding of the file you want to read in, you can use

pd.read_csv('filename.txt', encoding='encoding') 

These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings

Unknown encoding

If you do not know the encoding, you can try chardet. It is not guaranteed to work, since detection is only an educated guess.

    import chardet
    import pandas as pd

    with open('filename.csv', 'rb') as f:
        result = chardet.detect(f.read())  # or readline if the file is large

    pd.read_csv('filename.csv', encoding=result['encoding'])
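If chardet is not available, one alternative is to try a short list of likely encodings in order and keep the first one that decodes the whole file. This is a different technique from chardet's statistical detection, and it is only a rough sketch: the function name guess_encoding, the candidate list, and the sample data are all assumptions for illustration. Note that latin-1 can decode any byte sequence, so it acts as a last resort and says nothing about whether the text is actually correct.

```python
import os
import tempfile

# Candidate encodings are an assumption; adjust the list to your data.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def guess_encoding(path, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next one
        return name
    return None


# Hypothetical sample file written as cp1252 bytes, so the ç is the single byte 0xE7
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("name\nFrançoise\n".encode("cp1252"))

detected = guess_encoding(tmp.name)
os.remove(tmp.name)
print(detected)
```

The returned name can then be passed as the encoding argument of pd.read_csv, in the same way as result['encoding'] above.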
MaxNoe answered Sep 18 '22 15:09


Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.

Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
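To see why a lone 0xE7 byte trips the UTF-8 decoder, here is a small sketch using only the standard library. The sample string is made up for illustration; it is not the asker's data.

```python
# Hypothetical sample value; in cp1252 the ç is stored as the single byte 0xE7
sample = "Françoise".encode("cp1252")
print(sample)

error = None
try:
    # In UTF-8, 0xE7 starts a three-byte sequence, but the next byte is a plain 'o'
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were written in recovers the text
decoded = sample.decode("cp1252")
print(decoded)
```

Latin-1 would decode this particular byte to ç as well, which is why both encodings are plausible candidates.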

Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:

df = pd.read_csv(r"D:\ss.csv", encoding="cp1252") 

Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
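A quick way to check the difference yourself (the path is just the hypothetical one from the paragraph above, and nothing is opened on disk):

```python
plain = "D:\new-ss.csv"   # \n is interpreted as a newline character
raw = r"D:\new-ss.csv"    # raw string: backslash and n stay as two characters

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # the raw string is one character longer
```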

Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
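Putting it together, the whole round trip might look something like the sketch below. It assumes the original file really is cp1252, which is still a guess, and it uses a temporary file with made-up data instead of D:\ss.csv. The round_trip helper is hypothetical. Writing with an explicit encoding="utf-8" means the second read works with pandas' default settings.

```python
import os
import tempfile

import pandas as pd


def round_trip(path, source_encoding="cp1252"):
    """Read a CSV in its original encoding, update a column, rewrite it as UTF-8."""
    df = pd.read_csv(path, encoding=source_encoding)
    df["True"] = df["True"] + 2
    # index=False keeps to_csv from adding an extra unnamed index column
    df.to_csv(path, index=False, encoding="utf-8")
    # read_csv expects UTF-8 by default, so no encoding argument is needed now
    return pd.read_csv(path)


# Hypothetical sample data written as cp1252 bytes to mimic the file in the question
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(
        "CUSTOMER_MAILID,False,True\nfrançois@example.com,1.0,3.0\n".encode("cp1252")
    )

result = round_trip(tmp.name)
os.remove(tmp.name)
print(result)
```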

rmunn answered Sep 18 '22 15:09