UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

I am reading a CSV into a DataFrame, changing one column, writing the result back to the same CSV with to_csv, and then reading that CSV again into another DataFrame. On that last read I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

My code is:

    import pandas as pd

    df = pd.read_csv("D:\ss.csv")
    df.columns                      # o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
    df['True'] = df['True'] + 2     # making changes to one column of type float
    df.to_csv("D:\ss.csv")          # updating that .csv

    df1 = pd.read_csv("D:\ss.csv")  # again trying to read that csv
    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

Please suggest how I can avoid the error and read that CSV back into a DataFrame.

I suspect I am missing an encoding argument somewhere while reading from or writing to the CSV, but I don't know exactly what should be changed.

650

asked Nov 20 '15 05:11

Satya

2 Answers

Known encoding

If you know the encoding of the file you want to read, pass it to read_csv:

pd.read_csv('filename.txt', encoding='encoding')

These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
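As a rough sketch of the full round trip from the question, the example below builds a small cp1252-encoded CSV, reads it with that encoding, changes a column, writes it back as UTF-8, and reads it again. The file name, the sample e-mail address, and the choice of cp1252 are assumptions made up for illustration; the real file may use a different encoding.

```python
import os
import tempfile

import pandas as pd

# Create a small cp1252-encoded CSV in a temporary directory (made-up data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv")
with open(path, "wb") as f:
    f.write("CUSTOMER_MAILID,True\nfran\u00e7ois@example.com,1.5\n".encode("cp1252"))

# First read: name the encoding the file actually uses.
df = pd.read_csv(path, encoding="cp1252")
df["True"] = df["True"] + 2

# Write back as UTF-8 (index=False avoids adding an extra index column),
# then read the rewritten file with the same encoding.
df.to_csv(path, index=False, encoding="utf-8")
df_again = pd.read_csv(path, encoding="utf-8")
print(df_again)
```

Naming the encoding explicitly on both the write and the later read keeps the two sides in agreement, instead of relying on defaults.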

Unknown encoding

If you do not know the encoding, you can try to detect it with chardet. Detection is a statistical guess, so it is not guaranteed to give the right answer.

    import chardet
    import pandas as pd

    with open('filename.csv', 'rb') as f:
        result = chardet.detect(f.read())  # or readline if the file is large

    pd.read_csv('filename.csv', encoding=result['encoding'])
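If installing chardet is not an option, one alternative (not part of the answer above, just a sketch) is to try a short list of candidate encodings in order using only the standard library. The helper name `first_working_encoding` and the candidate list are hypothetical. Note that latin-1 can decode any byte sequence, so it only makes sense as the last candidate, and a successful decode does not prove the guess is right.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return name
    return None  # no candidate worked


# Made-up sample: an address containing ç, stored as cp1252 bytes.
sample = "fran\u00e7ais@example.com".encode("cp1252")
guess = first_working_encoding(sample)
print(guess)
```

The returned name can then be passed as the encoding argument of read_csv.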

189

answered Sep 18 '22 15:09

MaxNoe

Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.

Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
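To see why that byte trips the UTF-8 decoder, here is a small standard-library check. The two-byte sample is made up for illustration: 0xE7 followed by an ASCII letter, which is what a ç in the middle of a word looks like in Latin-1 or cp1252.

```python
raw = b"\xe7a"  # the byte 0xE7 followed by the ASCII letter "a"

# Both single-byte encodings map 0xE7 to ç.
print(raw.decode("cp1252"))
print(raw.decode("latin-1"))

# UTF-8 treats 0xE7 as the start of a three-byte sequence, so the
# following ASCII byte is not a valid continuation byte.
reason = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)

# In real UTF-8, ç is stored as two bytes instead.
print("\u00e7".encode("utf-8"))
```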

Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:

df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")

Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
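A quick way to see the difference, using the hypothetical file name new-ss.csv from the paragraph above:

```python
plain = "D:\new-ss.csv"   # \n is interpreted as a newline character
raw = r"D:\new-ss.csv"    # raw string: backslash and n stay as two characters

print(repr(plain))
print(repr(raw))
print("\n" in plain, "\n" in raw)
```

Forward slashes ("D:/new-ss.csv") also work in Python file paths on Windows and avoid the escaping question entirely.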

Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)

answered Sep 18 '22 15:09

rmunn

Tags: python, pandas, python-unicode, utf-8