
Encoding Error in Pandas read_csv [duplicate]

Tags: pandas, csv, utf-8

I'm attempting to read a CSV file into a DataFrame in pandas. When I do, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 55: invalid start byte

This is the code that produces it:

import pandas as pd

location = r"C:\Users\khtad\Documents\test.csv"
df = pd.read_csv(location, header=0, quotechar='"')

This is on a Windows 7 Enterprise Service Pack 1 machine and it seems to apply to every CSV file I create. In this particular case the binary from location 55 is 00101001 and location 54 is 01110011, if that matters.

Saving the file as UTF-8 with a text editor doesn't seem to help. Adding the parameter encoding='utf-8' doesn't work either; it returns the same error.

What is the most likely cause of this error and are there any workarounds other than abandoning the DataFrame construct for the moment and using the csv module to read in the CSV line-by-line?

asked May 26 '15 15:05 by khtad


People also ask

What is encoding in read_csv?

The pandas read_csv() function has an argument called encoding that lets you specify the encoding to use when reading a file. (Source: Kaggle, character encoding.)

What is Unicode error in Pandas?

A pandas error such as "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6785: invalid start byte" can have several causes: the file uses a different encoding, it contains bad symbols, or it is corrupted.
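To make the "different encoding" case concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration and are not taken from the asker's file; they just contain the same 0x96 byte that the error message complains about.

```python
# Hypothetical sample: the text "cats", the single byte 0x96, then "dogs".
raw = b"cats \x96 dogs"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # In UTF-8, 0x96 is a continuation byte, so it cannot start a character.
    print(exc)

# cp1252 (the usual Windows code page for Western European text)
# maps 0x96 to an en dash, U+2013.
as_cp1252 = raw.decode("cp1252")

# latin1 accepts every byte value, but maps 0x96 to the control
# character U+0096 rather than to a visible dash.
as_latin1 = raw.decode("latin1")

print(repr(as_cp1252), repr(as_latin1))
```

If the byte turns into a sensible character under cp1252, that is a strong hint the file was written in a Windows code page rather than UTF-8.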

What encoding does Pandas use?

By default, pandas assumes that text is in UTF-8, because that encoding is so common.


1 Answer

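Try calling read_csv with encoding='latin1', encoding='iso-8859-1' or encoding='cp1252' (these are some of the encodings commonly found on Windows). Note that in Python, latin1 and iso-8859-1 are two names for the same codec. Byte 0x96 is not a valid start byte in UTF-8, but in cp1252 it is an en dash, which suggests the file was saved in a Windows code page.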
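The suggestion above can be sketched as a small helper that tries a few encodings in turn. The function name read_csv_with_fallback and the default encoding order are assumptions made for this example; they are not part of pandas. Only pd.read_csv and its encoding parameter come from the library.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin1"), **kwargs):
    """Hypothetical helper: return the first DataFrame that decodes cleanly.

    Extra keyword arguments are passed straight through to pd.read_csv.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error
```

With the code from the question, the call would look like df = read_csv_with_fallback(location, header=0, quotechar='"'). In this sketch cp1252 is tried before latin1 because latin1 accepts every byte and so never fails, which means it would hide a better match if it came first. A successful read only shows that the bytes could be decoded, not that the guess was right, so it is worth checking a few non-ASCII values by eye.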

answered Oct 02 '22 05:10 by maxymoo