 

Pandas dataframe read_csv on bad data

Tags: python, pandas, csv

I want to read in a very large CSV (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. That row is erroneous, so I need a way to ignore the extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other CSVs and want a generic solution. I couldn't find anything in read_csv, unfortunately. The code is as simple as this:

import pandas as pd

def loadCSV(filePath):
    dataframe = pd.read_csv(filePath, index_col=False,
                            encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys
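For reference, a toy reproduction of the failure (the input here is illustrative, not the actual file): when a row has more fields than the header, the C parser raises a ParserError and no DataFrame is returned.

import io
import pandas as pd

# A 3-column CSV whose third data row has one extra field.
bad_csv = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# Raises pandas.errors.ParserError:
#   "Error tokenizing data. C error: Expected 3 fields in line 3, saw 4"
pd.read_csv(bad_csv, index_col=False)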
asked Oct 30 '15 by Fonti


1 Answer

Pass error_bad_lines=False to skip erroneous rows:

error_bad_lines : boolean, default True
    Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)
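As a minimal sketch, here is the loader from the question with bad lines skipped. Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0, where on_bad_lines='skip' is the replacement:

import pandas as pd

def loadCSV(filePath):
    # Drop rows with too many fields instead of raising an exception.
    # On pandas >= 1.3 use on_bad_lines='skip' instead; error_bad_lines
    # was removed in pandas 2.0.
    dataframe = pd.read_csv(filePath, index_col=False,
                            encoding='iso-8859-1', nrows=1000,
                            error_bad_lines=False)
    datakeys = dataframe.keys()
    return dataframe, datakeys

Skipped lines are reported as warnings by default (warn_bad_lines=True on older pandas), so you can still see which rows were dropped.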

answered by EdChum