
Reading parts of ~13000 row CSV file with pandas read_csv and nrows

I'm trying to read segments of a CSV file into a pandas DataFrame, and I run into trouble when I set nrows above a certain value. My CSV file is split into segments with different headers and types of data, so I've gone through the file and recorded the line number where each segment starts. When I run:

pd.io.parsers.read_csv('filename',skiprows=40, nrows=12646)

it works fine. With any more rows, it throws an error:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71

It's true that line 13897 has that many fields; that's why I'm trying to use nrows and skiprows. I can find the last row that pandas will read, and it doesn't look any different from the rest. Looking at the file in a hex editor, I still don't see any difference.

I've also tried it with another CSV file, and I get similar results:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18524)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187

But:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18525)

gives:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71

Is there something I'm missing? Is there another way to do this?

I'm using: pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.

dooz asked Oct 22 '22 13:10



1 Answer

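You can pass warn_bad_lines=False and error_bad_lines=False so that malformed lines are skipped without raising an error or printing a warning (pandas 1.3 and later replace both options with on_bad_lines='skip'):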

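import pandas as pd
from io import StringIO  # Python 3; the separate StringIO module exists only in Python 2

# The fourth line has four fields instead of three, so it counts as a bad line.
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
# Skip bad lines silently instead of raising CParserError.
# On pandas 1.3+ the equivalent option is on_bad_lines='skip'.
pd.read_csv(data, warn_bad_lines=False, error_bad_lines=False)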
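
As for "another way to do this", one option is to cut the segment out of the file yourself and hand pandas only those lines, so the parser never sees the next segment with its different field count. The following is a minimal sketch and not something from the question: read_segment is a hypothetical helper name, and it assumes that skiprows points at the segment's header line and that no quoted field contains an embedded newline, since the slicing is purely line-based.

```python
import io
from itertools import islice

import pandas as pd


def read_segment(handle, skiprows, nrows):
    """Parse one segment from an open text file (hypothetical helper).

    skiprows -- number of lines to skip before the segment's header line
    nrows    -- number of data rows that follow the header
    """
    # Keep the header line plus nrows data lines and nothing after them.
    lines = list(islice(handle, skiprows, skiprows + 1 + nrows))
    return pd.read_csv(io.StringIO("".join(lines)))


# Small in-memory stand-in for a file holding two differently shaped segments.
sample = io.StringIO(
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
    "w,x,y,z\n"
    "6,7,8,9\n"
)
first = read_segment(sample, skiprows=0, nrows=2)
print(first)
```

For the file in the question, the call would be read_segment(fh, 40, 12646) inside a with open('filename', newline='') as fh: block, assuming line 41 is that segment's header. Unlike error_bad_lines=False, this does not drop any rows inside the segment; it only limits what the parser is given.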
HYRY answered Nov 02 '22 23:11
