I'm trying to read segments of a CSV file into a pandas DataFrame, and I run into trouble when I set nrows above a certain value. My CSV file is split into segments with different headers and types of data, so I went through the file, found the line number where each segment starts, and saved those numbers. When I do:
pd.io.parsers.read_csv('filename',skiprows=40, nrows=12646)
It works fine. Any more rows, and it throws an error:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71
It's true that line 13897 has that many fields; that's why I'm trying to use nrows and skiprows. I can find the last row that pandas will read, and it doesn't look any different from the rest. Looking at the file in a hex editor, I still don't see any difference.
I've also tried it with another CSV file, and I get similar results:
pd.io.parsers.read_csv('file2',skiprows=112, nrows=18524)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187
But:
pd.io.parsers.read_csv('file2',skiprows=112, nrows=18525)
gives:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71
Is there something I'm missing? Is there another way to do this?
I'm using pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.
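You can set error_bad_lines=False and warn_bad_lines=False so that lines with too many fields are dropped instead of raising an error or printing a warning (pandas 1.3 and later replace both parameters with on_bad_lines):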
import pandas as pd
from io import StringIO  # Python 3; in Python 2 this module was named StringIO
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
pd.read_csv(data, warn_bad_lines=False, error_bad_lines=False)
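Another option, if you would rather not drop lines silently, is to cut the segment out of the file before any parser sees it. The sketch below assumes the problem is that the parser reads past the rows you asked for, which the question does not confirm. The function name read_segment and the sample data are made up for illustration, and it uses only the standard library:

```python
import csv
import io
from itertools import islice


def read_segment(lines, skiprows, nrows):
    """Return (header, rows) for one segment of a multi-segment CSV.

    `lines` is any iterable of text lines, for example an open file.
    `skiprows` lines are skipped, the next line is taken as the header,
    and up to `nrows` data lines after it are parsed.
    """
    # islice stops after the requested lines, so later segments with a
    # different field count are never handed to the parser.
    segment = islice(lines, skiprows, skiprows + 1 + nrows)
    reader = csv.reader(segment)
    header = next(reader)
    rows = list(reader)
    return header, rows


# Hypothetical two-segment file: 3 fields first, then 4 fields.
sample = io.StringIO(
    "junk line\n"
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
    "w,x,y,z\n"
    "1,2,3,4\n"
)
header, rows = read_segment(sample, skiprows=1, nrows=2)
print(header)
print(rows)
```

The same islice call can feed pandas instead: join the sliced lines into one string, wrap it in io.StringIO, and pass that to read_csv. Note that this counts physical lines, so a quoted field containing a newline would throw the count off.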