
Pandas skiprows beyond 900000 fails

Tags:

python

pandas

My CSV file contains 6 million records, and I am trying to split it into multiple smaller files using skiprows. My pandas version is 0.12.0, and the code is:

pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000)

It works as long as skiprows is less than 900,000. Is this expected? If I do not use skiprows, nrows can go up to 5 million records. I have not yet tried beyond that, but will try this as well.
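As an illustrative sketch (the file and row counts here are small stand-ins for the real 6-million-row file), skiprows also accepts an iterable of row indices, which lets you skip a block of data rows while keeping the header row:

```python
import csv
import pandas as pd

# Build a small synthetic CSV (a stand-in for the real multi-million-row file).
with open("train_sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b"])                     # header row (file row 0)
    writer.writerows([i, i * 2] for i in range(100))

# skiprows can be an iterable of row indices: skip file rows 1..49
# (keeping the header at row 0), then read the next 20 data rows.
df = pd.read_csv("train_sample.csv", skiprows=range(1, 50), nrows=20)
print(len(df))          # 20
print(df["a"].iloc[0])  # 49 -- first data row that was not skipped
```

With the real file this would be e.g. `skiprows=range(1, 1_000_000), nrows=100_000` to read the rows after the first million.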

I tried a CSV splitter, but it does not handle the first entry properly, perhaps because each cell contains multiple lines of text.

EDIT: I was able to split it by reading the entire 7 GB file with pandas read_csv and writing it out in parts to multiple CSV files.

asked Nov 30 '13 by user644745

People also ask

What is the pandas bad line error?

error_bad_lines : bool, optional, default None. Lines with too many fields (e.g. a CSV line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, these "bad lines" are instead dropped from the DataFrame that is returned.

How many columns can pandas handle?

There is no fixed maximum number of columns; the issue is simply that you have run out of available memory on your computer. One way to fix it is to add more memory, but that is not a solid long-term solution (and might be quite expensive, too).

What does index_col=0 mean in pandas?

index_col lets you set which column is used as the index of the DataFrame. The default value is None, in which case pandas adds a new index starting from 0. It can be set to a column name or a column position, and that column is then used as the index.
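A short sketch of the difference (the column names here are hypothetical):

```python
import io
import pandas as pd

csv_text = "id,name\n10,alice\n20,bob\n"

# Default (index_col=None): pandas adds a fresh integer index 0, 1, ...
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.index.tolist())   # [0, 1]

# index_col=0: use the first column ("id") as the index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.index.tolist())   # [10, 20]
```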

What does error_bad_lines=False do?

If error_bad_lines is False and warn_bad_lines is True, a warning is output for each "bad line". (Only valid with the C parser.)
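As a sketch of the behavior: in pandas 1.3 and later, error_bad_lines/warn_bad_lines were replaced by the on_bad_lines parameter, which covers the same use case. Assuming a recent pandas:

```python
import io
import pandas as pd

# One line has an extra field ("oops"), which would normally raise a ParserError.
csv_text = "a,b\n1,2\n3,4,oops\n5,6\n"

# on_bad_lines="skip" (pandas >= 1.3) silently drops the malformed line;
# older versions used error_bad_lines=False for the same effect.
df = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(len(df))   # 2 -- the bad line was dropped
```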


1 Answer

The problem seems to be that you are specifying both nrows and chunksize. At least in pandas 0.14.0, using

pandas.read_csv(filename, nrows=some_number, chunksize=another_number)

returns a DataFrame (reading all the data at once), whereas

pandas.read_csv(filename, chunksize=another_number)

returns a TextFileReader that loads the file lazily.
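A quick sketch verifying this difference (in recent pandas versions, chunksize alone always produces the lazy reader; the exact interaction with nrows depends on your version):

```python
import io
import pandas as pd

csv_text = "x\n" + "\n".join(str(i) for i in range(10))

# With chunksize, read_csv returns a lazy reader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
print(type(reader).__name__)        # TextFileReader

# Iterating the reader yields DataFrames of at most chunksize rows each.
sizes = [len(chunk) for chunk in reader]
print(sizes)                        # [4, 4, 2]
```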

Splitting a CSV then works like this:

for i, chunk in enumerate(pandas.read_csv(filename, chunksize=your_chunk_size)):
    # Write each chunk to its own numbered file; reusing a single fixed
    # filename would overwrite it on every iteration.
    chunk.to_csv("chunk_{}.csv".format(i), index=False)
answered Oct 05 '22 by Matthias Ossadnik