My CSV file contains 6 million records, and I am trying to split it into multiple smaller files using skiprows. My pandas version is 0.12.0, and the code is
pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000)
It works as long as skiprows is less than 900000. Any idea whether this is expected? If I do not use skiprows, nrows can go up to 5 million records. I have not yet tried beyond that, but will.
I tried a CSV splitter, but it does not handle the first entry properly, perhaps because each cell consists of multiple lines of code.
EDIT: I was able to split it by reading the entire 7 GB file with pandas read_csv and writing it out in parts to multiple CSV files.
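The chunked split described in the edit can be sketched like this (the file names and chunk size here are illustrative, not from the question):

```python
import pandas as pd

# Create a small stand-in for the large training file
# ("train_demo.csv" and the chunk size are made up for this demo).
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv(
    "train_demo.csv", index=False, header=False
)

# Read lazily in chunks and write each chunk to its own CSV,
# so the whole file never has to fit in memory at once.
reader = pd.read_csv("train_demo.csv", header=None, chunksize=4)
for i, chunk in enumerate(reader):
    chunk.to_csv("train_part_%d.csv" % i, index=False, header=False)
# Produces train_part_0.csv, train_part_1.csv, train_part_2.csv
```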
error_bad_lines : bool, optional, default None
Lines with too many fields (e.g. a CSV line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, these “bad lines” will be dropped from the DataFrame that is returned.
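A minimal sketch of dropping malformed lines. Note that in newer pandas (1.3+) error_bad_lines was deprecated and replaced by on_bad_lines="skip", which this example uses; older versions like 0.12 would pass error_bad_lines=False instead:

```python
import io
import pandas as pd

# The second data line has an extra field (three values for two columns).
data = "a,b\n1,2\n3,4,5\n6,7\n"

# In pandas >= 1.3 use on_bad_lines="skip";
# in older versions this was error_bad_lines=False.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")

print(len(df))  # the malformed line is dropped, leaving 2 rows
```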
There isn't a set maximum of columns - the issue is that you've quite simply run out of available memory on your computer, unfortunately. One way to fix it is to get some more memory - but that obviously isn't a solid solution in the long run (might be quite expensive, too).
index_col: This allows you to set which column(s) to use as the index of the DataFrame. The default is None, in which case pandas adds a new index starting from 0. It can be set to a column name or a column position, which will then be used as the index column.
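A quick sketch of the two behaviours (the data here is made up for illustration):

```python
import io
import pandas as pd

data = "id,value\nA,1\nB,2\n"

# Default (index_col=None): pandas creates a fresh 0-based index.
df_default = pd.read_csv(io.StringIO(data))

# index_col="id" (or index_col=0) uses that column as the index instead.
df_indexed = pd.read_csv(io.StringIO(data), index_col="id")

print(list(df_default.index))  # [0, 1]
print(list(df_indexed.index))  # ['A', 'B']
```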
If error_bad_lines is False and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with the C parser.)
The problem seems to be that you are specifying both nrows and chunksize. At least in pandas 0.14.0, using

pandas.read_csv(filename, nrows=some_number, chunksize=another_number)

returns a DataFrame (reading the whole data), whereas

pandas.read_csv(filename, chunksize=another_number)

returns a TextFileReader that loads the file lazily.
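In current pandas versions the lazy behaviour of chunksize can be checked directly (the 0.14-era quirk of nrows + chunksize returning a DataFrame may not reproduce on modern releases, so this sketch only demonstrates the chunksize-only case):

```python
import io
import pandas as pd

data = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize alone, read_csv returns a lazy reader, not a DataFrame.
reader = pd.read_csv(io.StringIO(data), chunksize=2)
print(isinstance(reader, pd.DataFrame))  # False

# Iterating the reader yields DataFrames of up to chunksize rows each.
chunks = list(reader)
print([len(c) for c in chunks])  # [2, 1]
```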
Splitting a CSV then works like this (numbering the output files so each chunk gets its own):

for i, chunk in enumerate(pandas.read_csv(filename, chunksize=your_chunk_size)):
    chunk.to_csv("chunk_%d.csv" % i)