My CSV file contains 6 million records, and I am trying to split it into multiple smaller files using skiprows. My pandas version is 0.12.0, and the code is
pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000)
It works as long as skiprows is less than 900000. Any idea whether this is expected? If I do not use skiprows, nrows can go up to 5 million records. I have not yet tried beyond that, but will.
I tried a CSV splitter, but it does not handle the first entry properly, perhaps because each cell consists of multiple lines of code.
EDIT: I was able to split it by reading the entire 7 GB file with pandas read_csv and writing it out in parts to multiple CSV files.
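The chunked split described in the edit can be sketched like this (the file names and chunk size here are illustrative, not from the question):

```python
import pandas as pd

# Create a small stand-in for the large training file
# ("train_demo.csv" and the chunk size are made up for this demo).
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv(
    "train_demo.csv", index=False, header=False
)

# Read lazily in chunks and write each chunk to its own CSV,
# so the whole file never has to fit in memory at once.
reader = pd.read_csv("train_demo.csv", header=None, chunksize=4)
for i, chunk in enumerate(reader):
    chunk.to_csv("train_part_%d.csv" % i, index=False, header=False)
# Produces train_part_0.csv, train_part_1.csv, train_part_2.csv
```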
error_bad_lines : bool, optional, default None
Lines with too many fields (e.g. a CSV line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, these “bad lines” will be dropped from the DataFrame that is returned.
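A minimal sketch of dropping malformed lines. Note that in newer pandas (1.3+) error_bad_lines was deprecated and replaced by on_bad_lines="skip", which this example uses; older versions like 0.12 would pass error_bad_lines=False instead:

```python
import io
import pandas as pd

# The second data line has an extra field (three values for two columns).
data = "a,b\n1,2\n3,4,5\n6,7\n"

# In pandas >= 1.3 use on_bad_lines="skip";
# in older versions this was error_bad_lines=False.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")

print(len(df))  # the malformed line is dropped, leaving 2 rows
```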
There isn't a set maximum of columns - the issue is that you've quite simply run out of available memory on your computer, unfortunately. One way to fix it is to get some more memory - but that obviously isn't a solid solution in the long run (might be quite expensive, too).
index_col: This allows you to set which column(s) to use as the index of the DataFrame. The default is None, in which case pandas adds a new index starting from 0. It can be set to a column name or a column position, which will then be used as the index column.
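A quick sketch of the two behaviours (the data here is made up for illustration):

```python
import io
import pandas as pd

data = "id,value\nA,1\nB,2\n"

# Default (index_col=None): pandas creates a fresh 0-based index.
df_default = pd.read_csv(io.StringIO(data))

# index_col="id" (or index_col=0) uses that column as the index instead.
df_indexed = pd.read_csv(io.StringIO(data), index_col="id")

print(list(df_default.index))  # [0, 1]
print(list(df_indexed.index))  # ['A', 'B']
```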
If error_bad_lines is False and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with the C parser.)
The problem seems to be that you are specifying both nrows and chunksize. At least in pandas 0.14.0, using

pandas.read_csv(filename, nrows=some_number, chunksize=another_number)

returns a DataFrame (reading the whole data), whereas

pandas.read_csv(filename, chunksize=another_number)

returns a TextFileReader that loads the file lazily.
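In current pandas versions the lazy behaviour of chunksize can be checked directly (the 0.14-era quirk of nrows + chunksize returning a DataFrame may not reproduce on modern releases, so this sketch only demonstrates the chunksize-only case):

```python
import io
import pandas as pd

data = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize alone, read_csv returns a lazy reader, not a DataFrame.
reader = pd.read_csv(io.StringIO(data), chunksize=2)
print(isinstance(reader, pd.DataFrame))  # False

# Iterating the reader yields DataFrames of up to chunksize rows each.
chunks = list(reader)
print([len(c) for c in chunks])  # [2, 1]
```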
Splitting a CSV then works like this (numbering the output files so each chunk gets its own):

for i, chunk in enumerate(pandas.read_csv(filename, chunksize=your_chunk_size)):
    chunk.to_csv("chunk_%d.csv" % i)