
Is pandas concat an in-place function?

Tags:

python

pandas

I guess this question needs some insight into the implementation of concat.

Say I have 30 files, 1 GB each, and I can use at most 32 GB of memory. I loaded the files into a list of DataFrames called 'list_of_pieces'. This list_of_pieces should be about 30 GB in size, right?

If I do 'pd.concat(list_of_pieces)', does concat allocate another 30 GB (or maybe 10 GB or 15 GB) on the heap and do some operations, or does it run the concatenation 'in-place' without allocating new memory?

Does anyone know?

Thanks!

James Bond asked Jun 07 '13 11:06




1 Answer

The answer is no, this is not an in-place operation. np.concatenate is used under the hood, and it allocates a new array and copies the data into it; see the related question "Concatenate Numpy arrays without copying".
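One way to check this on a small scale is sketched below. The three tiny frames are hypothetical stand-ins for the 1 GB pieces in the question, and the column name "a" is made up for illustration. The sketch only shows that the result does not share a buffer with the inputs; it does not measure peak memory.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the 1 GB pieces described in the question.
list_of_pieces = [
    pd.DataFrame({"a": np.arange(i * 1000, (i + 1) * 1000, dtype="int64")})
    for i in range(3)
]

combined = pd.concat(list_of_pieces, ignore_index=True)

# concat returns a new object; none of the inputs is modified or reused.
is_new_object = all(combined is not piece for piece in list_of_pieces)

# Check whether the combined column shares a buffer with any input column.
shares_memory = any(
    np.shares_memory(combined["a"].to_numpy(), piece["a"].to_numpy())
    for piece in list_of_pieces
)

# Bytes held by the inputs and by the result (column data only).
bytes_in = sum(piece["a"].to_numpy().nbytes for piece in list_of_pieces)
bytes_out = combined["a"].to_numpy().nbytes

print(is_new_object, shares_memory, bytes_in, bytes_out)
```

While both the list and the result are alive, the column data exists twice, so for the 30 GB case in the question roughly another 30 GB would be needed for the result alone.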

A better approach to the problem is to write each of these pieces to an HDFStore table. See http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables for the docs, and http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore for some recipes.

Then you can select whatever portions you need (or even the whole set), either by query or by row number.
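A minimal sketch of that workflow follows, assuming PyTables is installed. The key "pieces", the columns "file_id" and "value", the helper make_piece, and the temporary file path are all hypothetical names chosen for illustration; in practice each piece would be read from one of the 30 files and appended in turn, so only one piece needs to be in memory at a time.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_piece(i, rows=100):
    """Hypothetical stand-in for loading one file into a DataFrame."""
    return pd.DataFrame(
        {
            "file_id": np.full(rows, i, dtype="int64"),
            "value": np.arange(rows, dtype="float64"),
        }
    )


# Hypothetical location for the store.
path = os.path.join(tempfile.mkdtemp(), "pieces.h5")

# Append the pieces one at a time to a single on-disk table.
with pd.HDFStore(path, mode="w") as store:
    for i in range(3):
        # data_columns makes file_id usable in a where clause later.
        store.append("pieces", make_piece(i), data_columns=["file_id"])

with pd.HDFStore(path, mode="r") as store:
    total_rows = store.get_storer("pieces").nrows
    # Select by query on a data column.
    by_query = store.select("pieces", where="file_id == 1")
    # Select by row number.
    by_rows = store.select("pieces", start=0, stop=10)

print(total_rows, len(by_query), len(by_rows))
```

Only the selected rows are loaded into memory, which is the point of keeping the full set on disk.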

Certain types of operations can even be done while the data is on disk; see https://github.com/pydata/pandas/issues/3202?source=cc and http://pytables.github.io/usersguide/libref/expr_class.html#

Jeff answered Nov 02 '22 22:11