 

What are the exact downsides of copy=False in DataFrame.merge()?


I am a bit confused about the argument copy in DataFrame.merge() after a co-worker asked me about that.

The docstring of DataFrame.merge() states:

    copy : boolean, default True
        If False, do not copy data unnecessarily

The pandas documentation states:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

The docstring kind of implies that copying the data is unnecessary and can be skipped nearly always. The documentation, on the other hand, says that copying data can't be avoided in many cases.

My questions are:

  • What are those cases?
  • What are the downsides?
asked Sep 01 '15 11:09 by moritzbracht


1 Answer

Disclaimer: I'm not very experienced with pandas and this is the first time I dug into its source, so I can't guarantee that I'm not missing something in my below assessment.

The relevant bits of code have been recently refactored. I'll discuss the subject in terms of the current stable version 0.20, but I don't suspect functional changes compared to earlier versions.

The investigation starts with the source of merge in pandas/core/reshape/merge.py (formerly pandas/tools/merge.py). Ignoring some doc-aware decorators:

    def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
              left_index=False, right_index=False, sort=False,
              suffixes=('_x', '_y'), copy=True, indicator=False):
        op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
                             right_on=right_on, left_index=left_index,
                             right_index=right_index, sort=sort, suffixes=suffixes,
                             copy=copy, indicator=indicator)
        return op.get_result()

Calling merge passes the copy parameter on to the constructor of the _MergeOperation class and then calls its get_result() method. The first few lines of the class, with context:

    # TODO: transformations??
    # TODO: only copy DataFrames when modification necessary
    class _MergeOperation(object):
        [...]

Now that second comment is highly suspicious. Moving on, the copy kwarg is bound to an eponymous instance attribute, which only seems to reappear once within the class:

    result_data = concatenate_block_managers(
        [(ldata, lindexers), (rdata, rindexers)],
        axes=[llabels.append(rlabels), join_index],
        concat_axis=0, copy=self.copy)

We can then track down the concatenate_block_managers function in pandas/core/internals.py that just passes on the copy kwarg to concatenate_join_units.

We reached the final resting place of the original copy keyword argument in concatenate_join_units:

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy and concat_values.base is not None:
            concat_values = concat_values.copy()
    else:
        concat_values = _concat._concat_compat(to_concat, axis=concat_axis)

As you can see, the only thing copy does here is replace concat_values with a copy of itself, and only in the special case where there is a single block (so there's really nothing to concatenate) and that block is a view onto other data (its .base is not None).
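To make that branch easier to poke at, here is a minimal sketch that imitates it with plain NumPy arrays. This is not pandas code: the function name concat_or_passthrough is made up for illustration, and np.concatenate is only an assumed stand-in for the internal _concat._concat_compat helper.

```python
import numpy as np


def concat_or_passthrough(to_concat, copy):
    """Hypothetical imitation of the quoted branch (not the pandas source)."""
    if len(to_concat) == 1:
        # Only one block: nothing to concatenate, so the input can be reused.
        values = to_concat[0]
        # A non-None .base means the array is a view onto another array's memory.
        if copy and values.base is not None:
            values = values.copy()
    else:
        # Assumed stand-in for _concat._concat_compat; np.concatenate always
        # allocates a new array, so the copy flag plays no role here.
        values = np.concatenate(to_concat, axis=0)
    return values


owner = np.arange(6)   # owns its memory, so owner.base is None
view = owner[:3]       # a view, so view.base is owner

same = concat_or_passthrough([view], copy=False)          # handed back untouched
fresh = concat_or_passthrough([view], copy=True)          # the view gets copied
joined = concat_or_passthrough([view, view], copy=False)  # new array regardless
```

In this sketch the flag only makes a difference for the single-view case; with more than one block a new array is allocated either way, which matches the "no-op-like" impression the real code gives.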

Now, at this point my lack of pandas knowledge starts to show, because I'm not really sure what exactly is going on this deep inside the call stack. But the above hot-potato scheme with the copy keyword argument ending in that no-op-like branch of a concatenation function is perfectly consistent with the "TODO" comment above, the documentation quoted in the question:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

(emphasis mine), and the related discussion on an old issue:

IIRC I think the copy parameter only matters here is its a trivial merge and you actually do want it copied (kind I like a reindex with the same index)

Based on these hints I suspect that in the vast majority of real use cases copying is inevitable and the copy keyword argument has no effect. However, since skipping the copy step might improve performance in the small number of exceptions (without any performance impact on the majority of use cases), the option was provided.

I suspect the rationale is roughly this. Skipping the copy when it isn't needed (which is possible only in a few special cases) saves some memory allocation and copying. On the other hand, not returning a copy in those cases could lead to surprises for anyone who doesn't expect that mutating the return value of merge might affect the original dataframe. So the copy keyword argument defaults to True, and the user only misses out on a copy if they explicitly opt in (and even then they will most likely end up with a copy).
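The kind of surprise I have in mind can be illustrated with plain NumPy arrays standing in for the data that a dataframe holds internally. This is only an analogy under that assumption; it does not call merge, and whether a given merge call actually shares memory depends on the pandas version and on the inputs.

```python
import numpy as np

source = np.array([1.0, 2.0, 3.0])

# Without a copy, the "result" is just another window onto the same memory.
shared_result = source[:]
shared_result[0] = -1.0
source_after_shared_write = source[0]   # the write leaks into source

# With a defensive copy, the result owns its own memory.
source = np.array([1.0, 2.0, 3.0])
copied_result = source.copy()
copied_result[0] = -1.0
source_after_copied_write = source[0]   # source is untouched
```

To check a real case, np.shares_memory can be applied to the underlying arrays of the input and the merged output.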
