I am a bit confused about the argument copy in DataFrame.merge() after a co-worker asked me about that.
The docstring of DataFrame.merge() states:

    copy : boolean, default True
        If False, do not copy data unnecessarily
The pandas documentation states:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
The docstring kind of implies that copying the data is unnecessary and can nearly always be skipped. The documentation, on the other hand, says that copying data can't be avoided in many cases.
My questions are:
Disclaimer: I'm not very experienced with pandas and this is the first time I dug into its source, so I can't guarantee that I'm not missing something in my below assessment.
The relevant bits of code have been recently refactored. I'll discuss the subject in terms of the current stable version 0.20, but I don't suspect functional changes compared to earlier versions.
The investigation starts with the source of merge in pandas/core/reshape/merge.py (formerly pandas/tools/merge.py). Ignoring some doc-aware decorators:

    def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
              left_index=False, right_index=False, sort=False,
              suffixes=('_x', '_y'), copy=True, indicator=False):
        op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
                             right_on=right_on, left_index=left_index,
                             right_index=right_index, sort=sort,
                             suffixes=suffixes, copy=copy, indicator=indicator)
        return op.get_result()
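If you want to repeat this source dive against whatever pandas version you have installed, a minimal sketch along the following lines may help. It only uses the standard inspect module and the public pd.merge function; what it prints depends on your version, and newer releases may report a different default for copy than the 0.20 signature quoted here.

```python
import inspect

import pandas as pd

# Where does the installed pandas define the top-level merge function?
source_file = inspect.getsourcefile(pd.merge)
print(source_file)

# Does this version still expose a `copy` parameter, and with what default?
params = inspect.signature(pd.merge).parameters
copy_param = params.get("copy")
print(copy_param.default if copy_param is not None else "no `copy` parameter")
```

From there, inspect.getsource(pd.merge) prints the function body, so you can follow the calls by hand.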
Calling merge passes the copy parameter on to the constructor of the _MergeOperation class, then calls its get_result() method. The first few lines of the class, with context:

    # TODO: transformations??
    # TODO: only copy DataFrames when modification necessary
    class _MergeOperation(object):
        [...]
Now that second comment is highly suspicious. Moving on, the copy kwarg is bound to an eponymous instance attribute, which only seems to reappear once within the class:

    result_data = concatenate_block_managers(
        [(ldata, lindexers), (rdata, rindexers)],
        axes=[llabels.append(rlabels), join_index],
        concat_axis=0, copy=self.copy)
We can then track down the concatenate_block_managers function in pandas/core/internals.py, which just passes the copy kwarg on to concatenate_join_units.
We have reached the final resting place of the original copy keyword argument in concatenate_join_units:

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy and concat_values.base is not None:
            concat_values = concat_values.copy()
    else:
        concat_values = _concat._concat_compat(to_concat, axis=concat_axis)
As you can see, the only thing copy does here is rebind the name concat_values to a copy of itself, and only in the special case where there is really nothing to concatenate.
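To make that branch concrete, here is a toy re-creation using plain NumPy arrays. The function name concat_single_or_many is made up for illustration, np.concatenate stands in for the pandas-internal _concat_compat, and I'm assuming the blocks behave like ordinary NumPy arrays (whose .base is None when they own their memory and not None when they are views). It is a sketch of the logic, not the pandas implementation.

```python
import numpy as np


def concat_single_or_many(to_concat, copy, axis=0):
    """Toy stand-in for the branch quoted above; not the pandas function."""
    if len(to_concat) == 1:
        # Only one block: nothing to concatenate, so reuse it ...
        values = to_concat[0]
        # ... unless a copy was requested and the array is a view
        # (for a NumPy array, .base is None when it owns its memory).
        if copy and values.base is not None:
            values = values.copy()
        return values
    # Several blocks: concatenation always allocates a new array,
    # so `copy` has no influence on this path.
    return np.concatenate(to_concat, axis=axis)


owner = np.arange(6)
view = owner[:3]  # a view: view.base is owner

kept = concat_single_or_many([view], copy=False)
copied = concat_single_or_many([view], copy=True)
joined = concat_single_or_many([view, view], copy=False)

print(np.shares_memory(kept, owner))    # single block, copy=False
print(np.shares_memory(copied, owner))  # single block, copy=True
print(np.shares_memory(joined, owner))  # two blocks, copy=False
```

Only the single-block call with copy=False hands back memory shared with the input. As soon as there are two or more blocks, the flag is never consulted.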
Now, at this point my lack of pandas knowledge starts to show, because I'm not really sure what exactly is going on this deep inside the call stack. But the above hot-potato scheme with the copy keyword argument ending in that no-op-like branch of a concatenation function is perfectly consistent with the "TODO" comment above, the documentation quoted in the question:

copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.

(emphasis mine), and the related discussion on an old issue:

IIRC I think the copy parameter only matters here is its a trivial merge and you actually do want it copied (kind I like a reindex with the same index)
Based on these hints I suspect that in the vast majority of real use cases copying is inevitable, and the copy keyword argument never comes into play. However, skipping the copy step might improve performance in the small number of exceptions, and offering the choice costs the majority of use cases nothing, so the option was implemented.
I suspect that the rationale is something like this: the upside of skipping the copy (which is only possible in a few very special cases) is that the code avoids some memory allocations and copying. The downside is that not returning a copy in those cases could cause surprises for anyone who doesn't expect that mutating the return value of merge could in any way affect the original dataframe. So the default value of the copy keyword argument is True, and the user only misses out on a copy from merge if they explicitly volunteer for this (but even then they'll still likely end up with a copy).
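If you'd rather check this empirically than trust my reading of the source, something like the sketch below reports whether a merge result shares memory with its inputs. The helper merge_shares_memory and the tiny example frames are my own illustration. I deliberately don't state an expected result, because it depends on the pandas version (newer releases may ignore or deprecate copy, hence the warnings filter) and on whether the merge happens to be one of the trivial cases.

```python
import warnings

import numpy as np
import pandas as pd


def merge_shares_memory(copy):
    """Return True if any merged column shares memory with an input column."""
    left = pd.DataFrame({"key": [1, 2, 3], "a": [10.0, 20.0, 30.0]})
    right = pd.DataFrame({"key": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
    with warnings.catch_warnings():
        # Newer pandas releases may warn that `copy` is deprecated.
        warnings.simplefilter("ignore")
        merged = left.merge(right, on="key", copy=copy)
    pairs = [(merged["a"], left["a"]), (merged["b"], right["b"])]
    return any(
        np.shares_memory(out.to_numpy(), src.to_numpy()) for out, src in pairs
    )


for flag in (True, False):
    print(f"copy={flag}: shares memory with inputs -> {merge_shares_memory(flag)}")
```

If both lines print False, your merge copied the data regardless of the flag, which is what the analysis above predicts for typical merges.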