In pandas, is inplace = True considered harmful, or not?

This has been discussed before, but with conflicting answers:

in-place is good!
in-place is bad!

What I'm wondering is:

Why is inplace = False the default behavior?
When is it good to change it? (Well, I'm allowed to change it, so I guess there's a reason.)
Is this a safety issue? That is, can an operation fail or misbehave due to inplace = True?
Can I know in advance if a certain inplace = True operation will "really" be carried out in-place?

My take so far:

Many Pandas operations have an inplace parameter, always defaulting to False, meaning the original DataFrame is untouched, and the operation returns a new DF.
When setting inplace = True, the operation might work on the original DF, but it might still work on a copy behind the scenes, and just reassign the reference when done.
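
The difference between the two modes can be seen in a minimal sketch. The frame contents and the variable names `sorted_df` and `df_inplace` are made up for illustration:

```python
import pandas as pd

# Illustrative data only; not taken from any real dataset.
df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Default (inplace=False): a new DataFrame comes back, df is left alone.
sorted_df = df.sort_values("a")

# inplace=True: the object the method is called on is the one that changes.
df_inplace = df.copy()
df_inplace.sort_values("a", inplace=True)

print(df["a"].tolist())          # original row order
print(sorted_df["a"].tolist())   # sorted result in a separate object
print(df_inplace["a"].tolist())  # sorted, same object as before the call
```

Whether the in-place call avoids an internal copy is a separate matter that this sketch does not measure.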

pros of `inplace = True`:

Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).

pros of `inplace = False`:

Allows chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers a chance for lazy evaluation or a more efficient re-ordering (though I don't think Pandas is doing this).
When using inplace = True on an object which is potentially a slice/view of an underlying DF, Pandas has to do a SettingWithCopy check, which is expensive. inplace = False avoids this.
Consistent & predictable behavior behind the scenes.

So, putting the copy-vs-view issue aside, it seems more performant to always use inplace = True, unless specifically writing a chained statement. But that's not the default pandas opted for, so what am I missing?

851

asked Aug 08 '17 14:08

OmerB

2 Answers

In pandas, is inplace = True considered harmful, or not?

Yes, it is. Not just harmful. Quite harmful. This GitHub issue proposes that the inplace argument be deprecated API-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:

inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace can lead to the dreaded SettingWithCopyWarning when called on a DataFrame column, and may sometimes fail to update the column in-place

The pain points above are all common pitfalls for beginners, so removing this option will simplify the API greatly.

Let's look at each of these points in more depth.

Performance
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In general, there are no performance benefits to using inplace=True (but there are rare exceptions which are mostly a result of implementation detail in the library and should not be used as a crutch to advocate for this argument's usage). Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.

Method Chaining
inplace=True also hinders method chaining. Contrast the working of

result = df.some_function1().reset_index().some_function2()

As opposed to

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
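
As a concrete version of the chained form, here is a small sketch that uses real pandas methods in place of the `some_function1`/`some_function2` placeholders. The data and the column name `value` are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({"a": [3.0, None, 1.0], "b": ["x", "y", "z"]})

# Every call returns a new DataFrame, so the steps read as one expression
# and no intermediate variable is needed.
result = (
    df.dropna()
      .rename(columns={"a": "value"})
      .sort_values("value")
      .reset_index(drop=True)
)

print(result)
```

Passing inplace=True to any step would break the chain at that point, because the next method would have nothing useful to be called on.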

Unintended Pitfalls
One final caveat to keep in mind is that passing inplace=True can trigger the SettingWithCopyWarning:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame

This can cause unexpected behavior: as noted above, the column may sometimes not be updated in place at all.
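
One way to sidestep the warning, sketched here as a common idiom rather than as the fix the answer prescribes, is to take an explicit copy of the filtered frame and assign the result of replace() back to the column:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# An explicit copy makes df2 unambiguously its own object.
df2 = df[df["a"] > 1].copy()

# Assign the returned Series back instead of passing inplace=True.
df2["b"] = df2["b"].replace({"x": "abc"})

print(df2["b"].tolist())
print(df["b"].tolist())
```

Here it is clear which object is modified: df2 gets the replaced values and df keeps its original column.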

125

answered Oct 18 '22 02:10

cs95

If inplace=True were the default, the DataFrame would be mutated for every name that currently references it.

A simple example: say I have a df:

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

Now suppose it's very important that this DataFrame retains its row order, for instance because it comes from a data source where insertion order matters.

However, I now need to do some operations which require a different sort order:

def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df

That's fine - my original df remains the same. However, if inplace=True were the default, my original df would now be sorted as a side effect of f(), and I'd have to trust whoever wrote f() to remember not to mutate my data in place, rather than in-place changes happening only when someone deliberately asks for them. So it's better that anything that can mutate an object in place does so explicitly, to make it more obvious what happened and why.
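
The two behaviours can be compared side by side in a short sketch. The function `g` and the `order_after_*` variables are hypothetical names added for illustration; `f` mirrors the function above:

```python
import pandas as pd

def f(frame):
    # Non-mutating: sort_values returns a new, sorted DataFrame.
    return frame.sort_values("a")

def g(frame):
    # Mutating: the caller's object is reordered as a side effect.
    frame.sort_values("a", inplace=True)
    return frame

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

f(df)
order_after_f = df["a"].tolist()  # caller's row order is unchanged

g(df)
order_after_g = df["a"].tolist()  # caller's row order is now sorted

print(order_after_f, order_after_g)
```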

Even with basic Python builtin mutables, you can observe this:

data = [3, 2, 1]

def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)

f(data)

for item in data:
    print(item)

# huh!? What happened to my data - why's it not 3, 2, 1?
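
For comparison, here is a sketch of the version the comment says was intended, using sorted() so that the caller's list is left alone. The return value and the name `ordered` are additions for illustration:

```python
data = [3, 2, 1]

def f(lst):
    # sorted() builds a new list; lst (and therefore data) is not modified.
    ordered = sorted(lst)
    for item in ordered:
        print(item)
    return ordered

ordered = f(data)

for item in data:
    print(item)  # still 3, 2, 1
```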

answered Oct 18 '22 02:10

Jon Clements

Tags: python, pandas
