
In pandas, is inplace = True considered harmful, or not?

Tags: python, pandas

This has been discussed before, but with conflicting answers:

  • in-place is good!
  • in-place is bad!

What I'm wondering is:

  • Why is inplace = False the default behavior?
  • When is it good to change it? (well, I'm allowed to change it, so I guess there's a reason).
  • Is this a safety issue? That is, can an operation fail or misbehave because of inplace = True?
  • Can I know in advance if a certain inplace = True operation will "really" be carried out in-place?

My take so far:

  • Many Pandas operations have an inplace parameter, always defaulting to False, meaning the original DataFrame is untouched, and the operation returns a new DF.
  • When setting inplace = True, the operation might work on the original DF, but it might still work on a copy behind the scenes, and just reassign the reference when done.
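As a minimal sketch of the two behaviours described above (the sample data and the choice of sort_values are only illustrative), the default call returns a new DataFrame and leaves the original alone, while inplace=True returns None and changes the object it was called on:

```python
import pandas as pd

original = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Default (inplace=False): a new, sorted DataFrame is returned.
sorted_copy = original.sort_values("a")
order_after_default_call = original["a"].tolist()  # original row order is untouched

# inplace=True: nothing is returned, and `original` itself is now sorted.
returned = original.sort_values("a", inplace=True)
order_after_inplace_call = original["a"].tolist()

print(sorted_copy["a"].tolist(), order_after_default_call)
print(returned, order_after_inplace_call)
```

Neither call reveals whether pandas copied the data internally; that depends on the method and on the pandas version.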

pros of inplace = True:

  • Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).

pros of inplace = False :

  • Allows chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers a chance for lazy evaluation or a more efficient re-ordering (though I don't think Pandas is doing this).
  • When using inplace = True on an object which is potentially a slice/view of an underlying DF, Pandas has to do a SettingWithCopy check, which is expensive. inplace = False avoids this.
  • Consistent & predictable behavior behind the scenes.

So, putting the copy-vs-view issue aside, it seems more performant to always use inplace = True, unless specifically writing a chained statement. But that's not the default pandas opts for, so what am I missing?

asked Aug 08 '17 14:08 by OmerB



2 Answers

In pandas, is inplace = True considered harmful, or not?

Yes, it is. Not just harmful. Quite harmful. This GitHub issue is proposing the inplace argument be deprecated api-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace can lead to the dreaded SettingWithCopyWarning when called on a DataFrame column, and may sometimes fail to update the column in-place

The pain points above are all common pitfalls for beginners, so removing this option will simplify the API greatly.


We take a look at the points above in more depth.

Performance
It is a common misconception that using inplace=True leads to more efficient or optimized code. In general, there are no performance benefits to using inplace=True (there are rare exceptions, but these mostly result from implementation details in the library and should not be used as a crutch to advocate for this argument's usage). Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.
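One way to check the performance claim on your own machine is a small timing harness like the sketch below. The helper names (make_frame, best_time), the frame size, and the choice of dropna are assumptions made for illustration; results will vary with the method, the data, and the pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=100_000):
    """Build a throwaway frame containing some NaNs (the size is arbitrary)."""
    rng = np.random.default_rng(0)
    values = rng.random(n_rows)
    values[::10] = np.nan  # every 10th row will be dropped by dropna()
    return pd.DataFrame({"a": values, "b": rng.random(n_rows)})


def best_time(stmt):
    """Best of 5 single runs; the setup rebuilds the frame before each run."""
    runs = timeit.repeat(
        stmt,
        setup="df = make_frame()",
        repeat=5,
        number=1,
        globals={"make_frame": make_frame},
    )
    return min(runs)


t_out_of_place = best_time("result = df.dropna()")
t_in_place = best_time("df.dropna(inplace=True)")

print(f"out-of-place: {t_out_of_place:.4f}s")
print(f"in-place:     {t_in_place:.4f}s")
```

Because the frame is rebuilt in the setup step, the in-place run always starts from unmodified data and the construction cost is not included in either measurement.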

Method Chaining
inplace=True also hinders method chaining. Compare the chained form

result = df.some_function1().reset_index().some_function2()

with the in-place form, which needs a temporary variable and one statement per step:

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
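The same contrast can be written with real pandas methods. This is only a sketch: sort_values and rename stand in for the placeholder some_function1 and some_function2, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Chained: each call returns a new object that feeds the next call.
chained = (
    df.sort_values("a")
      .reset_index(drop=True)
      .rename(columns={"b": "label"})
)

# In-place: every step needs its own statement and a named intermediate.
temp = df.sort_values("a")
temp.reset_index(drop=True, inplace=True)
temp.rename(columns={"b": "label"}, inplace=True)

print(chained.equals(temp))
```

Both versions end with the same frame; the chained one never needs the intermediate name.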

Unintended Pitfalls
One final caveat to keep in mind is that passing inplace=True can trigger the SettingWithCopyWarning:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

This can cause unexpected behavior: the column may not actually be updated in place.
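A common way to sidestep this pitfall, sketched here as a suggestion rather than as the only fix, is to take an explicit copy of the filtered frame and assign the result back to the column instead of passing inplace=True:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# .copy() makes df2 an independent frame rather than a possible view of df.
df2 = df[df["a"] > 1].copy()

# Assign the result back to the column instead of passing inplace=True.
df2["b"] = df2["b"].replace({"x": "abc"})

print(df2["b"].tolist())
print(df["b"].tolist())
```

Written this way, it is unambiguous which object is updated, and the original df is left as it was.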

answered Oct 18 '22 02:10 by cs95


If inplace=True were the default, the DataFrame would be mutated for all names that currently reference it.

A simple example, say I have a df:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

Now suppose it's very important that the DataFrame retains that row order, for instance because it comes from a data source where insertion order is key.

However, I now need to do some operations which require a different sort order:

def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df

That's fine: my original df remains the same. However, if inplace=True were the default, my original df would now be sorted as a side effect of f(), and as the caller I would have to trust every function author to remember not to do something in place that I'm not expecting. So it's better that anything that can mutate an object in place does so explicitly, which at least makes it more obvious what happened and why.
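To make the side effect concrete, here is a small sketch with two hypothetical helpers (returns_sorted and sorts_argument are names invented for this illustration), one of which sorts a new frame and one of which sorts the caller's frame in place:

```python
import pandas as pd


def returns_sorted(frame):
    """Hypothetical helper: returns a sorted frame, leaves the argument alone."""
    return frame.sort_values("a")


def sorts_argument(frame):
    """Hypothetical helper: sorts the caller's object as a side effect."""
    frame.sort_values("a", inplace=True)
    return frame


df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

returns_sorted(df)
order_after_safe_call = df["a"].tolist()  # insertion order preserved

sorts_argument(df)
order_after_inplace_call = df["a"].tolist()  # the caller's frame is now sorted

print(order_after_safe_call, order_after_inplace_call)
```

Only the second helper changes what the caller sees, and because inplace=True has to be spelled out, that change is at least visible in the function body.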

Even with basic Python builtin mutables, you can observe this:

data = [3, 2, 1]

def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)

f(data)

for item in data:
    print(item)

# huh!? What happened to my data - why's it not 3, 2, 1?     
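For comparison, a sketch of the version the comment says was intended (print_sorted is a name made up here): sorted() builds a new list, so the caller's data keeps its order.

```python
data = [3, 2, 1]


def print_sorted(lst):
    # sorted() returns a new list; the caller's list is not modified.
    for item in sorted(lst):
        print(item)


print_sorted(data)
print(data)
```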
answered Oct 18 '22 02:10 by Jon Clements