
Why aren't Pandas operations in-place?

Pandas operations usually create a copy of the original dataframe. As some answers on SO point out, even when using inplace=True, a lot of operations still create a copy to operate on.

Now, I think I'd be called a madman if I told my colleagues that every time I want to, for example, add 2 to every element of a list, I copy the whole thing before doing it. Yet that is what Pandas does. Even simple operations such as append always reallocate the whole dataframe.
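To make the complaint concrete, here is a minimal sketch of the behavior being described. The frame and its column name `a` are made up for illustration; the NumPy half is only there as a contrast to show what an in-place update looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3]})

# Arithmetic on a DataFrame returns a new object; the original keeps its values.
shifted = df + 2
df_unchanged = df["a"].tolist() == [1, 2, 3]
shifted_values = shifted["a"].tolist()

# For contrast, NumPy's augmented assignment writes into the existing buffer.
arr = np.array([1, 2, 3])
alias = arr   # second name bound to the same array
arr += 2      # in-place update: `alias` sees the change too
same_buffer = np.shares_memory(arr, alias)
```

Whether a given Pandas call copies the underlying data internally depends on the method and the Pandas version, so this only shows the user-visible result: a new object comes back and the original is untouched.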

Having to reallocate and copy everything on every operation seems like a very inefficient way to go about operating on any data. It also makes operating on particularly large dataframes impossible, even if they fit in your RAM.

Furthermore, this does not seem to be a problem for Pandas developers or users, so much so that there's an open issue, #16529, discussing the removal of the inplace parameter entirely, which has received mostly positive responses; some uses of the parameter have been deprecated since 1.0. It seems like I'm missing something. So, what am I missing?

What are the advantages of always copying the dataframe on operations, instead of executing them in-place whenever possible?

Note: I agree that method chaining is very neat, and I use it all the time. However, I feel that "because we can method chain" is not the whole answer, since Pandas sometimes copies even in inplace=True methods, which are not meant to be chained. So I'm looking for other answers as to why this would be a reasonable default.

Luiz Martins asked Nov 15 '21 04:11


1 Answer

As the pandas documentation puts it, "... In general we like to favor immutability where sensible." The Pandas project is in the camp that prefers immutable (stateless) objects over mutable ones (objects with state), to guide programmers toward more scalable and parallelizable data processing code. It steers users in that direction by making inplace=False the default behavior.
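A small sketch of what that default means in practice, using `sort_values` as one example of a method that accepts `inplace`. The frame and column name are hypothetical.

```python
import pandas as pd

# Hypothetical frame, deliberately out of order.
df = pd.DataFrame({"a": [3, 1, 2]})

# Default (inplace=False): a sorted copy comes back, `df` keeps its order.
sorted_df = df.sort_values("a")
order_after_default = df["a"].tolist()
sorted_order = sorted_df["a"].tolist()

# Opting in with inplace=True changes `df` itself instead.
df.sort_values("a", inplace=True)
order_after_inplace = df["a"].tolist()
```

Note that inplace=True here only describes what the caller sees; as the question points out, it does not guarantee that no copy was made internally.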

In this Software Engineering Stack Exchange answer, Peter Torok gives a good discussion of the pros and cons of mutable versus immutable objects: https://softwareengineering.stackexchange.com/a/151735

In summary, some software engineers feel that immutable (unchanging) objects lead to:

  • fewer errors in the code - because the state of a mutable object is easy to lose track of, and the resulting bugs are hard to track down
  • increased scalability - it is easier to write multithreaded code, since one thread cannot inadvertently modify the value held by an object that another thread is using
  • more concise code - since the code is forced into a functional, more mathematical style
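As a rough illustration of the functional style the last point refers to, here is a sketch of a method chain in which every step returns a new frame. The column names and the 1.2 multiplier are invented for the example.

```python
import pandas as pd

# Hypothetical input data.
raw = pd.DataFrame({"name": ["a", "b", "c"], "price": [10.0, 25.0, 40.0]})

# Each step returns a new DataFrame, so the steps compose into one expression
# and `raw` can be reused elsewhere without worrying about hidden changes.
result = (
    raw
    .assign(price_with_tax=lambda d: d["price"] * 1.2)   # add a derived column
    .loc[lambda d: d["price_with_tax"] > 20]             # keep rows above a threshold
    .sort_values("price_with_tax", ascending=False)      # most expensive first
    .reset_index(drop=True)
)

raw_columns = list(raw.columns)
result_names = result["name"].tolist()
```

Because `raw` never changes, any other part of the program holding a reference to it still sees the original two columns, which is the "easy to reason about" property the list above describes.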

I will agree that this approach has its inefficiencies, since constantly copying the same objects for minor changes does not seem ideal. The trade-off is the set of benefits noted above.

lane answered Oct 08 '22 08:10