
Pandas: peculiar performance drop for inplace rename after dropna

I have reported this as an issue on the pandas issue tracker. In the meantime, I am posting it here in the hope of saving others time if they run into something similar.

While profiling a process that needed optimizing, I found that renaming columns NOT in place cut execution time by a factor of about 120. Profiling indicates that this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example shows a slowdown of about 12x:

import pandas as pd
import numpy as np

inplace=True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
# r//3 (integer division): the sample size must be an int on Python 3
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

First output line of %%prun:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.018    0.018    0.018    0.018  {gc.collect}

inplace=False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
# r//3 (integer division): the sample size must be an int on Python 3
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
# r//3 (integer division): the sample size must be an int on Python 3
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 µs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
# r//3 (integer division): the sample size must be an int on Python 3
indx = np.random.choice(range(r), r//3, replace=False)
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna:
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 µs per loop
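
For readers without IPython, the four variants above can be compared from a plain script. This is only a sketch under a few assumptions: the helper names `build_frames` and `rename_variant` are made up for illustration, the frames are built once outside the timed region (so the absolute numbers will not match the `%%timeit` figures above), and the second index is built from an explicit copy of the first. Whether the slowdown shows up at all is likely to depend on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def build_frames(r=7, c=3, seed=0):
    """Build two frames whose float indexes differ in r // 3 positions."""
    rng = np.random.RandomState(seed)
    t1 = rng.rand(r)
    t2 = t1.copy()  # separate array, so df1's index is never touched
    indx = rng.choice(range(r), r // 3, replace=False)
    t2[indx] = rng.rand(len(indx))
    df1 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t1)
    df2 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t2)
    return df1, df2


def rename_variant(df1, df2, use_dropna, inplace):
    """Run one of the four variants and return the renamed frame."""
    df = df1 - df2
    if use_dropna:
        df = df.dropna()
    mapping = {col: 'd{}'.format(col) for col in df.columns}
    if inplace:
        df.rename(columns=mapping, inplace=True)
    else:
        df = df.rename(columns=mapping)
    return df


df1, df2 = build_frames()
timings = {}
for use_dropna in (True, False):
    for inplace in (True, False):
        # Best of 3 runs of 100 calls each, reported per call.
        seconds = min(timeit.repeat(
            lambda: rename_variant(df1, df2, use_dropna, inplace),
            repeat=3, number=100))
        timings[(use_dropna, inplace)] = seconds / 100
        print('dropna={!s:5} inplace={!s:5} {:.3f} ms per call'.format(
            use_dropna, inplace, 1e3 * seconds / 100))
```

All four variants return the same renamed columns; only the time per call is expected to differ.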

eldad-a asked Mar 20 '14 11:03



1 Answer

This is a copy of the explanation on github.

There is no guarantee that an inplace operation is actually faster. Often they are actually the same operation that works on a copy, but the top-level reference is reassigned.
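
As a small illustration of that point (a sketch with made-up frame contents, not a statement about pandas internals), the in-place call and the reassigning call leave you with equal frames. The visible difference is only that the in-place call returns None:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(np.arange(6.0).reshape(2, 3), columns=range(3))
mapping = {col: 'd{}'.format(col) for col in original.columns}

# In-place: returns None and changes the object it is called on.
in_place = original.copy()
returned = in_place.rename(columns=mapping, inplace=True)

# Reassignment: builds a renamed frame and rebinds the name to it.
reassigned = original.copy()
reassigned = reassigned.rename(columns=mapping)

print(returned)                     # None
print(in_place.equals(reassigned))  # True
```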

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the DataFrame. When you then apply a new operation, this triggers a SettingWithCopy check, because the object could be a copy (but often is not).

To find out whether it is a copy, this check must perform a garbage collection to wipe out some cached references. Unfortunately, Python's syntax makes this unavoidable.
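
To get a feel for why a forced collection is costly, here is a rough sketch that times `gc.collect()` on its own. It does not call pandas at all; the list of 200000 small lists is an arbitrary stand-in for whatever objects a real session keeps alive, and the result will vary with machine and Python version.

```python
import gc
import timeit

# Hypothetical workload: many live container objects for the collector
# to traverse. The count is arbitrary.
live_objects = [[i] for i in range(200000)]

calls = 5
seconds_per_call = timeit.timeit(gc.collect, number=calls) / calls
print('gc.collect(): {:.2f} ms per call with {} tracked lists alive'.format(
    1e3 * seconds_per_call, len(live_objects)))
```

A full collection has to visit every tracked object, so its cost grows with the size of the session rather than with the size of the DataFrame being renamed.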

You can avoid this by making a copy first:

df = (df1-df2).dropna().copy()

An in-place operation after that will be as performant as before.
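
A minimal, self-contained sketch of that workaround, using two tiny hand-written frames (the values are made up) instead of the random ones from the question:

```python
import pandas as pd

df1 = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]},
                   index=[0.1, 0.2, 0.3])
df2 = pd.DataFrame({0: [1.0, 1.0, 1.0], 1: [1.0, 1.0, 1.0]},
                   index=[0.1, 0.2, 0.4])

# Rows 0.3 and 0.4 exist in only one frame, so the difference is NaN there
# and dropna() removes them. The explicit copy() comes before the rename.
df = (df1 - df2).dropna().copy()
df.rename(columns={col: 'd{}'.format(col) for col in df.columns}, inplace=True)

print(df)
```

The result keeps the rows at 0.1 and 0.2 with columns d0 and d1.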

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

Jeff answered Sep 21 '22 03:09