I have reported this on the pandas issue tracker. In the meantime I am posting it here in the hope of saving others time if they run into something similar.
While profiling a process that needed to be optimized, I found that renaming columns NOT inplace reduces execution time by a factor of about 120. Profiling indicates this is related to garbage collection (see below).
Furthermore, the expected performance is recovered by avoiding the dropna method.
The following short example demonstrates a factor of about 12:
    import pandas as pd
    import numpy as np
    %%timeit
    np.random.seed(0)
    r, c = (7, 3)
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    # r // 3: integer division, so the sample size is an int on Python 3
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    df = (df1 - df2).dropna()
    ## inplace rename:
    df.rename(columns={col: 'd{}'.format(col) for col in df.columns}, inplace=True)
100 loops, best of 3: 15.6 ms per loop
First output line of %%prun:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    0.018    0.018    0.018    0.018  {gc.collect}
    %%timeit
    np.random.seed(0)
    r, c = (7, 3)
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    # r // 3: integer division, so the sample size is an int on Python 3
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    df = (df1 - df2).dropna()
    ## avoid inplace:
    df = df.rename(columns={col: 'd{}'.format(col) for col in df.columns})
1000 loops, best of 3: 1.24 ms per loop
The expected performance is recovered by avoiding the dropna method:
    %%timeit
    np.random.seed(0)
    r, c = (7, 3)
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    # r // 3: integer division, so the sample size is an int on Python 3
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    ## no dropna:
    df = (df1 - df2)  # .dropna()
    ## inplace rename:
    df.rename(columns={col: 'd{}'.format(col) for col in df.columns}, inplace=True)
1000 loops, best of 3: 865 µs per loop
    %%timeit
    np.random.seed(0)
    r, c = (7, 3)
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    # r // 3: integer division, so the sample size is an int on Python 3
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    ## no dropna:
    df = (df1 - df2)  # .dropna()
    ## avoid inplace:
    df = df.rename(columns={col: 'd{}'.format(col) for col in df.columns})
1000 loops, best of 3: 902 µs per loop
The DataFrame.dropna() method is your friend. When you call dropna() on the whole DataFrame without any arguments (i.e. with the default behaviour), it drops every row that has at least one missing value.
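To make the default concrete, here is a minimal sketch on a made-up frame (the column names and values are purely illustrative). It contrasts the default, which drops any row with a missing value, against how="all", which only drops rows that are entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has one NaN, row 2 is entirely NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
})

# Default behaviour (how="any"): drop rows with at least one missing value.
any_dropped = df.dropna()

# how="all": drop only rows in which every value is missing.
all_dropped = df.dropna(how="all")

print(any_dropped.index.tolist())  # [0, 3]
print(all_dropped.index.tolist())  # [0, 1, 3]
```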
If you use chaining (which gives you major pandas style points), you won't need inplace=True at all. inplace=True prevents chaining because nothing is returned from the methods. That's a big stylistic blow, because chaining is where pandas really comes to life.
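As a rough illustration of the chained style, the subtraction, dropna and rename from the question can be written as one expression. The two small frames here are invented for the example; only the shape of the chain matters.

```python
import pandas as pd

# Hypothetical frames whose indices only partly overlap (0 and 1 are shared).
df1 = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({0: [0.5, 0.5, 0.5], 1: [1.0, 1.0, 1.0]}, index=[0, 1, 3])

# Each method returns a new DataFrame, so the calls can be chained.
result = (
    (df1 - df2)          # rows 2 and 3 do not align and become NaN
    .dropna()            # keep only the rows present in both frames
    .rename(columns=lambda col: "d{}".format(col))
)

print(result)
```

No intermediate variable is mutated, so there is nothing for inplace=True to do in this style.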
The dropna() method returns a new DataFrame object unless the inplace parameter is set to True; in that case, dropna() removes the rows from the original DataFrame instead.
    df.dropna(axis='index', how='all', inplace=True)

In pandas, the above code means that pandas creates a copy of the original data, performs the required operation on it, and assigns the result back to the original data.
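A small check, on invented data, that the two spellings end up with the same contents: the non-inplace call leaves the original untouched and hands back a new object, while the inplace call changes the object the name already points to.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only row 1 is entirely missing.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Without inplace: a new DataFrame comes back and df is left as it was.
cleaned = df.dropna(axis="index", how="all")
assert cleaned is not df
assert len(df) == 3 and len(cleaned) == 2

# With inplace=True: df itself now holds the cleaned rows.
df.dropna(axis="index", how="all", inplace=True)
assert df.equals(cleaned)
```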
This is a copy of the explanation on GitHub.
There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference reassigned afterwards.
The reason for the difference in performance in this case is as follows.
The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).
This check must perform a garbage collection to wipe out some cache references to see if it's a copy. Unfortunately, Python syntax makes this unavoidable.
You can avoid this by simply making a copy first:

    df = (df1-df2).dropna().copy()

An inplace operation following this will be as performant as before.
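Here is a sketch of the copy-first variant next to the non-inplace rename, using the same sizes as the question. The helpers build_frames and rename_map are hypothetical names introduced only for this example. It checks that both routes give the same frame and makes no timing claim: whether the slowdown appears at all depends on the pandas version, so measure with %%timeit on your own setup.

```python
import numpy as np
import pandas as pd


def build_frames(r=7, c=3, seed=0):
    """Build two frames whose float indices differ in r // 3 positions."""
    rng = np.random.RandomState(seed)
    t = rng.rand(r)
    df1 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t)
    # Work on a copy so changing t cannot affect df1's index.
    t = t.copy()
    indx = rng.choice(r, r // 3, replace=False)
    t[indx] = rng.rand(len(indx))
    df2 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t)
    return df1, df2


def rename_map(df):
    """Map each column label col to 'd<col>', as in the question."""
    return {col: "d{}".format(col) for col in df.columns}


df1, df2 = build_frames()

# Copy first, then rename in place.
a = (df1 - df2).dropna().copy()
a.rename(columns=rename_map(a), inplace=True)

# Rename without inplace and rebind the name.
b = (df1 - df2).dropna()
b = b.rename(columns=rename_map(b))

assert a.equals(b)
print(list(a.columns))  # ['d0', 'd1', 'd2']
```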
My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.