Pandas: peculiar performance drop for inplace rename after dropna

Tags:

I have reported this as an issue on pandas issues. In the meanwhile I post this here hoping to save others time, in case they encounter similar issues.

Upon profiling a process which needed to be optimized I found that renaming columns NOT inplace improves performance (execution time) by x120. Profiling indicates this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example demonstrates a factor x12:

import pandas as pd import numpy as np

inplace=True

%%timeit np.random.seed(0) r,c = (7,3) t = np.random.rand(r) df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) indx = np.random.choice(range(r),r/3, replace=False) t[indx] = np.random.rand(len(indx)) df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) df = (df1-df2).dropna() ## inplace rename: df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

first output line of %%prun:

ncalls tottime percall cumtime percall filename:lineno(function)
1  0.018 0.018 0.018 0.018 {gc.collect} 

inplace=False

%%timeit np.random.seed(0) r,c = (7,3) t = np.random.rand(r) df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) indx = np.random.choice(range(r),r/3, replace=False) t[indx] = np.random.rand(len(indx)) df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) df = (df1-df2).dropna() ## avoid inplace: df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit np.random.seed(0) r,c = (7,3) t = np.random.rand(r) df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) indx = np.random.choice(range(r),r/3, replace=False) t[indx] = np.random.rand(len(indx)) df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) #no dropna: df = (df1-df2)#.dropna() ## inplace rename: df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 µs per loop

%%timeit np.random.seed(0) r,c = (7,3) t = np.random.rand(r) df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) indx = np.random.choice(range(r),r/3, replace=False) t[indx] = np.random.rand(len(indx)) df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t) ## no dropna df = (df1-df2)#.dropna() ## avoid inplace: df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 µs per loop

335

asked Mar 20 '14 11:03

eldad-a

1 Answers

This is a copy of the explanation on github.

There is no guarantee that an inplace operation is actually faster. Often they are actually the same operation that works on a copy, but the top-level reference is reassigned.

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).

This check must perform a garbage collection to wipe out some cache references to see if it's a copy. Unfortunately python syntax makes this unavoidable.

You can not have this happen, by simply making a copy first.

df = (df1-df2).dropna().copy()

followed by an inplace operation will be as performant as before.

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

answered Sep 21 '22 03:09

Jeff

Related questions
                            
                                Get the same hash value for a Pandas DataFrame each time
                            
                                Evaluate multiple scores on sklearn cross_val_score
                            
                                Clearing Tensorflow GPU memory after model execution
                            
                                AttributeError when using "import dateutil" and "dateutil.parser.parse()" but no problems when using "from dateutil import parser"
                            
                                pandas.io.json.json_normalize with very nested json
                            
                                Reversing a regular expression in Python
                            
                                How do I use Logging in the Django Debug Toolbar?
                            
                                Why is 'True == not False' a syntax error in Python?
                            
                                setting the default string value of Python's collections.defaultdict
                            
                                Logging requests to django-rest-framework
                            
                                Preserving column order in Python Pandas DataFrame
                            
                                Determine if given class attribute is a property or not, Python object
                            
                                Using NLTK and WordNet; how do I convert simple tense verb into its present, past or past participle form?
                            
                                Matplotlib not using latex font while text.usetex==True
                            
                                python numpy ndarray element-wise mean
                            
                                How to access the type arguments of typing.Generic?
                            
                                Django queryset filter for blank FileField?
                            
                                In what way is grequests asynchronous?
                            
                                RuntimeWarning: divide by zero encountered in log

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas: peculiar performance drop for inplace rename after dropna

Tags:

performance

python

pandas

in-place