Why is np.where faster than df.apply?

Tags: python, pandas, dataframe, numpy

Sample code is here

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer' : ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending' : [130,22,313,46]})

# [400000 rows x 2 columns] after the concat below
df = pd.concat([df]*100000).reset_index(drop=True)

In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop

In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop

Question taken from here: https://stackoverflow.com/a/41166160/3027854
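The %timeit lines above only work inside IPython. As a rough sketch for reproducing the comparison in a plain Python script, the standard timeit module can be used instead. The helper names grade_where and grade_apply are made up for this example, and the frame is deliberately smaller than in the question so that it finishes quickly; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the question so the script finishes quickly.
base = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                     'Spending': [130, 22, 313, 46]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 4000 rows x 2 columns


def grade_where(frame):
    # Hypothetical helper: one vectorized comparison over the whole column.
    return np.where(frame['Spending'] > 100, 'A', 'B')


def grade_apply(frame):
    # Hypothetical helper: the lambda is called once per row.
    return frame.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)


# Average wall-clock time per call over a few repetitions.
t_where = timeit.timeit(lambda: grade_where(df), number=5) / 5
t_apply = timeit.timeit(lambda: grade_apply(df), number=5) / 5
print(f"np.where: {t_where * 1e3:.3f} ms per call")
print(f"df.apply: {t_apply * 1e3:.3f} ms per call")
```

Both helpers produce the same grades, so only the elapsed time differs between them.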


asked Dec 15 '16 14:12

Vikash Singh

2 Answers

I think np.where is faster because it works on the underlying NumPy arrays in a vectorized way, and pandas is built on these arrays.

df.apply is slow because it uses loops: with axis=1 it calls the function once for every row.

Vectorized operations are the fastest, then Cython routines, and then apply.

See this answer with a better explanation from Jeff, a pandas developer.
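To make the difference concrete, here is a minimal sketch on the four original rows. The explicit iterrows loop is only a stand-in chosen for illustration, not how pandas implements apply internally; it just shows the shape of the work when a Python function runs once per row, compared with a single comparison over the whole column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# Vectorized path: the comparison yields one boolean array for the whole
# column, and np.where picks 'A' or 'B' element-wise in compiled code.
mask = (df['Spending'] > 100).to_numpy()
vectorized = np.where(mask, 'A', 'B')

# Illustrative stand-in for a row-wise apply: a Python-level loop that
# builds a row object and evaluates the condition once per row.
looped = []
for _, row in df.iterrows():
    looped.append('A' if row['Spending'] > 100 else 'B')

print(mask.tolist())        # [True, False, True, False]
print(vectorized.tolist())  # ['A', 'B', 'A', 'B']
print(looped)               # ['A', 'B', 'A', 'B']
```

The results are identical; the per-row Python overhead is what grows with the number of rows.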


answered Oct 19 '22 17:10

jezrael

Just adding a visualization to what has been said.

Profile and total cumulative time of df.apply: [image: df.apply profile, https://i.stack.imgur.com/2oi9R.png]

We can see that the cumulative time is 13.8 s.

Profile and total cumulative time of np.where: [image: np.where profile, https://i.stack.imgur.com/5ucEJ.png]

Here the cumulative time is 5.44 ms, which is about 2500 times faster than df.apply.

The figures above were obtained using the snakeviz library. Here is a link to the library.

SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
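For anyone who wants to produce similar profiles, one possible way is to record them with the standard cProfile module and then open the resulting file with snakeviz. This is only a sketch: the helper profile_to_file and the file names are invented for the example, the frame is smaller than in the question, and it is not necessarily how the figures above were generated.

```python
import cProfile
import os
import pstats
import tempfile

import numpy as np
import pandas as pd

base = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                     'Spending': [130, 22, 313, 46]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 4000 rows


def profile_to_file(func, path):
    """Hypothetical helper: run func under cProfile and save the stats to path."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    profiler.dump_stats(path)
    return pstats.Stats(path)


out_dir = tempfile.mkdtemp()
apply_path = os.path.join(out_dir, 'apply.prof')  # hypothetical file name
where_path = os.path.join(out_dir, 'where.prof')  # hypothetical file name

apply_stats = profile_to_file(
    lambda: df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1),
    apply_path,
)
where_stats = profile_to_file(
    lambda: np.where(df['Spending'] > 100, 'A', 'B'),
    where_path,
)

# Text summary of the five most expensive entries by cumulative time.
apply_stats.sort_stats('cumulative').print_stats(5)
where_stats.sort_stats('cumulative').print_stats(5)

# For the sunburst view, pass a saved file to snakeviz from a shell:
#   snakeviz <path to apply.prof>
print(apply_path)
print(where_path)
```

The text summary already shows how many more function calls the apply version makes; snakeviz presents the same data graphically.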

answered Oct 19 '22 17:10

MMF
