Sample code is here
import pandas as pd
import numpy as np
df = pd.DataFrame({'Customer' : ['Bob', 'Ken', 'Steve', 'Joe'],
'Spending' : [130,22,313,46]})
#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
Question taken from here: https://stackoverflow.com/a/41166160/3027854
NumPy provides n-dimensional arrays, data types (dtype), and related objects. Indexing a pandas Series is significantly slower than indexing a NumPy array.
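A minimal sketch of the indexing comparison, assuming a small integer array and element-by-element access in a Python loop. The helper names `sum_by_series_index` and `sum_by_array_index` are made up for illustration, and the timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

values = np.arange(1_000)
series = pd.Series(values)

def sum_by_series_index(s):
    # Element access goes through the pandas indexing layer on every step.
    total = 0
    for i in range(len(s)):
        total += s.iloc[i]
    return int(total)

def sum_by_array_index(a):
    # Element access goes straight to the NumPy array.
    total = 0
    for i in range(len(a)):
        total += a[i]
    return int(total)

series_total = sum_by_series_index(series)
array_total = sum_by_array_index(values)

t_series = timeit.timeit(lambda: sum_by_series_index(series), number=10)
t_array = timeit.timeit(lambda: sum_by_array_index(values), number=10)
print(f"Series indexing: {t_series:.4f} s, array indexing: {t_array:.4f} s (10 runs each)")
```

Both loops compute the same sum, so any difference in the printed times comes from the indexing itself.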
Approximately, 7000 times faster than the apply method, and 130 times faster than the numpy vectorize method!
If you want to do mathematical operations such as a dot product or calculating a mean, pandas DataFrames are generally going to be slower than NumPy arrays.
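As a rough illustration, the same dot product and mean can be computed on a DataFrame and on the NumPy array underneath it. The column names `x` and `y` and the values are made up for this sketch, and it only shows that the two routes agree, not how fast each is.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# Convert once, then work on the plain array.
arr = frame.to_numpy()          # shape (3, 2)

# Dot product of the two columns: 1*4 + 2*5 + 3*6
dot_np = arr[:, 0] @ arr[:, 1]
dot_pd = frame['x'].dot(frame['y'])

# Column means
mean_np = arr.mean(axis=0)
mean_pd = frame.mean()

print(dot_np, dot_pd)
print(mean_np, mean_pd.to_numpy())
```

On large inputs, wrapping each expression in `%timeit` (as in the question) would show the gap on your own machine.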
The apply() method: 811 times faster. This depends on the content of the apply expression. If it can be executed in Cython space, apply is much faster (which is the case here). We can use apply with a lambda function. All we have to do is specify the axis.
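A small sketch of what specifying the axis means, reusing the four-row frame from the question. With `axis=1` the lambda receives each row as a Series. Calling `apply` on a single column instead passes each scalar value. The variable names `row_grades` and `col_grades` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# axis=1: the lambda is called once per row and gets that row as a Series.
row_grades = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)

# Series.apply: the lambda is called once per value of one column.
col_grades = df['Spending'].apply(lambda s: 'A' if s > 100 else 'B')

print(row_grades.tolist())
print(col_grades.tolist())
```

Both calls produce the same grades. The column-wise form avoids building a Series object for every row.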
I think np.where is faster because it uses NumPy arrays in a vectorized way, and pandas is built on these arrays. df.apply is slow because it uses loops. Vectorized operations are the fastest, then Cython routines, and then apply.
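The comparison can be reproduced outside IPython with the standard `timeit` module. This is a sketch under assumptions: the frame is repeated 1,000 times instead of the question's 100,000 so the row-wise version finishes quickly, and the function names `grade_vectorized` and `grade_rowwise` are made up for illustration.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 1000).reset_index(drop=True)   # 4,000 rows

def grade_vectorized(frame):
    # One call over the whole column; the loop runs inside NumPy.
    return np.where(frame['Spending'] > 100, 'A', 'B')

def grade_rowwise(frame):
    # One Python-level lambda call per row.
    return frame.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)

vectorized = grade_vectorized(df)
rowwise = grade_rowwise(df)

t_vec = timeit.timeit(lambda: grade_vectorized(df), number=5)
t_row = timeit.timeit(lambda: grade_rowwise(df), number=5)
print(f"np.where: {t_vec:.4f} s, df.apply: {t_row:.4f} s (5 runs each)")
```

The two functions return the same labels, so the timing difference reflects only how the loop is executed.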
See this answer for a better explanation from Jeff, a pandas developer.
Just adding a visualization approach to what has been said.
Profile and total cumulative time of df.apply:
We can see that the cumulative time is 13.8 s.
Profile and total cumulative time of np.where:
Here, the cumulative time is 5.44 ms, which is about 2500 times faster than df.apply.
The figures above were obtained using the snakeviz library.
Here is a link to the library.
SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
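A profile file that SnakeViz can open can be produced with the standard library's `cProfile`. The following is a sketch, not the exact steps used for the figures: the file name `apply_profile.prof`, the temp-directory location, and the 1,000-fold frame size are all assumptions made here.

```python
import cProfile
import io
import os
import pstats
import tempfile

import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 1000).reset_index(drop=True)

# Profile only the df.apply call.
profiler = cProfile.Profile()
profiler.enable()
grades = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)
profiler.disable()

# Save the raw stats; the file name is hypothetical.
prof_path = os.path.join(tempfile.gettempdir(), 'apply_profile.prof')
profiler.dump_stats(prof_path)

# Text summary of the five entries with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

With snakeviz installed, running `snakeviz` on the saved `.prof` file from a shell should open the sunburst view in a browser.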