
Why is np.where faster than df.apply?

Sample code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer' : ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending' : [130,22,313,46]})

# 400000 rows x 2 columns here (4 once both grade columns are added)
df = pd.concat([df]*100000).reset_index(drop=True)

In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop

In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop

Question taken from here: https://stackoverflow.com/a/41166160/3027854

Vikash Singh asked Dec 15 '16 14:12



2 Answers

I think np.where is faster because it works on NumPy arrays in a vectorized way, and pandas is built on these arrays.

df.apply is slow because it loops over the rows in Python.

Vectorized operations are the fastest, then Cython routines, and then apply.

See this answer for a better explanation by Jeff, a pandas developer.
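To make the difference concrete, here is a minimal sketch that runs both approaches on the question's data and checks that they agree. It assumes a smaller frame (10000 rows) so the row-wise version finishes quickly, and the `timed` helper is a hypothetical name introduced only for this illustration. Timings will vary by machine and library version.

```python
import time

import numpy as np
import pandas as pd

# Same data as in the question, but only 10000 rows (an assumption made
# here so that the row-wise version runs in a reasonable time).
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 2500).reset_index(drop=True)


def timed(func):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Vectorized: one comparison and one selection over the whole column.
vectorized, t_where = timed(
    lambda: np.where(df['Spending'] > 100, 'A', 'B'))

# Row-wise: the lambda is called once per row from Python.
row_wise, t_apply = timed(
    lambda: df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B',
                     axis=1))

# Both approaches should produce the same grades.
same_result = vectorized.tolist() == row_wise.tolist()
print("same result:", same_result)
print(f"np.where: {t_where:.4f} s, df.apply: {t_apply:.4f} s")
```

With axis=1, the function passed to df.apply is called once per row, while np.where evaluates the whole column in a single call.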

jezrael answered Oct 19 '22 17:10


Just adding a visualization approach to what has been said.

Profile and total cumulative time of df.apply: [image: df.apply profile]

We can see that the cumulative time is 13.8 s.

Profile and total cumulative time of np.where: [image: np.where profile]

Here, the cumulative time is 5.44 ms, which is about 2500 times faster than df.apply.

The figures above were obtained using the snakeviz library. Here is a link to the library.

SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
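The answer does not say how the profiles were generated. One possible way, sketched here under that assumption, is to record each call with the standard-library cProfile module and save the stats to a file that snakeviz can open. The function names, the file names and the 10000-row frame are hypothetical choices for this illustration.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Smaller frame than in the question (an assumption, to keep the run short).
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df = pd.concat([df] * 2500).reset_index(drop=True)


def grade_with_apply():
    return df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)


def grade_with_where():
    return np.where(df['Spending'] > 100, 'A', 'B')


def profile_to_file(func, path):
    """Profile one call of func, save the stats to path, return total time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    profiler.dump_stats(str(path))
    # total_tt is the total time recorded across all profiled functions.
    return pstats.Stats(str(path)).total_tt


# Hypothetical output location: a fresh temporary directory.
profile_dir = Path(tempfile.mkdtemp())
apply_path = profile_dir / 'apply.prof'
where_path = profile_dir / 'where.prof'

apply_seconds = profile_to_file(grade_with_apply, apply_path)
where_seconds = profile_to_file(grade_with_where, where_path)

print(f"df.apply: {apply_seconds:.4f} s -> {apply_path}")
print(f"np.where: {where_seconds:.4f} s -> {where_path}")
```

Each saved file can then be opened from a shell with the snakeviz command followed by the printed path, which shows the sunburst view described above.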

MMF answered Oct 19 '22 17:10