Consider this pandas example where I calculate column C: it is A itself where B > 5, and otherwise the product 0.1 * A * B, computed with apply and a lambda function:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})
df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
The expected result would be:
A B C
0 1 9 1.0
1 2 8 2.0
2 3 7 3.0
3 4 6 4.0
4 5 5 2.5
5 6 4 2.4
6 7 3 2.1
7 8 2 1.6
8 9 1 0.9
The problem is that this code is slow and I need to do this operation on a dataframe with around 56 million rows.
The %timeit result of the above lambda operation is:
1000 loops, best of 3: 1.63 ms per loop
Judging from the calculation time, and also from the memory usage when doing this on my large dataframe, I presume this operation uses intermediary series while doing the calculations.
I tried to formulate it in different ways including using temporary columns, but every alternative solution I came up with is even slower.
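For example, one temporary-column variant I tried looked roughly like this (a sketch of the kind of alternative I mean, not my exact code):
df['tmp'] = 0.1 * df['A'] * df['B']      # product branch for every row
df['C'] = df['A'].astype(float)          # start from the A branch
df.loc[df['B'] <= 5, 'C'] = df['tmp']    # overwrite where the condition fails
df = df.drop('tmp', axis=1)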
Is there a way to get the result I need in a different and faster way, e.g. by using numpy?
For performance, you might be better off working with the underlying NumPy array and using np.where -
import numpy as np

a = df.values  # Assuming you have two columns A and B
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
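As a quick sanity check on the small 9-row frame from the question, this reproduces the expected column (values taken from the expected output listed above):
print(df['C'].tolist())
# [1.0, 2.0, 3.0, 4.0, 2.5, 2.4, 2.1, 1.6, 0.9]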
Runtime test
def numpy_based(df):
    a = df.values  # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
Timings -
In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 µs per loop
In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop
In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop
In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])
In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop
Closer look
Let's take a closer look at NumPy's number-crunching capability and bring pandas into the mix -
# Extract out as array (it's a view, so not really expensive
# .. as compared to the later computations themselves)
In [291]: a = df.values
In [296]: %timeit df.values
10000 loops, best of 3: 107 µs per loop
Case #1 : Work with NumPy array and use numpy.where :
In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 µs per loop
Again, assigning into a new column df['C'] would not be very expensive either -
In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 µs per loop
Case #2 : Work with pandas dataframe and use its .where method (no NumPy) -
In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop
Case #3 : Work with pandas dataframe (no NumPy array), but use numpy.where -
In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 µs per loop
Case #4 : Work with pandas dataframe again (no NumPy array), but use numpy.where with pandas .mul for the product -
In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 µs per loop
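One caveat on the array-based version: indexing a[:,0] and a[:,1] assumes A and B are the first two columns of the frame. If the frame can carry other columns, a sketch that selects the arrays by name avoids that assumption -
a_col = df['A'].values  # select by name instead of by position
b_col = df['B'].values
df['C'] = np.where(b_col > 5, a_col, 0.1 * a_col * b_col)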
pure pandas
using pd.Series.where
df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
A B C
0 1 9 1.0
1 2 8 2.0
2 3 7 3.0
3 4 6 4.0
4 5 5 2.5
5 6 4 2.4
6 7 3 2.1
7 8 2 1.6
8 9 1 0.9
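For clarity, Series.where keeps the caller's values where the condition holds and substitutes the second argument elsewhere, so the one-liner is roughly equivalent to this masked assignment (a sketch for illustration, not a performance recommendation):
c = df.A.astype(float).copy()
mask = ~df.B.gt(5)                        # rows where the condition fails
c[mask] = df.A[mask] * df.B[mask] * 0.1   # take the product branch there
df['C'] = c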
Pandas is a great tool for data manipulation but runs on a single CPU core by default. In addition, Pandas is built to run vectorized API functions on entire columns or datasets in one sweep, but apply runs custom user code. The other answers avoid the use of apply with custom code, but this may not be possible/practical in general. If processing large datasets with apply is a pain point for you, you should consider an acceleration and scaling solution such as Bodo. Bodo directly compiles your apply code to optimize it in ways that Pandas cannot. In addition to vectorizing your code, Bodo provides automatic parallelization. You can run your code on up to 4 cores with the Bodo community edition (which is free to use). Here is a link to the Bodo installation instructions: https://docs.bodo.ai/latest/source/installation_and_setup/install.html
I generated a similar dataset to yours but with 20 million rows and ran the code with regular Pandas on one core and with Bodo on 4 cores. With regular Pandas, it takes about 6.5 minutes to run your code while with Bodo’s community edition it takes around half a second.
# data generation
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,size=(20000000, 2)), columns=list('AB'))
df.to_parquet("data.pq")
Regular Pandas:
import pandas as pd
import time
start = time.time()
df = pd.read_parquet("data.pq")
df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
end = time.time()
print("computation time: ", end - start)
print(df.head())
output:
computation time: 378.3832001686096
A B C
0 3 5 1.5
1 8 6 8.0
2 1 7 1.0
3 8 1 0.8
4 4 8 4.0
With Bodo:
%%px
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['df'])
def apply():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df

df = apply()
output:
[stdout:0]
computation time: 0.3610380489999443
A B C
0 3 5 1.5
1 8 6 8.0
2 1 7 1.0
3 8 1 0.8
4 4 8 4.0
Disclaimer: I work as a data scientist advocate at Bodo.ai.
Using numpy.where:
import numpy
df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
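One general np.where caveat, harmless for simple arithmetic like 0.1 * A * B: both branches are evaluated for every row before the selection happens. A tiny illustration (hypothetical example values):
import numpy as np
a = np.array([1.0, 4.0])
# The sqrt below is computed for every element, even where the
# condition is False, so it emits a RuntimeWarning for sqrt(-1):
np.where(a > 2, np.sqrt(a - 2), a)  # -> [1.0, 1.4142...]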