In a Pandas DataFrame, I want to create a new column conditionally based on the value of another column. In my application, the DataFrame typically has a few million lines, and the number of unique conditional values is small, on the order of unity (here: 3). Performance is extremely important: what is the fastest way to generate the new column?
I created an example case below and compared several different methods.
In the example, the conditional filling is represented by a dictionary lookup based on the value of the column label (here: one of 1, 2, 3).
lookup_dict = {
1: 100, # arbitrary
2: 200, # arbitrary
3: 300, # arbitrary
}
I then expect my DataFrame to be filled as:
label output
0 3 300
1 2 200
2 3 300
3 3 300
4 2 200
5 2 200
6 1 100
7 1 100
Below are the 6 different methods tested on 10M lines (parameter Nlines in the test code):
pandas.groupby().apply()
pandas.groupby().indices.items()
pandas.Series.map
a naive for loop
numpy.select
numba
The full code is available at the end of the post, with the runtimes of all methods. The outputs of all methods are asserted to be equal before performances are compared.
pandas.groupby().apply()
I use pandas.groupby() on the label column, then fill each block with the same value using apply().
def fill_output(r):
''' called by groupby().apply(): all r.label values are the same '''
r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
return r
df = df.groupby('label').apply(fill_output)
I get:
>>> method_1_groupby ran in 2.29s (average over 3 iterations)
Note that groupby().apply() is run twice on the first group to determine which code path to use (see Pandas #2936). This can slow things down for a small number of groups. I tried to trick Method 1 by adding a first dummy group (sketched below), but I didn't get much improvement.
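For reference, here is a minimal sketch of that dummy-group trick; the dummy label 0, the temporary lookup_dict entry, and the cleanup step are my own choices and were not part of the timed benchmark:
def method_1b_dummygroup(df):
    ''' Sketch: prepend one row whose label sorts first, so the double
    call of apply() (Pandas #2936) lands on a cheap single-row group '''
    lookup_dict[0] = 0                                 # dummy entry, removed below
    df = pd.concat([pd.DataFrame({'label': [0]}), df], ignore_index=True)
    df = df.groupby('label').apply(fill_output)
    del lookup_dict[0]
    return df.drop(index=0).reset_index(drop=True)     # drop the dummy row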
pandas.groupby().indices.items()
Second is a variant: instead of using apply, I access the indices directly with groupby().indices.items(). This ends up being twice as fast as Method 1, and it's the method I've used for a long time:
dgb = df.groupby('label')
for label, idx in dgb.indices.items():
df.loc[idx, 'output'] = lookup_dict[label]
Got:
method_2_indices ran in 1.21s (average over 3 iterations)
pandas.Series.map
I used pandas.Series.map, passing lookup_dict.get:
df['output'] = df.label.map(lookup_dict.get)
I had very good results in similar cases where the number of looked-up values was comparable to the number of lines. In the present case, map ends up being twice as slow as Method 1.
method_3_map ran in 3.07s (average over 3 iterations)
I attribute that to the small number of lookup values, but there may just be an issue with the way I implemented it.
Naive for loop
The 4th method is quite naive: I just loop over all labels and select the matching part of the DataFrame.
for label, value in lookup_dict.items():
df.loc[df.label == label, 'output'] = value
Surprisingly, though, I ended up with much faster results than in the previous cases. I expected the groupby-based solutions to be faster than this one, because Pandas has to make three comparisons with df.label == label here. The results prove me wrong:
method_4_forloop ran in 0.54s (average over 3 iterations)
numpy.select
The fifth method uses the numpy.select function, based on this StackOverflow answer.
conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())
df['output'] = np.select(conditions, choices)
This yields the best results so far:
method_5_select ran in 0.29s (average over 3 iterations)
numba
Finally, I tried a numba approach in Method 6. Just for the sake of the example, the conditional filling values are hardcoded in the compiled function; I don't know how to give Numba a list as a runtime constant:
@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
output = np.zeros_like(column)
i = 0
for c in column:
if c == 1:
output[i] = 100
elif c == 2:
output[i] = 200
elif c == 3:
output[i] = 300
i += 1
return output
df['output'] = hardcoded_conditional_filling(df.label.values)
I ended up with the best time, about 50% faster than Method 5.
method_6_numba ran in 0.19s (average over 3 iterations)
I haven't implemented a generic version of this one for the reason stated above: I don't know how to give Numba a list as a runtime constant without a major drop in performance.
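A possible generic workaround (my own untested sketch, not part of the benchmark) is to pass the keys and values as int64 NumPy arrays and scan the small key array for each element:
@jit(int64[:](int64[:], int64[:], int64[:]), nopython=True)
def array_conditional_filling(column, keys, values):
    ''' Generic variant of Method 6: the lookup table is passed in
    as two arrays instead of being hardcoded '''
    output = np.zeros_like(column)
    for i in range(len(column)):
        for k in range(len(keys)):
            if column[i] == keys[k]:
                output[i] = values[k]
                break
    return output

keys = np.array(list(lookup_dict.keys()), dtype=np.int64)
values = np.array(list(lookup_dict.values()), dtype=np.int64)
df['output'] = array_conditional_filling(df.label.values, keys, values)
The full test code follows: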
import pandas as pd
import numpy as np
from timeit import timeit
from numba import jit, int64
lookup_dict = {
1: 100, # arbitrary
2: 200, # arbitrary
3: 300, # arbitrary
}
Nlines = int(1e7)
# Generate
label = np.round(np.random.rand(Nlines)*2+1).astype(np.int64)
df0 = pd.DataFrame(label, columns=['label'])
# Now the goal is to assign the look_up_dict values to a new column 'output'
# based on the value of label
# Method 1
# using groupby().apply()
def method_1_groupby(df):
def fill_output(r):
''' called by groupby().apply(): all r.label values are the same '''
#print(r.iloc[0]['label']) # activate to reveal the #2936 issue in Pandas
r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
return r
df = df.groupby('label').apply(fill_output)
return df
def method_2_indices(df):
dgb = df.groupby('label')
for label, idx in dgb.indices.items():
df.loc[idx, 'output'] = lookup_dict[label]
return df
def method_3_map(df):
df['output'] = df.label.map(lookup_dict.get)
return df
def method_4_forloop(df):
''' naive '''
for label, value in lookup_dict.items():
df.loc[df.label == label, 'output'] = value
return df
def method_5_select(df):
''' Based on answer from
https://stackoverflow.com/a/19913845/5622825
'''
conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())
df['output'] = np.select(conditions, choices)
return df
def method_6_numba(df):
''' This works, but it is hardcoded and i don't really know how
to make it compile with list as runtime constants'''
@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
output = np.zeros_like(column)
i = 0
for c in column:
if c == 1:
output[i] = 100
elif c == 2:
output[i] = 200
elif c == 3:
output[i] = 300
i += 1
return output
df['output'] = hardcoded_conditional_filling(df.label.values)
return df
df1 = method_1_groupby(df0)
df2 = method_2_indices(df0.copy())
df3 = method_3_map(df0.copy())
df4 = method_4_forloop(df0.copy())
df5 = method_5_select(df0.copy())
df6 = method_6_numba(df0.copy())
# make sure we haven't modified the input (would bias the results)
assert 'output' not in df0.columns
# Test validity
assert (df1 == df2).all().all()
assert (df1 == df3).all().all()
assert (df1 == df4).all().all()
assert (df1 == df5).all().all()
assert (df1 == df6).all().all()
# Compare performances
Nites = 3
print('Compare performances for {0:.1g} lines'.format(Nlines))
print('-'*30)
for method in [
'method_1_groupby', 'method_2_indices',
'method_3_map', 'method_4_forloop',
'method_5_select', 'method_6_numba']:
print('{0} ran in {1:.2f}s (average over {2} iterations)'.format(
method,
timeit("{0}(df)".format(method), setup="from __main__ import df0, {0}; df=df0.copy()".format(method), number=Nites)/Nites,
Nites))
Output:
Compare performances for 1e+07 lines
------------------------------
method_1_groupby ran in 2.29s (average over 3 iterations)
method_2_indices ran in 1.21s (average over 3 iterations)
method_3_map ran in 3.07s (average over 3 iterations)
method_4_forloop ran in 0.54s (average over 3 iterations)
method_5_select ran in 0.29s (average over 3 iterations)
method_6_numba ran in 0.19s (average over 3 iterations)
I'd be interested in any other solution that could yield better performances. I was originally looking for Pandas based methods, but I accept numba/cython based solutions too.
Adding Chrisb's methods for comparison:
def method_3b_mapdirect(df):
''' Suggested by https://stackoverflow.com/a/51388828/5622825'''
df['output'] = df.label.map(lookup_dict)
return df
def method_7_take(df):
''' Based on answer from
https://stackoverflow.com/a/19913845/5622825
Exploiting that labels are continuous integers
'''
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
return df
With runtimes of:
method_3b_mapdirect ran in 0.23s (average over 3 iterations)
method_7_take ran in 0.11s (average over 3 iterations)
This makes #3b faster than any other method (#6 aside), and the most elegant too. Use #7 if your use case is compatible.
I'd consider .map (#3) the idiomatic way to do this - but don't pass the .get - use the dictionary by itself, and you should see a pretty significant improvement.
df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})
%timeit df['output'] = df.label.map(lookup_dict.get)
261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df['output'] = df.label.map(lookup_dict)
69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the number of conditions is small, and the comparison cheap (i.e. ints and your lookup table), direct comparison of the values (#4 and especially #5) is faster than .map, but this wouldn't always be true, e.g. if you had a set of strings.
If your lookup labels really are contiguous integers, you can exploit this and look up using take, which should be about as fast as numba. I think this is basically as fast as this can go - you could write the equivalent in Cython, but it won't be quicker.
%%timeit
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
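If the labels are integers but not contiguous, a similar trick still works with a dense lookup array indexed directly by the label value. This is my own generalization sketch (it assumes non-negative integer labels of moderate magnitude) and is not benchmarked here:
# dense table: index = label value, entry = output value
keys = np.array(list(lookup_dict.keys()))
values = np.array(list(lookup_dict.values()))
lookup_arr = np.zeros(keys.max() + 1, dtype=values.dtype)
lookup_arr[keys] = values
df['output'] = lookup_arr.take(df['label'].values)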