
Fastest way to create a pandas column conditionally

In a Pandas DataFrame, I want to create a new column conditionally based on the value of another column. In my application, the DataFrame typically has a few million lines, and the number of unique conditional values is small (on the order of a handful). Performance is extremely important: what is the fastest way to generate the new column?

I created an example case below, and tried and compared different methods already. In the example, the conditional filling is represented by a dictionary lookup based on the value of the column label (here: one of 1, 2, 3).

lookup_dict = {
    1: 100,   # arbitrary
    2: 200,   # arbitrary
    3: 300,   # arbitrary
    }

I then expect my DataFrame to be filled as:

       label  output
0      3     300
1      2     200
2      3     300
3      3     300
4      2     200
5      2     200
6      1     100
7      1     100

Below are 6 different methods tested on 10M lines (parameter Nlines in the test code):

  • method 1: pandas.groupby().apply()
  • method 2: pandas.groupby().indices.items()
  • method 3: pandas.Series.map
  • method 4: for loop on labels
  • method 5: numpy.select
  • method 6: numba

The full code is available at the end of the answer, with the runtimes of all methods. The outputs of all methods are asserted to be equal before performance is compared.

method 1: pandas.groupby().apply()

I use pandas.groupby() on the label, then fill each block with the same value using apply().

def fill_output(r):
    ''' called by groupby().apply(): all r.label values are the same '''
    r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
    return r

df = df.groupby('label').apply(fill_output)

I get

method_1_groupby ran in 2.29s (average over 3 iterations)

Note that groupby().apply() is run twice on the first group to determine which code path to use (see Pandas #2936). This can slow things down when there are few groups. I tried tricking Method 1 by adding a first dummy group, but I didn't get much improvement.
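For reference, the dummy-group workaround looks roughly like this (a sketch on a tiny frame; the sentinel value DUMMY is my own choice and must not occur in the real data):

```python
import pandas as pd

lookup_dict = {1: 100, 2: 200, 3: 300}
DUMMY = -1  # hypothetical sentinel label, absent from the real data
lookup_dict_padded = {**lookup_dict, DUMMY: 0}

df = pd.DataFrame({'label': [3, 2, 3, 1]})

# Prepend a one-row dummy group: with sorted groups, apply()'s extra
# first call (Pandas #2936) hits the cheap dummy group instead of a real one
df_padded = pd.concat([pd.DataFrame({'label': [DUMMY]}), df],
                      ignore_index=True)

def fill_output(r):
    ''' called by groupby().apply(): all r.label values are the same '''
    r = r.copy()
    r['output'] = lookup_dict_padded[r.iloc[0]['label']]
    return r

out = df_padded.groupby('label', group_keys=False).apply(fill_output)
# Restore the original row order and drop the dummy row
out = out[out.label != DUMMY].sort_index().reset_index(drop=True)
```

As noted above, this didn't buy much in practice, since the double call only affects one group out of a handful.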

method 2: pandas.groupby().indices.items()

The second method is a variant: instead of using apply, I access the indices directly with groupby().indices.items(). This ends up being twice as fast as Method 1, and it's the method I've used for a long time.

dgb = df.groupby('label')
for label, idx in dgb.indices.items():
    df.loc[idx, 'output'] = lookup_dict[label]

Got:

method_2_indices ran in 1.21s (average over 3 iterations)

method 3: pandas.Series.map

I used Pandas.Series.map.

df['output'] = df.label.map(lookup_dict.get)

I had very good results in similar cases where the number of looked-up values was comparable to the number of lines. In the present case, map ends up being about a third slower than Method 1.

method_3_map ran in 3.07s (average over 3 iterations)

I attribute that to the small number of look up values, but there may just be an issue with the way I implemented it.

method 4: for loop on labels

The 4th method is quite naive: I just loop over all labels and select the matching part of the DataFrame.

for label, value in lookup_dict.items():
    df.loc[df.label == label, 'output'] = value

Surprisingly, though, I ended up with much faster results than in the previous cases. I expected the groupby-based solutions to be faster than this one, because Pandas has to make three full-column comparisons with df.label == label here. The results prove me wrong:

method_4_forloop ran in 0.54s (average over 3 iterations)

method 5: numpy.select

The fifth method uses the numpy select function, based on this StackOverflow answer.

conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())

df['output'] = np.select(conditions, choices)

This yields the best results:

method_5_select ran in 0.29s (average over 3 iterations)

Finally, I tried a numba approach in Method 6.

method 6: numba

Just for the sake of the example, the conditional filling values are hardcoded in the compiled function; I don't know how to give Numba a list as a runtime constant:

@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
    output = np.zeros_like(column)
    i = 0
    for c in column:
        if c == 1:
            output[i] = 100
        elif c == 2:
            output[i] = 200
        elif c == 3:
            output[i] = 300
        i += 1
    return output

df['output'] = hardcoded_conditional_filling(df.label.values)

I ended up with the best time so far, roughly 1.5× faster than Method 5 (0.19s vs 0.29s).

method_6_numba ran in 0.19s (average over 3 iterations)

I haven't adopted this approach in my real code, for the reason stated above: I don't know how to give Numba a lookup table as a runtime constant without a major drop in performance.


Full code

import pandas as pd
import numpy as np
from timeit import timeit
from numba import jit, int64

lookup_dict = {
        1: 100,   # arbitrary
        2: 200,   # arbitrary
        3: 300,   # arbitrary
        }

Nlines = int(1e7)

# Generate random labels in {1, 2, 3}
label = np.round(np.random.rand(Nlines)*2+1).astype(np.int64)
df0 = pd.DataFrame(label, columns=['label'])

# Now the goal is to assign the look_up_dict values to a new column 'output' 
# based on the value of label

# Method 1
# using groupby().apply()

def method_1_groupby(df):

    def fill_output(r):
        ''' called by groupby().apply(): all r.label values are the same '''
        #print(r.iloc[0]['label'])   # activate to reveal the #2936 issue in Pandas
        r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
        return r

    df = df.groupby('label').apply(fill_output)
    return df 

def method_2_indices(df):

    dgb = df.groupby('label')
    for label, idx in dgb.indices.items():
        df.loc[idx, 'output'] = lookup_dict[label]

    return df

def method_3_map(df):

    df['output'] = df.label.map(lookup_dict.get)

    return df

def method_4_forloop(df):
    ''' naive '''

    for label, value in lookup_dict.items():
        df.loc[df.label == label, 'output'] = value

    return df

def method_5_select(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825
    '''

    conditions = [df.label == k for k in lookup_dict.keys()]
    choices = list(lookup_dict.values())

    df['output'] = np.select(conditions, choices)

    return df

def method_6_numba(df):
    ''' This works, but the lookup is hardcoded, and I don't really know
    how to make it compile with a list as a runtime constant '''


    @jit(int64[:](int64[:]), nopython=True)
    def hardcoded_conditional_filling(column):
        output = np.zeros_like(column)
        i = 0
        for c in column:
            if c == 1:
                output[i] = 100
            elif c == 2:
                output[i] = 200
            elif c == 3:
                output[i] = 300
            i += 1
        return output

    df['output'] = hardcoded_conditional_filling(df.label.values)

    return df

df1 = method_1_groupby(df0.copy())
df2 = method_2_indices(df0.copy())
df3 = method_3_map(df0.copy())
df4 = method_4_forloop(df0.copy())
df5 = method_5_select(df0.copy())
df6 = method_6_numba(df0.copy())

# make sure we havent modified the input (would bias the results)
assert 'output' not in df0.columns 

# Test validity
assert (df1 == df2).all().all()
assert (df1 == df3).all().all()
assert (df1 == df4).all().all()
assert (df1 == df5).all().all()
assert (df1 == df6).all().all()

# Compare performances
Nites = 3
print('Compare performances for {0:.1g} lines'.format(Nlines))
print('-'*30)
for method in [
               'method_1_groupby', 'method_2_indices', 
               'method_3_map', 'method_4_forloop', 
               'method_5_select', 'method_6_numba']:
    print('{0} ran in {1:.2f}s (average over {2} iterations)'.format(
            method, 
            timeit("{0}(df)".format(method), setup="from __main__ import df0, {0}; df=df0.copy()".format(method), number=Nites)/Nites,
            Nites))

Output:

Compare performances for 1e+07 lines
------------------------------
method_1_groupby ran in 2.29s (average over 3 iterations)
method_2_indices ran in 1.21s (average over 3 iterations)
method_3_map ran in 3.07s (average over 3 iterations)
method_4_forloop ran in 0.54s (average over 3 iterations)
method_5_select ran in 0.29s (average over 3 iterations)
method_6_numba ran in 0.19s (average over 3 iterations)

I'd be interested in any other solution that could yield better performances. I was originally looking for Pandas based methods, but I accept numba/cython based solutions too.


Edit

Adding chrisb's methods for comparison:

def method_3b_mapdirect(df):
    ''' Suggested by https://stackoverflow.com/a/51388828/5622825'''

    df['output'] = df.label.map(lookup_dict)

    return df

def method_7_take(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825

    Exploiting that labels are continuous integers
    '''

    lookup_arr = np.array(list(lookup_dict.values()))
    df['output'] = lookup_arr.take(df['label'] - 1)

    return df

With runtimes of:

method_3b_mapdirect ran in 0.23s (average over 3 iterations)
method_7_take ran in 0.11s (average over 3 iterations)

This makes #3b faster than any other method (#6 aside), and the most elegant too. Use #7 if your use case is compatible.
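If the labels are not contiguous integers, a variant of #7 may still apply (a sketch; the label values below are made up): build the positional index at runtime with pd.factorize, then reorder the lookup values to match:

```python
import numpy as np
import pandas as pd

lookup_dict = {10: 100, 42: 200, 99: 300}  # hypothetical non-contiguous labels
df = pd.DataFrame({'label': [99, 42, 99, 10]})

# codes maps each row to a position 0..n-1; uniques lists labels in order
# of first appearance, so we reorder the lookup values accordingly
codes, uniques = pd.factorize(df['label'])
values = np.array([lookup_dict[u] for u in uniques])
df['output'] = values.take(codes)
```

The factorize pass adds some overhead over a plain take, so #7 remains preferable when the labels really are contiguous.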

asked Jul 17 '18 by erwanp



1 Answer

I'd consider .map (#3) the idiomatic way to do this - but don't pass the .get: use the dictionary by itself, and you should see a pretty significant improvement.

df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})

%timeit df['output'] = df.label.map(lookup_dict.get)
261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['output'] = df.label.map(lookup_dict)
69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If the number of conditions is small, and the comparison cheap (i.e. ints and your lookup table), direct comparison of the values (4 and especially 5) is faster than .map, but this wouldn't always be true, e.g. if you had a set of strings.
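To illustrate the string case (a minimal sketch with made-up labels): .map works unchanged, while the integer-indexing tricks don't apply directly:

```python
import pandas as pd

lookup_dict = {'a': 100, 'b': 200, 'c': 300}  # hypothetical string labels
df = pd.DataFrame({'label': ['c', 'b', 'c', 'a']})

# Dictionary lookup per element; no arithmetic on the labels is needed
df['output'] = df['label'].map(lookup_dict)
```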

If your lookup labels really are contiguous integers, you can exploit this and look up using a take, which should be about as fast as numba. I think this is basically as fast as this can go - you could write the equivalent in cython, but it won't be quicker.

%%timeit
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
answered Sep 28 '22 by chrisb