In a Pandas DataFrame, I want to create a new column conditionally based on the value of another column. In my application, the DataFrame typically has a few million lines, and the number of unique conditional values is small, on the order of unity (here: 3). Performance is extremely important: what is the fastest way to generate the new column?
I created an example case below and compared several different methods.
In the example, the conditional filling is represented by a dictionary lookup based on the value of the column label (here: one of 1, 2, 3).
lookup_dict = {
1: 100, # arbitrary
2: 200, # arbitrary
3: 300, # arbitrary
}
I then expect my DataFrame to be filled as:
label output
0 3 300
1 2 200
2 3 300
3 3 300
4 2 200
5 2 200
6 1 100
7 1 100
Below are the 6 different methods tested on 10M lines (parameter Nlines in the test code):
pandas.groupby().apply()
pandas.groupby().indices.items()
pandas.Series.map
a naive for loop
numpy.select
numba
The full code is available at the end of the post, with the runtimes of all methods. The outputs of all methods are asserted to be equal before performances are compared.
pandas.groupby().apply()
I use pandas.groupby() on the label column, then fill each block with the same value using apply().
def fill_output(r):
''' called by groupby().apply(): all r.label values are the same '''
r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
return r
df = df.groupby('label').apply(fill_output)
I get:
>>> method_1_groupby ran in 2.29s (average over 3 iterations)
Note that groupby().apply() is run twice on the first group to determine which code path to use (see Pandas #2936). This can slow things down for a small number of groups. I tried to trick Method 1 by adding a first dummy group (sketched below), but I didn't get much improvement.
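For reference, here is a minimal sketch of that dummy-group trick; the dummy label 0, the temporary lookup_dict entry, and the cleanup step are my own choices and were not part of the timed benchmark:
def method_1b_dummygroup(df):
    ''' Sketch: prepend one row whose label sorts first, so the double
    call of apply() (Pandas #2936) lands on a cheap single-row group '''
    lookup_dict[0] = 0                                 # dummy entry, removed below
    df = pd.concat([pd.DataFrame({'label': [0]}), df], ignore_index=True)
    df = df.groupby('label').apply(fill_output)
    del lookup_dict[0]
    return df.drop(index=0).reset_index(drop=True)     # drop the dummy row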
pandas.groupby().indices.items()
Second is a variant: instead of using apply, I access the indices directly with groupby().indices.items(). This ends up being twice as fast as Method 1, and it's the method I've used for a long time:
dgb = df.groupby('label')
for label, idx in dgb.indices.items():
df.loc[idx, 'output'] = lookup_dict[label]
Got:
method_2_indices ran in 1.21s (average over 3 iterations)
pandas.Series.map
I used pandas.Series.map, passing lookup_dict.get:
df['output'] = df.label.map(lookup_dict.get)
I had very good results in similar cases where the number of looked-up values was comparable to the number of lines. In the present case, map ends up being twice as slow as Method 1.
method_3_map ran in 3.07s (average over 3 iterations)
I attribute that to the small number of lookup values, but there may just be an issue with the way I implemented it.
Naive for loop
The 4th method is quite naive: I just loop over all labels and select the matching part of the DataFrame.
for label, value in lookup_dict.items():
df.loc[df.label == label, 'output'] = value
Surprisingly, though, I ended up with much faster results than in the previous cases. I expected the groupby-based solutions to be faster than this one, because Pandas has to make three comparisons with df.label == label here. The results prove me wrong:
method_4_forloop ran in 0.54s (average over 3 iterations)
numpy.select
The fifth method uses the numpy.select function, based on this StackOverflow answer.
conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())
df['output'] = np.select(conditions, choices)
This yields the best results so far:
method_5_select ran in 0.29s (average over 3 iterations)
numba
Finally, I tried a numba approach in Method 6. Just for the sake of the example, the conditional filling values are hardcoded in the compiled function; I don't know how to give Numba a list as a runtime constant:
@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
output = np.zeros_like(column)
i = 0
for c in column:
if c == 1:
output[i] = 100
elif c == 2:
output[i] = 200
elif c == 3:
output[i] = 300
i += 1
return output
df['output'] = hardcoded_conditional_filling(df.label.values)
I ended up with the best time, about 50% faster than Method 5.
method_6_numba ran in 0.19s (average over 3 iterations)
I haven't implemented a generic version of this one for the reason stated above: I don't know how to give Numba a list as a runtime constant without a major drop in performance.
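A possible generic workaround (my own untested sketch, not part of the benchmark) is to pass the keys and values as int64 NumPy arrays and scan the small key array for each element:
@jit(int64[:](int64[:], int64[:], int64[:]), nopython=True)
def array_conditional_filling(column, keys, values):
    ''' Generic variant of Method 6: the lookup table is passed in
    as two arrays instead of being hardcoded '''
    output = np.zeros_like(column)
    for i in range(len(column)):
        for k in range(len(keys)):
            if column[i] == keys[k]:
                output[i] = values[k]
                break
    return output

keys = np.array(list(lookup_dict.keys()), dtype=np.int64)
values = np.array(list(lookup_dict.values()), dtype=np.int64)
df['output'] = array_conditional_filling(df.label.values, keys, values)
The full test code follows: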
import pandas as pd
import numpy as np
from timeit import timeit
from numba import jit, int64
lookup_dict = {
1: 100, # arbitrary
2: 200, # arbitrary
3: 300, # arbitrary
}
Nlines = int(1e7)
# Generate
label = np.round(np.random.rand(Nlines)*2+1).astype(np.int64)
df0 = pd.DataFrame(label, columns=['label'])
# Now the goal is to assign the look_up_dict values to a new column 'output'
# based on the value of label
# Method 1
# using groupby().apply()
def method_1_groupby(df):
def fill_output(r):
''' called by groupby().apply(): all r.label values are the same '''
#print(r.iloc[0]['label']) # activate to reveal the #2936 issue in Pandas
r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
return r
df = df.groupby('label').apply(fill_output)
return df
def method_2_indices(df):
dgb = df.groupby('label')
for label, idx in dgb.indices.items():
df.loc[idx, 'output'] = lookup_dict[label]
return df
def method_3_map(df):
df['output'] = df.label.map(lookup_dict.get)
return df
def method_4_forloop(df):
''' naive '''
for label, value in lookup_dict.items():
df.loc[df.label == label, 'output'] = value
return df
def method_5_select(df):
''' Based on answer from
https://stackoverflow.com/a/19913845/5622825
'''
conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())
df['output'] = np.select(conditions, choices)
return df
def method_6_numba(df):
''' This works, but it is hardcoded and i don't really know how
to make it compile with list as runtime constants'''
@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
output = np.zeros_like(column)
i = 0
for c in column:
if c == 1:
output[i] = 100
elif c == 2:
output[i] = 200
elif c == 3:
output[i] = 300
i += 1
return output
df['output'] = hardcoded_conditional_filling(df.label.values)
return df
df1 = method_1_groupby(df0)
df2 = method_2_indices(df0.copy())
df3 = method_3_map(df0.copy())
df4 = method_4_forloop(df0.copy())
df5 = method_5_select(df0.copy())
df6 = method_6_numba(df0.copy())
# make sure we haven't modified the input (would bias the results)
assert 'output' not in df0.columns
# Test validity
assert (df1 == df2).all().all()
assert (df1 == df3).all().all()
assert (df1 == df4).all().all()
assert (df1 == df5).all().all()
assert (df1 == df6).all().all()
# Compare performances
Nites = 3
print('Compare performances for {0:.1g} lines'.format(Nlines))
print('-'*30)
for method in [
'method_1_groupby', 'method_2_indices',
'method_3_map', 'method_4_forloop',
'method_5_select', 'method_6_numba']:
print('{0} ran in {1:.2f}s (average over {2} iterations)'.format(
method,
timeit("{0}(df)".format(method), setup="from __main__ import df0, {0}; df=df0.copy()".format(method), number=Nites)/Nites,
Nites))
Output:
Compare performances for 1e+07 lines
------------------------------
method_1_groupby ran in 2.29s (average over 3 iterations)
method_2_indices ran in 1.21s (average over 3 iterations)
method_3_map ran in 3.07s (average over 3 iterations)
method_4_forloop ran in 0.54s (average over 3 iterations)
method_5_select ran in 0.29s (average over 3 iterations)
method_6_numba ran in 0.19s (average over 3 iterations)
I'd be interested in any other solution that could yield better performances. I was originally looking for Pandas based methods, but I accept numba/cython based solutions too.
Adding Chrisb's methods for comparison:
def method_3b_mapdirect(df):
''' Suggested by https://stackoverflow.com/a/51388828/5622825'''
df['output'] = df.label.map(lookup_dict)
return df
def method_7_take(df):
''' Based on answer from
https://stackoverflow.com/a/19913845/5622825
Exploiting that labels are continuous integers
'''
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
return df
With runtimes of:
method_3b_mapdirect ran in 0.23s (average over 3 iterations)
method_7_take ran in 0.11s (average over 3 iterations)
This makes #3b faster than any other method (#6 aside), and the most elegant too. Use #7 if your use case is compatible.
I'd consider .map (#3) the idiomatic way to do this - but don't pass the .get - use the dictionary by itself, and you should see a pretty significant improvement.
df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})
%timeit df['output'] = df.label.map(lookup_dict.get)
261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df['output'] = df.label.map(lookup_dict)
69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the number of conditions is small, and the comparison cheap (i.e. ints and your lookup table), direct comparison of the values (#4 and especially #5) is faster than .map, but this wouldn't always be true, e.g. if you had a set of strings.
If your lookup labels really are contiguous integers, you can exploit this and look up using take, which should be about as fast as numba. I think this is basically as fast as this can go - you could write the equivalent in Cython, but it won't be quicker.
%%timeit
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
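If the labels are integers but not contiguous, a similar trick still works with a dense lookup array indexed directly by the label value. This is my own generalization sketch (it assumes non-negative integer labels of moderate magnitude) and is not benchmarked here:
# dense table: index = label value, entry = output value
keys = np.array(list(lookup_dict.keys()))
values = np.array(list(lookup_dict.values()))
lookup_arr = np.zeros(keys.max() + 1, dtype=values.dtype)
lookup_arr[keys] = values
df['output'] = lookup_arr.take(df['label'].values)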