Pandas - Explanation on apply function being slow

Tags:

pandas

Apply function seems to work very slow with a large dataframe (about 1~3 million rows).

I have checked related questions here, like Speed up Pandas apply function, and Counting within pandas apply() function, it seems the best way to speed it up is not to use apply function :)

For my case, I have two kinds of tasks to do with the apply function.

First: apply with lookup dict query

f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # it's another dict works like lookup table
df = df.apply(f, args=(p_dict,))

Second: apply with groupby

f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].ix[:,2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f, df['WEEK'])

I guess for the fist case, it could be done with dataframe join, while I am not sure about resource cost for such join on a large dataset.

My question is:

Is there any way to substitute apply in the two above cases?
Why is apply so slow? For the dict lookup case, I think it should be O(N), it shouldn't cost that much even if N is 1 million.

213

asked Aug 01 '16 11:08

1 Answers

Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame, you have considerable IO overhead (I cannot guarantee this is the case 100% of the time since Pandas has loads of internal implementation optimization).

For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:

import string, sys

import numpy as np
import pandas as pd

##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

# Apply the f1 method  as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop

# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)

>>> 1000 loops, best of 5: 826 µs per loop

Concerning the second task, we can quickly add a new column to p_dict that calculates a mean where the time window starts at min_week_num and ends at the week for that row in p_dict. This requires that p_dict is sorted by ascending order along the WEEK column. Then you can use pd.merge again.

I am assuming that min_week_num is 0 in the following example. But you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method will run in O(n) since it conducts a fixed number of operations per iteration.

n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out

p_dict['Means'] = rolling_growing_mean(p_dict['value'])

df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')

answered Oct 31 '22 10:10

andrew

Related questions
                            
                                Python classes: Inheritance vs Instantiation
                            
                                How do you get the display width of combined Unicode characters in Python 3?
                            
                                Iterating through a Spark RDD
                            
                                Different behavior in python script and python idle?
                            
                                Flask CORS - no Access-control-allow-origin header present on a redirect()
                            
                                Why use __unicode__(self) method for django 1.7+? [closed]
                            
                                In PyQt, what is the best way to share data between the main window and a thread
                            
                                How do I correctly inherit templates in flask that use bootstrap?
                            
                                How do I get union keys of `a` and `b` dictionary and 'a' values? [duplicate]
                            
                                How do I update Kivy elements from a thread?
                            
                                What is the best way for a class to reference itself in a class attribute?
                            
                                How to register "atexit" function in python's multiprocessing subprocess?
                            
                                Keep the order of list in sql pagination
                            
                                An Object is created twice in Python
                            
                                Paraview: Changing aspect ratio of axes in rendering window
                            
                                How to get exit code from subprocess.Popen?
                            
                                Python pandas: Add column to grouped DataFrame with method chaining
                            
                                Recursively search for parent child combinations and build tree in python and XML
                            
                                Pycharm code reformatting: align lines by operator
                            
                                Django Prefetch with custom queryset which uses managers method

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas - Explanation on apply function being slow

Tags:

python

pandas

linpingta

People also ask

1 Answers

andrew

Recent Activity

Donate For Us