The apply function seems to run very slowly on a large DataFrame (about 1–3 million rows).
I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is not to use the apply function at all :)
In my case, I have two kinds of tasks to do with the apply function.
First: apply with lookup dict query
def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works as a lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])
I guess for the first case, it could be done with a DataFrame join, though I am not sure about the resource cost of such a join on a large dataset.
My question is: is there a better (faster) way to accomplish these two tasks than using apply?
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame there is considerable I/O overhead (I cannot guarantee this is the case 100% of the time, since Pandas has loads of internal implementation optimizations).
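To make the vectorization gap concrete, here is a minimal, illustrative comparison (my addition, not part of the original answer; exact timings vary by machine):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))
%timeit s * 2                      # vectorized: one call into compiled code
%timeit s.apply(lambda x: x * 2)   # apply: one Python-level call per element

On typical hardware the vectorized version is faster by two to three orders of magnitude, which is roughly the gap you see whenever apply stands in for a vectorized operation.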
For your first method, I assume you are trying to fill a 'value' column in df using p_dict as a lookup table. It is about 1000x faster to use pd.merge:
import string
import numpy as np
import pandas as pd

##
# Part 1 - filling a column by a lookup table
##

def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 µs per loop
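As an aside (my addition, not part of the original answer): for a pure lookup like this, Series.map with the lookup table converted to an indexed Series is another fast option. Unlike an inner merge, it preserves df's row order and keeps rows whose p_id has no match (they get NaN):

# Build a Series indexed by ID, then map each p_id to its value
lookup = p_dict.set_index('ID')['value']
df['value'] = df['p_id'].map(lookup)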
Concerning the second task, we can quickly add a new column to p_dict that calculates a mean where the time window starts at min_week_num and ends at the week for that row in p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.
I am assuming that min_week_num is 0 in the following example, but you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method will run in O(n) since it conducts a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # The time window for taking the mean grows by one element each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
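As a side note (my addition, not part of the original answer): when min_week_num is 0, pandas' built-in expanding window computes the same growing mean as rolling_growing_mean without the explicit loop:

# Expanding-window mean over 'value'; matches rolling_growing_mean row for row
p_dict['Means'] = p_dict['value'].expanding().mean()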