I'm having trouble with the performance of pandas' resample function in combination with groupby. The operation I'm doing currently takes 8+ seconds on a data sample of 5000 rows, which is completely unreasonable for my requirements.
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
I have data with dates at quarterly intervals. I want to group by a column and then resample the dates within each group to monthly frequency.
Input:
isin   report_date  val
SE001  2018-12-31   1
SE001  2018-09-30   2
SE001  2018-06-30   3
US001  2018-10-31   4
US001  2018-07-31   5
Output:
isin   report_date  val
SE001  2018-12-31   1
       2018-11-30   NaN
       2018-10-31   NaN
       2018-09-30   2
       2018-08-31   NaN
       2018-07-31   NaN
       2018-06-30   3
US001  2018-10-31   4
       2018-09-30   NaN
       2018-08-31   NaN
       2018-07-31   5
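For reference, a minimal construction of the sample input above (a sketch; it assumes report_date is parsed to datetime):

import pandas as pd

df = pd.DataFrame({
    'isin': ['SE001', 'SE001', 'SE001', 'US001', 'US001'],
    'report_date': pd.to_datetime(['2018-12-31', '2018-09-30', '2018-06-30',
                                   '2018-10-31', '2018-07-31']),
    'val': [1, 2, 3, 4, 5],
})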
I used to have this operation:
df.groupby('isin').resample('M', on="report_date").first()[::-1]
Since asfreq() seems to perform slightly better than passing on= to resample, I currently do the following instead. It's still slow, though. I reverse the result because resample sorts the dates in ascending order with no option to change that, and I want them descending.
df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]
As stated, with 5000 rows and around 16 columns this takes about 15 seconds to run in total, since I need to do it on two separate dataframes. With the sample data in the pastebin (500 rows), the operation takes 0.7 s, which is far too long given that my final data will have 800k rows.
Current way
setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s
Old way
groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
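The exact timing harness isn't shown; a minimal sketch of how per-step timings like these can be collected (materializing each step separately and timing it with time.perf_counter):

import time

t0 = time.perf_counter()
indexed = df.set_index('report_date')
print('setindex', time.perf_counter() - t0)

t0 = time.perf_counter()
grouped = indexed.groupby('isin')
print('groupby', time.perf_counter() - t0)

t0 = time.perf_counter()
resampler = grouped.resample('M')   # lazy: returns a DatetimeIndexResamplerGroupby
print('resample', time.perf_counter() - t0)

t0 = time.perf_counter()
out = resampler.asfreq()            # materialization happens here and dominates the runtime
print('asfreq', time.perf_counter() - t0)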
Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby back to a DataFrame is what takes so long. Now what?
EDIT2: I tried replacing resample with a groupby/apply that reindexes each group directly:

df.set_index('report_date').groupby('isin').apply(
    lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'),
                        fill_value=0))[::-1]

This takes 0.28 s, which is a vast improvement. It is still not very fast, though.
How can I speed this up? Is there another way to do the same thing?
I cut execution time for a 25k-row test data set from 850 ms to 320 ms. I wrapped the original reindex logic in a function to make timing easier:
def orig_pipeline(df):
    return (df
            .set_index('report_date')
            .groupby('isin')
            .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                                     x.index.max(),
                                                     freq='M'),
                                       fill_value=0))
            [::-1])
Then, I created new functions to make date arithmetic and reindexing faster:
def create_params(df):
    return (df.groupby('isin')['report_date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='1999-12-31', end='2020-12-31', freq='M')
    midx = (
        (row.isin, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['isin', 'report_date'])

def apply_mulitindex(df, midx):
    return df.set_index(['isin', 'report_date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
The old and new pipelines give the same results (except possibly for sort order):
v1 = orig_pipeline(df).drop(columns='isin').sort_index()
v2 = new_pipeline(df).sort_index().fillna(0)
assert (v1 == v2).all().all()
Timing results:
%%timeit
v1 = orig_pipeline(df_big)
854 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
v2 = new_pipeline(df_big)
322 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would like to illustrate the experiments I ran to figure out which solution yields the best performance, and they show that @jsmart's is the fastest.
My dataset looks like the following (shown as a screenshot, since I could not manage to paste a pretty table):
My goal is to obtain, for each (orgacom, client) couple, the indicators resampled at business-day frequency.
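Since the screenshot cannot be reproduced here, the following is a hypothetical stand-in with the same shape (the group columns orgacom and client and the date index match the code below; the indicator column name is made up):

import pandas as pd

# Hypothetical stand-in for the screenshotted dataset: indicator values
# observed on irregular dates for each (orgacom, client) couple.
to_process = pd.DataFrame(
    {
        'orgacom': pd.Categorical(['A', 'A', 'B', 'B']),
        'client': pd.Categorical(['c1', 'c1', 'c2', 'c2']),
        'indicator': [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.DatetimeIndex(['2017-01-02', '2017-01-06',
                            '2017-03-01', '2017-03-08'], name='date'),
)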
Solution 1: groupby / apply asfreq
%%time
sol1 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
    .apply(lambda x: x.asfreq('B', fill_value=np.nan))
)
CPU times: user 4min 6s, sys: 2.91 s, total: 4min 9s
Wall time: 4min 9s
Solution 2: groupby / apply reindex (as in @jokab's EDIT2)
%%time
sol2 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
    .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                             x.index.max(),
                                             freq='B'),
                               fill_value=np.nan))
)
CPU times: user 4min 13s, sys: 2.16 s, total: 4min 15s
Wall time: 4min 15s
Solution 3: recoding resample (as in @jsmart's answer)
def create_params(df):
    return (df.reset_index()
              .groupby(['orgacom', 'client'], observed=True)['date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='2016-12-31', end='2020-12-31', freq='B')
    midx = (
        (row.orgacom, row.client, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['orgacom', 'client', 'date'])

def apply_mulitindex(df, midx):
    return df.set_index(['orgacom', 'client', 'date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
%%time
sol3 = new_pipeline(to_process.reset_index())
CPU times: user 1min 46s, sys: 4.93 s, total: 1min 51s
Wall time: 1min 51s
Solution 4: groupby / resample asfreq (as in @jokab's first solution)
%%time
sol4 = to_process.groupby(['orgacom', 'client']).resample('B').asfreq()
CPU times: user 4min 22s, sys: 8.01 s, total: 4min 30s
Wall time: 4min 30s
I also noticed that resampling within a groupby can be slow. In my case, I used data reshaping to speed it up: unstacking the group column means one resample covers all groups at once.
df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M').asfreq()
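A sketch of the full round trip under that approach (the stack back to long format is my addition, not part of the snippet above): pivoting isin into columns gives one shared DatetimeIndex, so a single vectorized resample fills every group, after which the frame can be stacked back.

# Wide format: one column per isin, dates as the index.
wide = df.set_index(['isin', 'report_date'])['val'].unstack(0)

# One resample now inserts the missing month-ends for all groups at once.
monthly = wide.resample('M').asfreq()

# Back to long (isin, report_date) form, keeping the introduced NaNs.
long_form = monthly.stack(dropna=False).swaplevel().sort_index()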