
Pandas groupby resample poor performance

My problem

I'm having trouble with the performance of the resample function in combination with a groupby. The operation currently takes 8+ seconds on a data sample of 5000 rows, which is far too slow for my requirements.

Sample data (500 rows)

Pastebin with data as dict: https://pastebin.com/RPNdhXsy


The logic

I have data with dates at quarterly intervals, which I want to group by a column and then resample to monthly frequency within each group.

Input:
     isin  report_date   val
    SE001   2018-12-31     1
    SE001   2018-09-30     2
    SE001   2018-06-30     3
    US001   2018-10-31     4
    US001   2018-07-31     5

Output:
    isin   report_date      val        
    SE001   2018-12-31        1
            2018-11-30      NaN
            2018-10-31      NaN
            2018-09-30        2
            2018-08-31      NaN
            2018-07-31      NaN
            2018-06-30        3
    US001   2018-10-31        4
            2018-09-30      NaN
            2018-08-31      NaN
            2018-07-31        5
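
For reference, the toy input above can be reconstructed like this (the full 500-row sample is in the pastebin):

import pandas as pd

df = pd.DataFrame({
    'isin': ['SE001', 'SE001', 'SE001', 'US001', 'US001'],
    'report_date': pd.to_datetime(['2018-12-31', '2018-09-30', '2018-06-30',
                                   '2018-10-31', '2018-07-31']),
    'val': [1, 2, 3, 4, 5],
})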

I used to have this operation:

df.groupby('isin').resample('M', on="report_date").first()[::-1]

Since asfreq() seems to perform slightly better than passing on= to resample, I currently do the following instead. It's still slow, though. I reverse the result because resample unavoidably sorts the dates in ascending order.

df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]

As stated, with 5000 rows and around 16 columns this takes 15 seconds to run, since I need to do it on two separate dataframes. With the sample data in the pastebin (500 rows) the operation takes 0.7s, which is far too long given that my final data will have 800k rows.

EDIT: Timing of the different operations

Current way

setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s

Old way

groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
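
(A minimal harness like the following can reproduce such per-step timings; the timed helper is just for illustration:)

import time

def timed(label, thunk):
    # Rough single-run wall-clock timing of one pipeline step.
    t0 = time.time()
    out = thunk()
    print(label, '---', time.time() - t0, 'seconds ---')
    return out

indexed = timed('setindex', lambda: df.set_index('report_date'))
grouped = timed('groupby', lambda: indexed.groupby('isin'))
resampled = timed('resample', lambda: grouped.resample('M'))
framed = timed('asfreq', lambda: resampled.asfreq())
result = timed('[::-1]', lambda: framed[::-1])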

Judging by these timings, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby to a DataFrame is what takes so long. Now what?

EDIT2: Using reindex

(df.set_index('report_date')
   .groupby('isin')
   .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'),
                              fill_value=0))
   [::-1])

This takes 0.28s, which is a vast improvement, presumably because it skips building a resampler per group. Note that it fills the gaps with 0 rather than NaN. Still not very good, though.


How can I speed this up? Is there another way to do the same thing?

asked Aug 19 '19 by Jokab

3 Answers

I cut execution time for a 25k-row test data set from 850 ms to 320 ms. I wrapped the reindex logic in a function to make timing easier:

def orig_pipeline(df):
    return (df
            .set_index('report_date')
            .groupby('isin')
            .apply(lambda x: x.reindex(pd.date_range(x.index.min(), 
                                                     x.index.max(), 
                                                     freq='M'), 
                                       fill_value=0))
            [::-1])

Then, I created new functions to make date arithmetic and reindexing faster:

def create_params(df):
    return (df.groupby('isin')['report_date']
            .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='1999-12-31', end='2020-12-31', freq='M')
    midx = (
        (row.isin, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['isin', 'report_date'])

def apply_multiindex(df, midx):
    return df.set_index(['isin', 'report_date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_multiindex(df, midx)

Old and new pipelines give the same results (except possibly sort order):

v1 = orig_pipeline(df).drop(columns='isin').sort_index()
v2 = new_pipeline(df).sort_index().fillna(0)
assert (v1 == v2).all().all()
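
(df_big isn't shown here; a hypothetical way to build a comparable 25k-row synthetic set, 5000 isins with five quarterly report dates each, would be:)

import numpy as np
import pandas as pd

# Hypothetical stand-in for the 25k-row test set used in the timings below.
isins = ['ID%04d' % i for i in range(5000)]
quarters = pd.date_range('2017-03-31', '2018-03-31', freq='Q')  # five quarter-ends
df_big = pd.DataFrame({
    'isin': np.repeat(isins, len(quarters)),
    'report_date': np.tile(quarters, len(isins)),
    'val': np.random.default_rng(0).normal(size=len(isins) * len(quarters)),
})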

Timing results:

%%timeit
v1 = orig_pipeline(df_big)
854 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
v2 = new_pipeline(df_big)
322 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
answered Nov 20 '22 by jsmart

I would like to share the experiments I ran to figure out which solution performs best, and they show that @jsmart's is the fastest.

My dataset looks like the following (it was originally shown as a screenshot, as I could not manage to paste a pretty table):

[Screenshot: the DataFrame to process]

My goal is to resample the indicators to business-day frequency for each (orgacom, client) pair.

Solution 1: groupby / apply asfreq

%%time
sol1 = (
    to_process
    .groupby(['orgacom', 'client'], observed=True)
    .apply(lambda x: x.asfreq('B', fill_value=np.nan))
)

CPU times: user 4min 6s, sys: 2.91 s, total: 4min 9s
Wall time: 4min 9s

Solution 2: groupby / apply reindex (as in @Jokab's EDIT2)

%%time
sol2 = (
    to_process
    .groupby(['orgacom', 'client'], observed=True)
    .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='B'),
                               fill_value=np.nan))
)

CPU times: user 4min 13s, sys: 2.16 s, total: 4min 15s
Wall time: 4min 15s

Solution 3: recoding resample (as in @jsmart's answer)

def create_params(df):
    return (df.reset_index().groupby(['orgacom', 'client'], observed=True, )['date']
            .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='2016-12-31', end='2020-12-31', freq='B')
    midx = (
        (row.orgacom, row.client, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['orgacom', 'client', 'date'])

def apply_multiindex(df, midx):
    return df.set_index(['orgacom', 'client', 'date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_multiindex(df, midx)

%%time
sol3 = new_pipeline(to_process.reset_index())

CPU times: user 1min 46s, sys: 4.93 s, total: 1min 51s
Wall time: 1min 51s

Solution 4: groupby / resample asfreq (as in @Jokab's first solution)

%%time
sol4 = to_process.groupby(['orgacom', 'client']).resample('B').asfreq()

CPU times: user 4min 22s, sys: 8.01 s, total: 4min 30s
Wall time: 4min 30s

answered Nov 20 '22 by Pierre Massé


I also noticed that resampling on a groupby can be slow. In my case, I used data reshaping for a speed-up:

df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M').asfreq()
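
This produces a wide frame, one column per isin, with a monthly DatetimeIndex (the .asfreq() materializes the resample). To get back the long (isin, report_date) layout, one might stack it again (a sketch, assuming a single val column):

wide = df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M').asfreq()
# Stack isin back into the index, keeping the NaN rows the resample introduced.
long_form = wide.stack(dropna=False).swaplevel().sort_index(ascending=[True, False])

Note that this spans the full common date range for every isin, unlike the per-group ranges produced by groupby().resample().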
answered Nov 20 '22 by Mark Wang