 

Elegant resample for groups in Pandas

For a given pandas data frame called full_df which looks like

  index   id   timestamp    data  
 ------- ---- ------------ ------ 
      1    1   2017-01-01   10.0  
      2    1   2017-02-01   11.0  
      3    1   2017-04-01   13.0  
      4    2   2017-02-01    1.0  
      5    2   2017-03-01    2.0  
      6    2   2017-05-01    9.0  

The start and end dates (and the time delta between start and end) are varying.

But I need an id-wise resampled version (added rows marked with *)

  index   id   timestamp    data       
 ------- ---- ------------ ------ ---- 
      1    1   2017-01-01   10.0       
      2    1   2017-02-01   11.0       
      3    1   2017-03-01    NaN   *   
      4    1   2017-04-01   13.0       
      5    2   2017-02-01    1.0       
      6    2   2017-03-01    2.0       
      7    2   2017-04-01    NaN   *   
      8    2   2017-05-01    9.0  

Because the dataset is very large, I was wondering if there is a more efficient way of doing this than:

  1. Group with full_df.groupby('id')
  2. For each group df:

    df.index = pd.DatetimeIndex(df['timestamp'])
    all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
    df = df.reindex(all_days)
    
  3. Combine all groups again with a new index

That's time-consuming and not very elegant. Any ideas?

asked Sep 15 '17 by Michael Dorner


2 Answers

Using resample

In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
              .drop(['id', 'index'], axis=1).reset_index())
Out[1175]:
   id  timestamp  data
0   1 2017-01-01  10.0
1   1 2017-02-01  11.0
2   1 2017-03-01   NaN
3   1 2017-04-01  13.0
4   2 2017-02-01   1.0
5   2 2017-03-01   2.0
6   2 2017-04-01   NaN
7   2 2017-05-01   9.0

Details

In [1176]: df
Out[1176]:
   index  id  timestamp  data
0      1   1 2017-01-01  10.0
1      2   1 2017-02-01  11.0
2      3   1 2017-04-01  13.0
3      4   2 2017-02-01   1.0
4      5   2 2017-03-01   2.0
5      6   2 2017-05-01   9.0

In [1177]: df.dtypes
Out[1177]:
index                 int64
id                    int64
timestamp    datetime64[ns]
data                float64
dtype: object
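On recent pandas releases the positional axis argument to drop has been removed, and selecting the data column before resampling sidesteps the drop entirely. A self-contained variant of the answer above (sample frame rebuilt from the question):

```python
import pandas as pd

# Rebuild the sample frame from the question
full_df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'timestamp': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-04-01',
                                 '2017-02-01', '2017-03-01', '2017-05-01']),
    'data': [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# Selecting just 'data' first means no extra columns to drop afterwards
out = (full_df.set_index('timestamp')
              .groupby('id')['data']
              .resample('MS')       # month-start frequency, per id
              .asfreq()
              .reset_index())
```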
answered Sep 19 '22 by Zero


Edit to add: this way uses the min/max of dates for full_df, not df. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe, and @JohnGalt's method is better. Nevertheless, I'll leave this here as an alternate approach, as it ought to be faster than groupby/resample in cases where it is appropriate.


I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.

You could do something like this, for example:

full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)

               index  data
timestamp  id             
2017-01-01 1     1.0  10.0
           2     NaN   NaN
2017-02-01 1     2.0  11.0
           2     4.0   1.0
2017-03-01 1     NaN   NaN
           2     5.0   2.0
2017-04-01 1     3.0  13.0
           2     NaN   NaN
2017-05-01 1     NaN   NaN
           2     6.0   9.0

Just add .reset_index().set_index('id') if you want it displayed more like the table above. Note in particular the use of dropna=False with stack, which preserves the NaN placeholders. Without it, the stack/unstack round trip just leaves you back where you started.

This method automatically includes the min & max dates, and all dates present for at least one id. If there are interior timestamps missing for every id, then you need to add a resample like this:

full_df.set_index(['timestamp','id']).unstack('id')\
   .resample('MS').mean()\
   .stack('id',dropna=False)
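Putting the whole thing together as a runnable sketch (sample frame rebuilt from the question; working on the 'data' column alone drops the unneeded 'index' column). Note, per the edit above, that every id is expanded to the full Jan-May range here, so this yields 10 rows rather than 8:

```python
import pandas as pd

full_df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'timestamp': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-04-01',
                                 '2017-02-01', '2017-03-01', '2017-05-01']),
    'data': [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# Pivot ids into columns so every id is aligned on the union of timestamps,
# resample to fill months missing for all ids, then stack back keeping NaNs
wide = full_df.set_index(['timestamp', 'id'])['data'].unstack('id')
out = (wide.resample('MS').asfreq()
           .stack(dropna=False)     # keep the NaN placeholder rows
           .rename('data')
           .reset_index())
```

(Newer pandas deprecates stack's dropna argument in favor of future_stack; dropna=False still works there, with a warning.)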
answered Sep 19 '22 by JohnE