Given a pandas DataFrame called full_df that looks like this:
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time span between them) vary from id to id.
I need an id-wise resampled version (added rows are marked with *):
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large, I was wondering if there is a more efficient way of doing this than:
full_df.groupby('id')
For each group df, do:
df.index = pd.DatetimeIndex(df['timestamp'])
all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(all_days)
Then combine all groups again with a new index.
That's time-consuming and not very elegant. Any ideas?
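For reference, the per-group loop described above could be written out roughly as follows. This is only a sketch of the baseline being asked about: the sample frame is rebuilt by hand from the table in the question (without the extra "index" column), and the names pieces, filled and baseline are placeholders of mine.

```python
import pandas as pd

# Sample data copied from the question (the "index" column is left out).
full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

pieces = []
for group_id, group in full_df.groupby("id"):
    series = group.set_index("timestamp")["data"]
    # Month-start range from this id's first to its last timestamp.
    all_months = pd.date_range(series.index.min(), series.index.max(), freq="MS")
    # Reindexing inserts NaN for the months that were missing.
    filled = series.reindex(all_months).rename_axis("timestamp").reset_index()
    filled.insert(0, "id", group_id)
    pieces.append(filled)

# Stitch the groups back together with a fresh 0..n-1 index.
baseline = pd.concat(pieces, ignore_index=True)
print(baseline)
```

The Python-level loop runs once per id, which is where the time goes when there are many ids.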
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
.drop(columns=['id', 'index'], errors='ignore').reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
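If you want to try this without the IPython session, a self-contained sketch along the same lines might look like the block below. It assumes the frame from the question (rebuilt by hand here, without the "index" column) and selects the data column before resampling, so there is nothing left to drop afterwards; the name resampled is just a placeholder.

```python
import pandas as pd

# Sample data copied from the question (the "index" column is left out).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

resampled = (
    df.set_index("timestamp")   # resample needs a DatetimeIndex
      .groupby("id")["data"]    # keep only the value column per id
      .resample("MS")           # month-start frequency, per group
      .asfreq()                 # insert NaN rows for missing months
      .reset_index()            # back to flat id / timestamp / data columns
)
print(resampled)
```

Each id is resampled only between its own first and last timestamp, so no rows are added outside that range.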
Edit to add: this approach uses the min/max dates of the whole full_df, not of each per-id group df. If there is wide variation in start/end dates between ids, this will unfortunately inflate the dataframe, and @JohnGalt's method is better. Nevertheless, I'll leave this here as an alternative approach, as it ought to be faster than groupby/resample in cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
               index  data
timestamp  id
2017-01-01 1     1.0  10.0
           2     NaN   NaN
2017-02-01 1     2.0  11.0
           2     4.0   1.0
2017-03-01 1     NaN   NaN
           2     5.0   2.0
2017-04-01 1     3.0  13.0
           2     NaN   NaN
2017-05-01 1     NaN   NaN
           2     6.0   9.0
Just add reset_index().set_index('id') if you want it to display more like what you have above. Note in particular the use of dropna=False with stack, which preserves the NaN placeholders. Without it, the unstack/stack round trip just leaves you back where you started.
This method automatically includes the overall min and max dates, plus every date that is present for at least one id. If an interior timestamp is missing for every id, then you need to add a resample, like this:
full_df.set_index(['timestamp','id']).unstack('id')\
.resample('MS').mean()\
.stack('id',dropna=False)
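The melt/pivot variant mentioned above could be sketched like this. It is not the code from this answer, just one way the same idea might be written: it avoids the dropna argument of stack (whose handling has been changing in recent pandas releases, so check your version), and the optional last step trims the padding outside each id's own date range, which is the inflation issue noted in the edit at the top. The sample frame is rebuilt from the question without the "index" column, the names wide, tidy and trimmed are placeholders, and the trimming assumes the original data contains no genuine NaN values.

```python
import pandas as pd

# Sample data copied from the question (the "index" column is left out).
full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# One column per id, one row per timestamp; asfreq adds any month
# that is missing for every id.
wide = full_df.pivot(index="timestamp", columns="id", values="data").asfreq("MS")

# Back to long form; melt keeps the NaN placeholders.
tidy = wide.reset_index().melt(id_vars="timestamp", var_name="id", value_name="data")
tidy = tidy.sort_values(["id", "timestamp"], ignore_index=True)

# Optional: keep only rows between each id's first and last observed value.
# Assumption: a NaN here always means "row was added", never real missing data.
observed_ts = tidy["timestamp"].where(tidy["data"].notna())
first = observed_ts.groupby(tidy["id"]).transform("min")
last = observed_ts.groupby(tidy["id"]).transform("max")
trimmed = tidy[tidy["timestamp"].between(first, last)].reset_index(drop=True)

print(len(tidy), len(trimmed))
```

On the sample data the untrimmed frame has two more rows than the trimmed one (January for id 2 and May for id 1), which shows on a small scale how the wide approach pads ids with shorter date ranges.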