I'm having troubles with grouping a pandas df by time range and different calculations by column:
Let's take following df:
date identifier value_1 value_2
0 05.07.2018 16:35 A 10 0
1 05.07.2018 16:36 B 20 1
2 05.07.2018 16:37 A 20 2
3 05.07.2018 16:39 B 30 1
4 05.07.2018 16:40 A 40 3
5 05.07.2018 16:41 B 20 2
6 05.07.2018 16:41 A 30 1
7 05.07.2018 16:42 B 50 2
8 05.07.2018 16:43 B 20 3
9 05.07.2018 16:44 A 20 1
As a result I need a df, which is grouped by time in 5 mins intervals and by identifier, with the average of value_1 and the sum of value_2:
date identifier value_1 value_2
0 05.07.2018 16:35 A 15 2
1 05.07.2018 16:35 B 25 2
2 05.07.2018 16:40 A 30 5
3 05.07.2018 16:40 B 30 7
How can I do this the most efficient way in pandas?
THX & BR from Vienna
you can use groupby
, pd.Grouper
, and agg
, after setting your date
column to datetime
with the proper format:
# Set date to datetime format. I'm assuming it's day.month.year in your original dataframe
df['date'] = pd.to_datetime(df.date, format = '%d.%m.%Y %H:%M')
new_df = (df.groupby(['identifier', pd.Grouper(key='date', freq='5min')])
.agg({'value_1':'mean', 'value_2':'sum'}))
>>> new_df
value_1 value_2
identifier date
A 2018-07-05 16:35:00 15 2
2018-07-05 16:40:00 30 5
B 2018-07-05 16:35:00 25 2
2018-07-05 16:40:00 30 7
If you want the same format as your desired output in your post, you can use this to sort:
new_df.reset_index().sort_values(['date','identifier'])
identifier date value_1 value_2
0 A 2018-07-05 16:35:00 15 2
2 B 2018-07-05 16:35:00 25 2
1 A 2018-07-05 16:40:00 30 5
3 B 2018-07-05 16:40:00 30 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With