I have a time-series in a Pandas DataFrame. The timestamps can be uneven (one every 1-5 minutes), but there will always be one every 5 minutes (timestamps with minutes ending in 0,5,10,15,20,25,30,35,40,45,50,55).
Example:
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
I want to get the time-weighted averages of blocks of 15 minutes. The rows with a time stamp that is directly on a 15 minute mark (timestamps with minutes ending in 0,15,30,45) end an interval, so the grouping is as follows:
Group 1 (interval 2017-01-01 2:00:00):
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
Group 2 (interval 2017-01-01 2:15:00):
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
Group 3 (interval 2017-01-01 2:30:00):
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
Group 4 (interval 2017-01-01 2:45:00):
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
The average MUST be time-weighted, so not just a standard average of all values in a group.
For instance, the time-weighted average of group 2 is NOT 72.785, which is the regular average of all 4 values. Rather, it should be:
(5 minutes / 15 minutes) * 17.33 = 5.776667 ==> The 5 minutes is taken from the difference between this timestamp and the previous timestamp
+(5 minutes / 15 minutes) * 23.40 = 7.8
+(3.75 minutes / 15 minutes) * 150.12 = 37.53
+(1.25 minutes / 15 minutes) * 100.29 = 8.3575
= **59.46417**
Also ideally, the 15 minutes is parameterized, as this might change in the future to 60 minutes (hourly), but I don't think this is an issue here.
Also, performance is quite important in this. As my dataset will have around 10k rows, so iterating over each record one by one will be pretty slow.
I tried looking into the df.rolling() function of Pandas, but couldn't figure out how to apply it directly to my specific scenario.
Thanks a lot for the help!
UPDATE 1:
Following Simon's brilliant solution, I modified it a little bit.
I made a few tweaks to it to adapt it to my specific case:
def func(df):
if df.size == 0: return
timestep = 15*60
indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
seconds = indexes.seconds
weight = [seconds[n]/timestep if n == 0 else (seconds[n] - seconds[n - 1])/timestep
for n, k in enumerate(seconds)]
return np.sum(weight*df.values)
This is to cope with possibly empty 15 minute intervals (missing rows in the DB)
This one was tricky. I would love to see another commenter do this more efficiently, since I have a hunch there is a better way to do this.
I also skipped one part, which is parameterizing the 15 minute value, but I point out how you might do it in the comment. This is left as an exercise for the reader :D It should be parameterized though, as it stands now there are a lot of random '*15' and '*60' values scattered around the place, which looks clumsy.
I'm also tired, and my wife wants to watch a movie, so I didn't clean up my code. It's a little messy, and should be written cleaner -- which may or may not be worth doing, depending on if someone else can redo this all in 6 lines of code. If tomorrow morning it is still unanswered, I'll go back through and do this better.
def func(df):
timestep = 15*60
seconds = (df.index.minute*60+df.index.second)-timestep
weight = [k/timestep if n == 0 else (seconds[n] - seconds[n - 1])/timestep
for n, k in enumerate(seconds)]
return np.sum(weight*df.values)
df.resample('15min', closed='right').apply(func)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With