For a dataframe with no missing values, this would be as easy as df.diff(periods=24, axis=0)
. But how is it possible to connect the calculations to the index values?
Reproducible dataframe - Code:
# Imports
import pandas as pd
import numpy as np
# A dataframe with two variables, random numbers and hourly time series
np.random.seed(123)
rows = 36
rng = pd.date_range('1/1/2017', periods=rows, freq='H')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['A', 'B'])
df = df.set_index(rng)
Reproducible dataframe - Screenshot:
Desired output - Code:
# Running difference step = 24
df = df.diff(periods=24, axis=0)
df = df.dropna(axis=0, how='all')
Desired output - Screenshot
The real challenge
The problem is that my real-world examples are full of missing values. So I'll have to connect the difference intervals with the index values, and I have no Idea how. I've tried a few solutions with filling in the missing hours in the index first, and then running the differences like before, but it's not very elegant.
Thank you for any suggestions!
Edit - As requested in the comments, here's my best attempt for a bit longer time period:
df_missing = df.drop(df.index[[2,3]])
newIndex = pd.date_range(start = '1/1/2017', end = '1/3/2017', freq='H')
df_missing = df_missing.reindex(newIndex, fill_value = np.nan)
df_refilled = df_missing.diff(periods=24, axis=0)
Compared to the other suggestions, I would say that this is not very elegant =)
I think maybe you can use groupby
df.groupby(df.index.hour).diff().dropna()
Out[784]:
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With