I want to resample a datetime-indexed DataFrame using a start date, an end date and a 'granularity'.
Say I have this dataframe:
                   value
00:00, 01/05/2017      2
12:00, 01/05/2017      4
00:00, 02/05/2017      6
12:00, 02/05/2017      8
00:00, 03/05/2017     10
12:00, 03/05/2017     12
And I want to resample it to go from 06:00, 01/05/2017 to 18:00, 02/05/2017 with a 'granularity' of 12 hours (this is the same as the original here for simplicity, but it doesn't have to be). The result I want is:
                   value
06:00, 01/05/2017      3
18:00, 01/05/2017      5
06:00, 02/05/2017      7
18:00, 02/05/2017      9
Note that each value is the mean of the original values it overlaps (e.g. 3 = mean(2, 4)).
I'm unsure how to do this.
My first attempt was:
def resample(df: DataFrame, start: datetime, end: datetime, granularity: timedelta) -> DataFrame:
    result = df.resample(granularity).mean()
    result = result[result.index <= end]
    result = result[result.index >= start]
    return result
This trims the data frame appropriately and ensures the correct granularity but doesn't align the results with the start date so the result is:
                   value
12:00, 01/05/2017      4
00:00, 02/05/2017      6
12:00, 02/05/2017      8
My second attempt used the base parameter to shift the data:
def resample(df: DataFrame, start: datetime, end: datetime, desired_granularity: timedelta) -> DataFrame:
    data_before_start = df[df.index <= start]
    # Get the last index value before our start date
    last_date_before_start = data_before_start.last_valid_index()
    current_granularity_secs = seconds_between_measurements(df)
    rule = str(int(desired_granularity.total_seconds())) + 'S'
    base = current_granularity_secs - (start - last_date_before_start).total_seconds()
    result = df.resample(rule, base=base).mean()
    result = result[result.index < end]
    result = result[result.index >= start]
    return result
This gives me:
                   value
06:00, 01/05/2017      4
18:00, 01/05/2017      6
06:00, 02/05/2017      8
18:00, 02/05/2017     10
This has the right indices but the values are backfilled from the next measurement rather than averaged from the measurements before and after.
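One possible reading of "the mean of the measurements before and after" is linear interpolation in time: at the midpoint between two measurements it equals the mean of the two neighbours. The following is a minimal sketch under that assumption, not a definitive solution. The function name resample_interpolated is hypothetical, and the sample dates are read as day/month/year.

```python
import pandas as pd


def resample_interpolated(df, start, end, granularity):
    """Interpolate df linearly in time at start, start + granularity, ..., up to end."""
    # Target timestamps aligned to the requested start date.
    target = pd.date_range(start=start, end=end, freq=granularity)
    # Add the targets to the original index (new rows are NaN),
    # interpolate using the actual time gaps, then keep only the targets.
    combined = df.reindex(df.index.union(target))
    return combined.interpolate(method="time").loc[target]


index = pd.to_datetime([
    "2017-05-01 00:00", "2017-05-01 12:00",
    "2017-05-02 00:00", "2017-05-02 12:00",
    "2017-05-03 00:00", "2017-05-03 12:00",
])
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

result = resample_interpolated(
    df,
    start=pd.Timestamp("2017-05-01 06:00"),
    end=pd.Timestamp("2017-05-02 18:00"),
    granularity=pd.Timedelta(hours=12),
)
print(result)
```

For the example data this should reproduce the desired values 3, 5, 7 and 9. If the target granularity is much coarser than the original, interpolation only looks at the two nearest measurements, so it is not the same as averaging everything inside each window.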
Does anyone have any ideas on how I can achieve what I want?
Thanks in advance for your help and just let me know if I've left out any crucial details :)
EDIT: If getting the mean is the bit that makes this very tricky, I could settle for using the value before the given time, similar to pad(). My current 'best' solution gives me the value after, like backfill()
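The pad()-style fallback mentioned in the edit can be sketched with reindex and a forward fill, which takes the last measurement at or before each target timestamp. This is only an illustration: the function name resample_padded is hypothetical, and it assumes the index is sorted in increasing order.

```python
import pandas as pd


def resample_padded(df, start, end, granularity):
    """Take the last known value at or before each aligned target timestamp."""
    target = pd.date_range(start=start, end=end, freq=granularity)
    # method="ffill" requires a monotonically increasing index.
    return df.reindex(target, method="ffill")


index = pd.to_datetime([
    "2017-05-01 00:00", "2017-05-01 12:00",
    "2017-05-02 00:00", "2017-05-02 12:00",
    "2017-05-03 00:00", "2017-05-03 12:00",
])
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

padded = resample_padded(
    df,
    start=pd.Timestamp("2017-05-01 06:00"),
    end=pd.Timestamp("2017-05-02 18:00"),
    granularity=pd.Timedelta(hours=12),
)
print(padded)
```

With the example data, the four target timestamps pick up the values 2, 4, 6 and 8, i.e. the measurement before each one.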
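First convert your start_date and end_date columns to datetime.
Then you can use .resample twice: once with start_date as the index, forward-filling, and once with end_date as the index, back-filling.
Then concatenate the two results and recompute the missing boundary of each row from the frequency.
Here is the code: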
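import pandas as pd

# freq is assumed to be defined elsewhere (a value that pd.Timedelta accepts)
df[["start_date", "end_date"]] = df[["start_date", "end_date"]].apply(pd.to_datetime)
# Forward-fill on a grid built from start_date, back-fill on one built from end_date
# (ffill() is the current name for the older pad())
df1 = df.set_index("start_date").resample(freq).ffill().reset_index()
df2 = df.set_index("end_date").resample(freq).bfill().reset_index()
df3 = pd.concat([df1, df2], ignore_index=True)

def function(x, df1):
    # Rows that came from df1 get a new end_date; rows from df2 get a new start_date
    if x.name < df1.shape[0]:
        x.end_date = x.start_date + pd.Timedelta(freq)
    else:
        x.start_date = x.end_date - pd.Timedelta(freq)
    return x

df3[df3.start_date < df3.end_date].apply(lambda x: function(x, df1), axis=1)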
The pandas documentation says it should also be possible to resample directly on a column, without setting it as the index first:
    df.resample(freq, on='start_date')
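As a small illustration of the on= keyword, here is a sketch with made-up data. The column names start_date and value and the one-day frequency are assumptions chosen for the example, not taken from the question.

```python
import pandas as pd

# Hypothetical frame with a datetime column instead of a datetime index.
df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2017-05-01 00:00", "2017-05-01 12:00",
        "2017-05-02 00:00", "2017-05-02 12:00",
    ]),
    "value": [2, 4, 6, 8],
})
freq = pd.Timedelta(days=1)

# Group rows into one-day bins keyed on the start_date column and average them.
daily_mean = df.resample(freq, on="start_date")["value"].mean()
print(daily_mean)
```

The result is indexed by the bin start, so the original frame does not need set_index beforehand.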