pandas

Question

I want to resample a datetime-indexed DataFrame using a start date, an end date and a 'granularity'.

Say I have this dataframe:

                   value
00:00, 01/05/2017    2
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8
00:00, 03/05/2017   10
12:00, 03/05/2017   12

And I want to resample it to go from 06:00, 01/05/2017 to 18:00, 02/05/2017 with a 'granularity' of 12 hours (this is the same as the original here for simplicity, but it doesn't have to be). The result I want is:

                   value
06:00, 01/05/2017    3
18:00, 01/05/2017    5
06:00, 02/05/2017    7
18:00, 02/05/2017    9

Note that each value is the mean of the values it overlaps (e.g. 3 = mean(2, 4)).

I'm unsure how to do this.

My first attempt was:

from datetime import datetime, timedelta

from pandas import DataFrame

def resample(df: DataFrame, start: datetime, end: datetime, granularity: timedelta) -> DataFrame:
    result = df.resample(granularity).mean()
    result = result[result.index <= end]
    result = result[result.index >= start]
    return result

This trims the data frame appropriately and ensures the correct granularity but doesn't align the results with the start date so the result is:

                   value
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8

My second attempt used the base parameter to shift the data:

def resample(df: DataFrame, start: datetime, end: datetime, desired_granularity: timedelta) -> DataFrame:
    data_before_start = df[df.index <= start]
    # Get the last index value before our start date
    last_date_before_start = data_before_start.last_valid_index()
    current_granularity_secs = seconds_between_measurements(df)
    rule = str(int(desired_granularity.total_seconds())) + 'S'
    base = current_granularity_secs - (start - last_date_before_start).total_seconds()
    result = df.resample(rule, base=base).mean()
    result = result[result.index < end]
    result = result[result.index >= start]
    return result

This gives me:

                   value
06:00, 01/05/2017    4
18:00, 01/05/2017    6
06:00, 02/05/2017    8
18:00, 02/05/2017   10

This has the right indices but the values are backfilled from the next measurement rather than averaged from the measurements before and after.
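
For reference, in pandas 1.1 and later the `base` argument is superseded by `origin` and `offset`. The following is a minimal sketch, assuming the sample data above with dates read as day/month/year. It reproduces the result of the second attempt and suggests why the values look backfilled: each bin covers the half-open interval from its label to 12 hours later, so the 06:00 bin contains only the 12:00 measurement.

```python
import pandas as pd

# Sample data from the question (dates assumed to be day/month/year).
idx = pd.to_datetime([
    "2017-05-01 00:00", "2017-05-01 12:00",
    "2017-05-02 00:00", "2017-05-02 12:00",
    "2017-05-03 00:00", "2017-05-03 12:00",
])
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=idx)

start = pd.Timestamp("2017-05-01 06:00")
end = pd.Timestamp("2017-05-02 18:00")

# origin anchors the bin edges at `start` instead of at midnight.
binned = df.resample(pd.Timedelta(hours=12), origin=start).mean()

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
binned = binned.loc[start:end]
print(binned)
```

The bins are aligned with the start date, but each one still averages only the measurements that fall inside it, so this alone does not give the 3, 5, 7, 9 asked for.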

Does anyone have any ideas on how I can achieve what I want?

Thanks in advance for your help and just let me know if I've left out any crucial details :)

EDIT: If getting the mean is the bit that makes this very tricky, I could settle for using the value before the given time, similar to pad(). My current 'best' solution gives me the value after, like backfill()
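
Both options mentioned in this edit can be sketched by reindexing onto the target timestamps rather than resampling. This is not taken from the answer below: `resample_at` and its `how` argument are hypothetical names, and the "mean" is implemented as linear interpolation in time, which equals the mean of the two neighbouring measurements only when the target time lies exactly halfway between them (as in the example data).

```python
import pandas as pd

def resample_at(df, start, end, granularity, how="interpolate"):
    """Hypothetical helper: sample df at start, start + granularity, ... up to end."""
    target = pd.date_range(start=start, end=end, freq=granularity)
    if how == "pad":
        # Take the last measurement at or before each target time.
        return df.reindex(target, method="ffill")
    # Add the target times to the index, interpolate linearly in time,
    # then keep only the target rows.
    combined = df.reindex(df.index.union(target))
    return combined.interpolate(method="time").loc[target]

# Sample data from the question (dates assumed to be day/month/year).
idx = pd.to_datetime([
    "2017-05-01 00:00", "2017-05-01 12:00",
    "2017-05-02 00:00", "2017-05-02 12:00",
    "2017-05-03 00:00", "2017-05-03 12:00",
])
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=idx)
start = pd.Timestamp("2017-05-01 06:00")
end = pd.Timestamp("2017-05-02 18:00")
step = pd.Timedelta(hours=12)

interpolated = resample_at(df, start, end, step)
padded = resample_at(df, start, end, step, how="pad")
print(interpolated)
print(padded)
```

On the sample data, the interpolated variant should give 3, 5, 7, 9 (up to floating-point rounding) and the padded variant 2, 4, 6, 8. If the desired granularity is coarser than the original, interpolation only looks at the two nearest measurements, so a true average over each window would need a different approach.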

Mathia Haure-Touzé · Accepted Answer

First, make sure your start_date and end_date columns are datetimes. Then you can use .resample twice:

On df.start_date, with forward filling
On df.end_date, with backward filling

Then:

Concatenate the two results
Keep the rows where start_date < end_date
Apply a function to each row to update start_date and end_date

Here is the code:

import pandas as pd

# Assumes df has start_date and end_date columns and freq is the target frequency string.
df[["start_date", "end_date"]] = df[["start_date", "end_date"]].apply(pd.to_datetime)
# ffill() replaces the deprecated pad()
df1 = df.set_index("start_date").resample(freq).ffill().reset_index()
df2 = df.set_index("end_date").resample(freq).bfill().reset_index()
df3 = pd.concat([df1, df2], ignore_index=True)

def function(x, df1):
    # Rows that came from df1 occupy the first df1.shape[0] positions of df3
    if x.name < df1.shape[0]:
        x.end_date = x.start_date + pd.Timedelta(freq)
    else:
        x.start_date = x.end_date - pd.Timedelta(freq)
    return x

result = df3[df3.start_date < df3.end_date].apply(lambda x: function(x, df1), axis=1)

The pandas documentation says it should also be possible to resample directly on a column:

df.resample(freq, on='start_date')
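
As a small illustration of the `on` keyword, here is a sketch with made-up data. The column name `start_date` follows the answer; the values and the daily frequency are assumptions chosen for the example.

```python
import pandas as pd

# Made-up frame with a datetime column instead of a datetime index.
df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2017-05-01 00:00", "2017-05-01 12:00", "2017-05-02 00:00",
    ]),
    "value": [2, 4, 6],
})

freq = "1D"  # assumed frequency for this example

# on= tells resample to bin by the start_date column rather than the index.
daily = df.resample(freq, on="start_date")["value"].mean()
print(daily)
```

The result is indexed by the resampled `start_date` values, so no `set_index` call is needed beforehand.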

pandas - Resample a dataframe using a specified start date, end date and granularity

Tags:

python

duck
