
pandas - Resample a dataframe using a specified start date, end date and granularity

Tags: python, pandas

I want to resample a datetime-indexed dataframe using a start date, an end date and a 'granularity'.

Say I have this dataframe:

                   value
00:00, 01/05/2017    2
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8
00:00, 03/05/2017   10
12:00, 03/05/2017   12
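
For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is only a sketch: it assumes the dates are written day/month/year (so 01/05/2017 is 1 May 2017) and that the single column is called value, as shown.

```python
import pandas as pd

# Six measurements, 12 hours apart, starting at midnight on 1 May 2017
# (assumes the dates in the question are day/month/year).
index = pd.date_range(start="2017-05-01 00:00", periods=6, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

print(df)
```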

And I want to resample it to run from 06:00, 01/05/2017 to 18:00, 02/05/2017 with a 'granularity' of 12 hours (this is the same as the original here for simplicity, but it doesn't have to be). The result I want is:

                   value
06:00, 01/05/2017    3
18:00, 01/05/2017    5
06:00, 02/05/2017    7
18:00, 02/05/2017    9

Note that each value is the mean of the original values it overlaps (e.g. 3 = mean(2, 4)).

I'm unsure how to do this.

My first attempt was:

def resample(df: DataFrame, start: datetime, end: datetime, granularity: timedelta) -> DataFrame:
    result = df.resample(granularity).mean()
    result = result[result.index <= end]
    result = result[result.index >= start]
    return result

This trims the data frame appropriately and ensures the correct granularity, but it doesn't align the results with the start date, so the result is:

                   value
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8

My second attempt used the base parameter to shift the data:

def resample(df: DataFrame, start: datetime, end: datetime, desired_granularity: timedelta) -> DataFrame:
    data_before_start = df[df.index <= start]
    # Get the last index value before our start date
    last_date_before_start = data_before_start.last_valid_index()
    current_granularity_secs = seconds_between_measurements(df)
    rule = str(int(desired_granularity.total_seconds())) + 'S'
    base = current_granularity_secs - (start - last_date_before_start).total_seconds()
    result = df.resample(rule, base=base).mean()
    result = result[result.index < end]
    result = result[result.index >= start]
    return result

This gives me:

                   value
06:00, 01/05/2017    4
18:00, 01/05/2017    6
06:00, 02/05/2017    8
18:00, 02/05/2017   10

This has the right indices but the values are backfilled from the next measurement rather than averaged from the measurements before and after.
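
The shifted bins explain these numbers: each 12-hour bin that starts at 06:00 or 18:00 contains exactly one original measurement (the one taken at 12:00 or 00:00), so the mean of the bin is simply that later value. The sketch below reproduces the effect with the origin argument of resample, which pandas 1.1 and later provide as a replacement for base (base is deprecated from that version on and, as far as I know, removed in pandas 2.0). The frame and the start and end values are assumptions reconstructed from the example above.

```python
import pandas as pd

step = pd.Timedelta(hours=12)
index = pd.date_range("2017-05-01 00:00", periods=6, freq=step)
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

start = pd.Timestamp("2017-05-01 06:00")
end = pd.Timestamp("2017-05-02 18:00")

# Anchor the bin edges on `start` instead of midnight.
# Each bin [t, t + 12h) then holds a single original measurement.
binned = df.resample(step, origin=start).mean()
binned = binned[(binned.index >= start) & (binned.index <= end)]

print(binned)
```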

Does anyone have any ideas on how I can achieve what I want?

Thanks in advance for your help and just let me know if I've left out any crucial details :)

EDIT: If getting the mean is the bit that makes this very tricky, I could settle for using the value before the given time, similar to pad(). My current 'best' solution gives me the value after, like backfill()
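
One possible way to get any of the three behaviours is to build the target timestamps explicitly and look up the neighbouring measurements with reindex. This is a minimal sketch, not a definitive solution: resample_between and its how parameter are hypothetical names, the index is assumed to be sorted, and "mean" here means the average of the nearest measurement at or before and the nearest at or after each target time, which may not be what is wanted when the new granularity spans several original measurements.

```python
import pandas as pd

def resample_between(df, start, end, granularity, how="mean"):
    """Hypothetical helper: sample df on a regular grid from start to end."""
    # Target timestamps: start, start + granularity, ... up to end.
    target = pd.date_range(start=start, end=end, freq=granularity)
    before = df.reindex(target, method="ffill")  # last value at or before each target
    after = df.reindex(target, method="bfill")   # first value at or after each target
    if how == "pad":
        return before
    if how == "backfill":
        return after
    # Default: average the measurement before and the measurement after.
    return (before + after) / 2

step = pd.Timedelta(hours=12)
index = pd.date_range("2017-05-01 00:00", periods=6, freq=step)
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

start = pd.Timestamp("2017-05-01 06:00")
end = pd.Timestamp("2017-05-02 18:00")

result = resample_between(df, start, end, step)
padded = resample_between(df, start, end, step, how="pad")
print(result)
```

With the example data this gives 3, 5, 7, 9 for the mean and 2, 4, 6, 8 for the pad variant.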

asked Nov 07 '22 22:11 by duck



1 Answer

First, convert your start_date and end_date columns to datetime. Then you can use .resample twice:

  • On df.start_date, with forward filling
  • On df.end_date, with backward filling

Then:

  • Concatenate the two results
  • Keep only the rows where start_date < end_date
  • Apply a function to each row to update start_date and end_date

Here is the code:

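import pandas as pd

# freq is assumed to be defined already, e.g. freq = pd.Timedelta(hours=12)

# Make sure both columns hold datetimes
df[["start_date", "end_date"]] = df[["start_date", "end_date"]].apply(pd.to_datetime)
# Resample on each column: forward fill from start_date, backward fill from end_date
df1 = df.set_index("start_date").resample(freq).ffill().reset_index()
df2 = df.set_index("end_date").resample(freq).bfill().reset_index()
df3 = pd.concat([df1, df2], ignore_index=True)

def function(x, df1):
    # Rows that came from df1 keep their start_date; rows from df2 keep their end_date
    if x.name < df1.shape[0]:
        x.end_date = x.start_date + pd.Timedelta(freq)
    else:
        x.start_date = x.end_date - pd.Timedelta(freq)
    return x

df3[df3.start_date < df3.end_date].apply(lambda x: function(x, df1), axis=1)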

The pandas documentation says it should also be possible to resample on a column directly:

df.resample(freq, on='start_date')
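
As a small illustration of the on argument, the sketch below resamples a made-up frame on its start_date column without setting it as the index first. The data, the value column and the daily frequency are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical frame with a start_date column (name taken from the answer above).
df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2017-05-01 00:00", "2017-05-01 12:00",
        "2017-05-02 00:00", "2017-05-02 12:00",
    ]),
    "value": [2, 4, 6, 8],
})
freq = pd.Timedelta(days=1)  # assumed frequency

# Group rows into daily bins based on the start_date column and average them.
daily = df.resample(freq, on="start_date")["value"].mean()

print(daily)
```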

answered Nov 14 '22 21:11 by Mathia Haure-Touzé
