
Tricky slicing specifications on business-day datetimeindex

Tags:

python

pandas

I have a pandas dataframe with a business-day-based DatetimeIndex. For each month in the index, I also have a single 'marker' day specified.

Here's a toy version of that dataframe:

import pandas as pd
# a dataframe with business dates as the index
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()

# each month has a single, arbitrary marker day specified
marker_dates = [df.index[12], df.index[33], df.index[57]]

For each month in the index, I need to calculate the average of the foo column over a specific slice of rows in that month.

There are two different ways I need to be able to specify those slices:

1) m'th day to n'th day.

An example might be the 2nd to 4th business day in each month. April would then be the average of 1 (Apr 2), 4 (Apr 3), and 5 (Apr 6) = 3.33. May would be the average of 33 (May 4), 34 (May 5), and 35 (May 6) = 34. Weekends and holidays that don't occur in the index don't count as days.

2) m'th day before/after the marker date to the n'th day before/after the marker date.

An example might be "the average of the slice from 1 day before the marker date to 1 day after the marker date in each month". E.g., in April the marker date is Apr 17. Looking at the index, we want the average of Apr 16, Apr 17, and Apr 20.
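As a sanity check, the expected values in both examples can be reproduced by hand on the toy frame. This is only a minimal sketch using positional slicing with iloc; the names april_mean and marker_mean are illustrative and not part of any existing solution.

```python
import pandas as pd

# Rebuild the toy frame: daily values 0..90, collapsed onto business days
df = (
    pd.DataFrame({"foo": range(91)}, index=pd.date_range("2015-04-01", "2015-06-30"))
    .resample("B")
    .last()
)
marker_dates = [df.index[12], df.index[33], df.index[57]]

# Example 1: 2nd to 4th business day of April, i.e. row positions 1..3 in that month
april = df.loc["2015-04"]
april_mean = april["foo"].iloc[1:4].mean()

# Example 2: one row before to one row after April's marker date (Apr 17)
pos = df.index.get_loc(marker_dates[0])
marker_mean = df["foo"].iloc[pos - 1 : pos + 2].mean()

print(april_mean, marker_mean)
```

Note that a 1-based "m'th to n'th day" range corresponds to iloc[m-1:n] because iloc is 0-based and excludes the end position.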

For Example 1, I had an ugly solution: for each month I would slice out the rows of that month, and then apply df_slice.iloc[m:n].mean().

Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong. So I imagine there is a cleaner, pythonic/vectorized way to produce this result for all the months.

For Example 2, I don't know a good way to do this slice-averaging based on arbitrary dates across many months.

user3556757 asked May 25 '18 11:05



2 Answers

Use BDay() from pandas.tseries.offsets

import pandas as pd
from pandas.tseries.offsets import BDay 

M=2
N=4

# pd.datetime was removed from recent pandas releases; pd.Timestamp works everywhere
start_date = pd.Timestamp(2015, 4, 1)
end_date = pd.Timestamp(2015, 6, 30)

df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()

# for month starts
marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')

# create IntervalIndex
bins = pd.IntervalIndex.from_tuples([ (d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates ], closed='both')

df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-02, 2015-04-06]   3.333333
#[2015-05-04, 2015-05-06]  34.000000
#[2015-06-02, 2015-06-04]  63.000000


# any markers
marker_dates = [df.index[12], df.index[33], df.index[57]]

# M Bday before, and N Bday after 
bins = pd.IntervalIndex.from_tuples([ (d - M*BDay(), d + N*BDay()) for d in marker_dates ], closed='both')

df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-15, 2015-04-23]  18.428571
#[2015-05-14, 2015-05-22]  48.000000
#[2015-06-17, 2015-06-25]  81.428571
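
One caveat, offered as an assumption about the use case rather than a flaw in the approach above: BDay arithmetic counts Monday-to-Friday calendar days, so a holiday that is missing from the index would still be counted as a day. A minimal positional sketch that counts only rows present in the index could look like the following. Here positional_window_means is a hypothetical helper, and clipping windows at the edges of the frame is just an illustrative choice.

```python
import pandas as pd

df = (
    pd.DataFrame({"foo": range(91)}, index=pd.date_range("2015-04-01", "2015-06-30"))
    .resample("B")
    .last()
)
marker_dates = [df.index[12], df.index[33], df.index[57]]


def positional_window_means(frame, markers, before, after):
    """Mean of 'foo' from `before` rows before to `after` rows after each marker."""
    # get_indexer returns -1 for any marker that is not in the index
    positions = frame.index.get_indexer(markers)
    if (positions < 0).any():
        raise KeyError("every marker date must be present in the index")
    out = {}
    for marker, pos in zip(markers, positions):
        lo = max(pos - before, 0)               # clip at the first row
        hi = min(pos + after, len(frame) - 1)   # clip at the last row
        out[marker] = frame["foo"].iloc[lo : hi + 1].mean()
    return pd.Series(out, name="foo")


result = positional_window_means(df, marker_dates, before=2, after=4)
print(result)
```

On the toy data this gives the same three means as the interval-based version (18.43, 48.00, 81.43), since no holidays are missing from its index.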
jxc answered Sep 28 '22 08:09


The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.

For your example (2), this could look like:

df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
Out[8]: 
2015-04-17    17.333333
2015-05-18    47.000000
2015-06-19    80.333333
Name: foo, dtype: float64

This could be wrapped in a small function:

def window_mean_at_indices(df, indices, begin=-1, end=1):
    return df.rolling(1+end-begin).mean().shift(-end).reindex(indices)

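The same function applies to situation (1) when the window is anchored at each month's first business day. Here begin=1 and end=3 are offsets from that first business day, so they select the 2nd to 4th business days: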

month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')

month_starts
Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'],
                       dtype='datetime64[ns]', freq='BMS')

window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
Out[12]: 
2015-04-01     3.333333
2015-05-01    34.000000
2015-06-01    63.000000
Freq: BMS, Name: foo, dtype: float64
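
One thing to keep in mind is that a rolling window ignores month boundaries, so a long window anchored near a month end would run into the following month. If that matters, a grouped variant that counts rows within each month is one possible alternative. This is a sketch under that assumption; the names pos_in_month and monthly_means are illustrative.

```python
import pandas as pd

df = (
    pd.DataFrame({"foo": range(91)}, index=pd.date_range("2015-04-01", "2015-06-30"))
    .resample("B")
    .last()
)

m, n = 2, 4  # 1-based: 2nd to 4th business day present in each month

# 0-based position of every row within its own month
month = df.index.to_period("M")
pos_in_month = df.groupby(month).cumcount()

# Keep only rows whose within-month position falls in the requested range
mask = ((pos_in_month >= m - 1) & (pos_in_month <= n - 1)).to_numpy()
selected = df[mask]

# Average the kept rows month by month
monthly_means = selected["foo"].groupby(selected.index.to_period("M")).mean()
print(monthly_means)
```

On the toy data this yields 3.33, 34.0 and 63.0 for April, May and June, matching the output above.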
EFT answered Sep 28 '22 09:09