I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.
The top of the first table, table_a, looks like this:
COMPANY_ID                 DATE  MEASURE
         1  2010-01-01 00:00:00       10
         1  2010-01-02 00:00:00       10
         1  2010-01-03 00:00:00       10
         1  2010-01-04 00:00:00       10
         1  2010-01-05 00:00:00       10
Here is the code to create the table:
import pandas as pd

table_a = pd.concat(
    [pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                   'COMPANY_ID': 1, 'MEASURE': 10}),
     pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                   'COMPANY_ID': 2, 'MEASURE': 10})])
The second table, table_b looks like this:
COMPANY             END_DATE
      1  2010-03-01 00:00:00
      1  2010-06-02 00:00:00
      2  2010-03-01 00:00:00
      2  2010-06-02 00:00:00
and the code to create it is:
table_b = pd.DataFrame({'END_DATE': pd.to_datetime(['03/01/2010', '06/02/2010',
                                                    '03/01/2010', '06/02/2010']),
                        'COMPANY': (1, 1, 2, 2)})
I want to be able to get the sum of the MEASURE column for each COMPANY_ID over the 30-day period prior to each END_DATE in table_b.
This is (I think) the SQL equivalent:
select b.COMPANY, b.END_DATE, sum(a.MEASURE) as MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY_ID = b.COMPANY
  and a.DATE < b.END_DATE
  and a.DATE > b.END_DATE - 30
group by b.COMPANY, b.END_DATE;
Thanks for any help
Well, I can think of a few ways:
1. Essentially blow up the dataframe by merging on the exact field (company)... then filter on the 30-day windows after the merge. This should be quick, but could use a lot of memory.
2. Move the merging and the filtering on the 30-day window inside a groupby(). This does a merge for each group, so it should be better on memory but slower.

Option #1
Suppose your data looks like the following (I expanded your sample data):
print df
    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows
   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15
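For reference, the expanded sample above can be built like this (a minimal sketch that just mirrors the printed values):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'company': [0] * 7 + [1] * 6,
    'date': pd.to_datetime(['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
                            '2010-03-01', '2010-03-15', '2010-04-01',
                            '2010-03-01', '2010-03-15', '2010-04-01',
                            '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10] * 7 + [5] * 6})

windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15'])})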
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                       - np.timedelta64(30, 'D'))
print windows
   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15
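(In more recent pandas you could get the same column by subtracting a Timedelta directly; a minimal sketch of the equivalent line:)

# Equivalent in newer pandas: subtract 30 days as a Timedelta
windows['beg_date'] = windows['end_date'] - pd.Timedelta(days=30)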
Now do a merge and then select the rows where date falls between beg_date and end_date:
df = df.merge(windows, on='company', how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15
You can compute the 30-day window sums by grouping on company and end_date:
print df.groupby(['company','end_date']).sum()
                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
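If you'd rather have that as a flat DataFrame (closer to the SQL output you sketched), one way, assuming df is the filtered frame from the merge above:

result = df.groupby(['company', 'end_date'], as_index=False)['measure'].sum()
result = result.rename(columns={'measure': 'measure_to_end_date'})  # column name is just illustrative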
Option #2

Move all merging into a groupby. This should be better on memory but I would think much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                       - np.timedelta64(30, 'D'))

def cond_merge(g, windows):
    g = g.merge(windows, on='company', how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge, windows)
company  end_date
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15
Another option

Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up the dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows, on=['company', 'date'], how='outer')
print df
    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15
This merge essentially inserts your window end dates into the dataframe, and then backfilling the end dates (by group) gives you a structure from which you can easily create your summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]')
                  - np.timedelta64(30, 'D'))
print df
    company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
Another alternative is to resample your first dataframe to daily data and then compute rolling sums over a 30-day window, selecting at the end only the dates you are interested in. This could be quite memory intensive too.
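A rough sketch of that idea using the newer .rolling() API in place of the old rolling_sum, assuming df and windows are the original frames from the top of this answer (the boundary convention here includes the end date itself, so you may need to tweak it to match the strict < / > conditions in your SQL):

# Resample each company's measures to a daily series, then take 30-day rolling sums
daily = (df.set_index('date')
           .groupby('company')['measure']
           .resample('D')
           .sum())                      # MultiIndex (company, date); empty days become 0

rolling = (daily.groupby(level='company')
                .transform(lambda s: s.rolling(30, min_periods=1).sum()))

# Pick out the rolling sum at each window's end date
windows['measure_to_end_date'] = [rolling.loc[(c, d)]
                                  for c, d in zip(windows['company'], windows['end_date'])]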