Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas : SQL SelfJoin With Date Criteria

One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.

For example, assume table1 as columns 'ID','Date', 'Var1'

In SQL I could sum var1 for the past 3 months for each record like this:

Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3

Is there any way to do this in Pandas?

like image 526
B_Miner Avatar asked Oct 11 '18 21:10

B_Miner


1 Answers

It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:

        Date  ID  Var1
0 2015-01-01   1     0
1 2015-02-01   1     1
2 2015-03-01   1     2
3 2015-04-01   1     3
4 2015-05-01   1     4
5 2015-01-01   2     5
6 2015-02-01   2     6
7 2015-03-01   2     7
8 2015-04-01   2     8
9 2015-05-01   2     9

You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:

def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)

Then apply it on a GroupBy object and extract values for assignment:

df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values

print(df)

        Date  ID  Var1  Lookback_Sum
0 2015-01-01   1     0             0
1 2015-02-01   1     1             1
2 2015-03-01   1     2             3
3 2015-04-01   1     3             6
4 2015-05-01   1     4             9
5 2015-01-01   2     5             5
6 2015-02-01   2     6            11
7 2015-03-01   2     7            18
8 2015-04-01   2     8            21
9 2015-05-01   2     9            24

It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.

Another point worth mentioning is you can use GroupBy + rolling directly and possibly more efficiently by bypassing apply, but this requires ensuring your index is monotic. For example, via sort_index:

df['Lookback_Sum'] = df.set_index('Date').sort_index()\
                       .groupby('ID')['Var1'].rolling('70D').sum()\
                       .astype(int).values
like image 147
jpp Avatar answered Oct 31 '22 18:10

jpp