Pandas : SQL SelfJoin With Date Criteria

Question

One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.

For example, assume table1 as columns 'ID','Date', 'Var1'

In SQL I could sum var1 for the past 3 months for each record like this:

Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3

Is there any way to do this in Pandas?

jpp · Accepted Answer

It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:

        Date  ID  Var1
0 2015-01-01   1     0
1 2015-02-01   1     1
2 2015-03-01   1     2
3 2015-04-01   1     3
4 2015-05-01   1     4
5 2015-01-01   2     5
6 2015-02-01   2     6
7 2015-03-01   2     7
8 2015-04-01   2     8
9 2015-05-01   2     9

You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:

def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)

Then apply it on a GroupBy object and extract values for assignment:

df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values

print(df)

        Date  ID  Var1  Lookback_Sum
0 2015-01-01   1     0             0
1 2015-02-01   1     1             1
2 2015-03-01   1     2             3
3 2015-04-01   1     3             6
4 2015-05-01   1     4             9
5 2015-01-01   2     5             5
6 2015-02-01   2     6            11
7 2015-03-01   2     7            18
8 2015-04-01   2     8            21
9 2015-05-01   2     9            24

It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.

Another point worth mentioning is you can use GroupBy + rolling directly and possibly more efficiently by bypassing apply, but this requires ensuring your index is monotic. For example, via sort_index:

df['Lookback_Sum'] = df.set_index('Date').sort_index()\
                       .groupby('ID')['Var1'].rolling('70D').sum()\
                       .astype(int).values

Pandas : SQL SelfJoin With Date Criteria

Tags:

python

pandas

pandas-groupby

B_Miner

1 Answers

jpp

Recent Activity

Donate For Us

Pandas : SQL SelfJoin With Date Criteria

Tags:

python

pandas

pandas-groupby

B_Miner

1 Answers

jpp

Related questions

Recent Activity

Donate For Us