Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Groupby and lag all columns of a dataframe?

Tags:

python

pandas

I want to lag every column in a dataframe, by group. I have a frame like this:

import numpy as np
import pandas as pd

index = pd.date_range('2015-11-20', periods=6, freq='D')

df = pd.DataFrame(dict(time=index, grp=['A']*3 + ['B']*3, col1=[1,2,3]*2,
    col2=['a','b','c']*2)).set_index(['time','grp'])

which looks like

                col1 col2
time       grp           
2015-11-20 A       1    a
2015-11-21 A       2    b
2015-11-22 A       3    c
2015-11-23 B       1    a
2015-11-24 B       2    b
2015-11-25 B       3    c

and I want it to look like this:

                col1 col2 col1_lag col2_lag
time       grp                     
2015-11-20 A       1    a        2        b
2015-11-21 A       2    b        3        c
2015-11-22 A       3    c       NA       NA
2015-11-23 B       1    a        2        b
2015-11-24 B       2    b        3        c
2015-11-25 B       3    c       NA       NA

This question manages the result for a single column, but I have an arbitrary number of columns, and I want to lag all of them. I can use groupby and apply, but apply runs the shift function over each column independently, and it doesn't seem to like receiving an [nrow, 2] shaped dataframe in return. Is there perhaps a function like apply that acts on the whole group sub-frame? Or is there a better way to do this?

like image 911
naught101 Avatar asked Nov 25 '15 02:11

naught101


People also ask

Can you Groupby multiple columns in pandas?

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.

How do you lag variables in Python?

Create lag variables, using the shift function. shift(1) creates a lag of a single record, while shift(5) creates a lag of five records. This creates a lag variable based on the prior observations, but shift can also take a time offset to specify the time to use in shift.

Does Groupby preserve index?

The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.


1 Answers

IIUC, you can simply use level="grp" and then shift by -1:

>>> shifted = df.groupby(level="grp").shift(-1)
>>> df.join(shifted.rename(columns=lambda x: x+"_lag"))
                col1 col2  col1_lag col2_lag
time       grp                              
2015-11-20 A       1    a         2        b
2015-11-21 A       2    b         3        c
2015-11-22 A       3    c       NaN      NaN
2015-11-23 B       1    a         2        b
2015-11-24 B       2    b         3        c
2015-11-25 B       3    c       NaN      NaN
like image 64
DSM Avatar answered Oct 21 '22 03:10

DSM