First, I'm new to pandas, but I'm already falling in love with it. I'm trying to implement the equivalent of the LAG function from Oracle.
Let's suppose you have this DataFrame:
Date Group Data
2014-05-14 09:10:00 A 1
2014-05-14 09:20:00 A 2
2014-05-14 09:30:00 A 3
2014-05-14 09:40:00 A 4
2014-05-14 09:50:00 A 5
2014-05-14 10:00:00 B 1
2014-05-14 10:10:00 B 2
2014-05-14 10:20:00 B 3
2014-05-14 10:30:00 B 4
If this were an Oracle database and I wanted a lag grouped by the "Group" column and ordered by Date, I could simply use this expression:
LAG(Data,1,NULL) OVER (PARTITION BY Group ORDER BY Date ASC) AS Data_lagged
This would result in the following table:
Date Group Data Data_lagged
2014-05-14 09:10:00 A 1 Null
2014-05-14 09:20:00 A 2 1
2014-05-14 09:30:00 A 3 2
2014-05-14 09:40:00 A 4 3
2014-05-14 09:50:00 A 5 4
2014-05-14 10:00:00 B 1 Null
2014-05-14 10:10:00 B 2 1
2014-05-14 10:20:00 B 3 2
2014-05-14 10:30:00 B 4 3
In pandas I can set the date to be an index and use the shift method:
db["Data_lagged"] = db.Data.shift(1)
The only issue is that this doesn't group by a column. Even if I set the two columns Date and Group as indexes, I would still get the "5" in the lagged column.
Is there a way to implement the equivalent of the LEAD and LAG functions in pandas?
In Python, the pandas library can create lags and leads of a column in a few lines of code. A lag shifts a column's values down by a given number of rows; a lead shifts them up.
In SQL, LAG and LEAD are window (analytic) functions that return the value of their expression argument for the row at a specified offset from the current row within the current window partition.
You could perform a groupby followed by a shift:
In [15]: df['Data_lagged'] = df.groupby(['Group'])['Data'].shift(1)
In [16]: df
Out[16]:
Date Group Data Data_lagged
2014-05-14 09:10:00 A 1 NaN
2014-05-14 09:20:00 A 2 1
2014-05-14 09:30:00 A 3 2
2014-05-14 09:40:00 A 4 3
2014-05-14 09:50:00 A 5 4
2014-05-14 10:00:00 B 1 NaN
2014-05-14 10:10:00 B 2 1
2014-05-14 10:20:00 B 3 2
2014-05-14 10:30:00 B 4 3
[9 rows x 4 columns]
To obtain the ORDER BY Date ASC effect, you must sort the DataFrame first:
df['Data_lagged'] = (df.sort_values(by=['Date'], ascending=True)
.groupby(['Group'])['Data'].shift(1))
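As a self-contained sketch of the sort-then-shift approach, the snippet below rebuilds a shortened version of the question's data with the rows deliberately shuffled (an assumption made only to show that the sort matters). The shifted Series keeps the original index labels, so assigning it back to df lines each value up with its own row.

```python
import pandas as pd

# Shortened sample of the question's data, shuffled on purpose.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-05-14 09:30:00", "2014-05-14 09:10:00", "2014-05-14 09:20:00",
        "2014-05-14 10:10:00", "2014-05-14 10:00:00",
    ]),
    "Group": ["A", "A", "A", "B", "B"],
    "Data": [3, 1, 2, 2, 1],
})

# Sort by Date, then shift within each group (PARTITION BY Group ORDER BY Date).
lagged = df.sort_values(by="Date").groupby("Group")["Data"].shift(1)

# Assignment aligns on the index, so the original row order of df is preserved.
df["Data_lagged"] = lagged

print(df.sort_values(["Group", "Date"]))
```

The first row of each group gets NaN, which plays the role of NULL in the Oracle output.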
For a lead operation in pandas, just use shift(-1) instead of shift(1):
df['Data_lead'] = df.groupby(['Group'])['Data'].shift(-1)
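Oracle's LAG(expr, offset, default) also takes an offset and a default value. One possible way to mirror that is sketched below; the helper names lag and lead and their parameters are hypothetical, not part of pandas, and the sketch assumes a pandas version whose groupby shift accepts fill_value.

```python
import pandas as pd

def lag(df, column, by, order_by, offset=1, default=None):
    """Hypothetical helper: value of `column` from `offset` rows earlier
    within each `by` group, ordered by `order_by`."""
    grouped = df.sort_values(by=order_by).groupby(by)[column]
    if default is None:
        # Missing neighbours become NaN, like NULL in SQL.
        return grouped.shift(offset)
    # fill_value replaces the missing neighbours with the given default.
    return grouped.shift(offset, fill_value=default)

def lead(df, column, by, order_by, offset=1, default=None):
    """Hypothetical helper: a lead is a lag with a negative offset."""
    return lag(df, column, by, order_by, offset=-offset, default=default)

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-05-14 09:10:00", "2014-05-14 09:20:00", "2014-05-14 09:30:00",
        "2014-05-14 10:00:00", "2014-05-14 10:10:00",
    ]),
    "Group": ["A", "A", "A", "B", "B"],
    "Data": [1, 2, 3, 1, 2],
})

df["Data_lead"] = lead(df, "Data", by="Group", order_by="Date")
df["Data_lag0"] = lag(df, "Data", by="Group", order_by="Date", default=0)
print(df)
```

Here Data_lead ends each group with NaN, while Data_lag0 starts each group with 0 instead of NaN.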