Pandas equivalent of Oracle Lead/Lag function

Tags:

python

pandas

First, I'm new to pandas, but I'm already falling in love with it. I'm trying to implement the equivalent of Oracle's LAG function.

Let's suppose you have this DataFrame:

Date                   Group      Data
2014-05-14 09:10:00        A         1
2014-05-14 09:20:00        A         2
2014-05-14 09:30:00        A         3
2014-05-14 09:40:00        A         4
2014-05-14 09:50:00        A         5
2014-05-14 10:00:00        B         1
2014-05-14 10:10:00        B         2
2014-05-14 10:20:00        B         3
2014-05-14 10:30:00        B         4

If this were an Oracle database and I wanted to create a lagged column partitioned by the "Group" column and ordered by Date, I could simply use this expression:

 LAG(Data,1,NULL) OVER (PARTITION BY Group ORDER BY Date ASC) AS Data_lagged

This would result in the following table:

Date                   Group     Data    Data lagged
2014-05-14 09:10:00        A        1           Null
2014-05-14 09:20:00        A        2            1
2014-05-14 09:30:00        A        3            2
2014-05-14 09:40:00        A        4            3
2014-05-14 09:50:00        A        5            4
2014-05-14 10:00:00        B        1           Null
2014-05-14 10:10:00        B        2            1
2014-05-14 10:20:00        B        3            2
2014-05-14 10:30:00        B        4            3

In pandas I can set the date to be an index and use the shift method:

db["Data_lagged"] = db.Data.shift(1)

The only issue is that this doesn't group by a column. Even if I set both Date and Group as the index, the first row of group B would still get the "5" from the last row of group A in the lagged column.
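To make the problem concrete, here is a minimal sketch that rebuilds the sample data and applies a plain shift. The frame name `df` and the use of `pd.date_range` to generate the timestamps are assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Rebuild the sample frame: nine rows, ten minutes apart, two groups.
df = pd.DataFrame({
    "Date": pd.date_range("2014-05-14 09:10", periods=9, freq="10min"),
    "Group": list("AAAAABBBB"),
    "Data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
})

# A plain shift moves the whole column down by one row and
# knows nothing about group boundaries.
naive = df["Data"].shift(1)

# The first row of group B (position 5) receives the last value of group A.
print(naive.iloc[5])
```

The value at the first B row comes from group A, which is exactly the leak that PARTITION BY prevents in Oracle.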

Is there a way to implement the equivalent of the LEAD and LAG functions in pandas?

794

asked May 14 '14 20:05

gcarmiol

2 Answers

You could perform a groupby/apply (shift) operation:

In [15]: df['Data_lagged'] = df.groupby(['Group'])['Data'].shift(1)

In [16]: df
Out[16]: 
                Date Group  Data  Data_lagged
2014-05-14  09:10:00     A     1          NaN
2014-05-14  09:20:00     A     2            1
2014-05-14  09:30:00     A     3            2
2014-05-14  09:40:00     A     4            3
2014-05-14  09:50:00     A     5            4
2014-05-14  10:00:00     B     1          NaN
2014-05-14  10:10:00     B     2            1
2014-05-14  10:20:00     B     3            2
2014-05-14  10:30:00     B     4            3

[9 rows x 4 columns]

To obtain the ORDER BY Date ASC effect, you must sort the DataFrame first:

df['Data_lagged'] = (df.sort_values(by=['Date'], ascending=True)
                       .groupby(['Group'])['Data'].shift(1))
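As a self-contained sketch of the sort-then-shift approach, the example below starts from rows in reverse order to show that the result still lines up. The sample frame and the names `unsorted` and `result` are assumptions for illustration. The assignment works because pandas aligns the shifted Series back to the frame by index label, not by position.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2014-05-14 09:10", periods=9, freq="10min"),
    "Group": list("AAAAABBBB"),
    "Data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
})

# Reverse the rows to mimic input that is not ordered by Date.
unsorted = df.iloc[::-1].copy()

# Sort by Date, then shift within each group. The result keeps the
# original index labels, so assigning it back aligns row by row.
unsorted["Data_lagged"] = (
    unsorted.sort_values(by="Date", ascending=True)
            .groupby("Group")["Data"]
            .shift(1)
)

# Put the rows back in Date order for display.
result = unsorted.sort_values("Date").reset_index(drop=True)
print(result)
```

The first row of each group is NaN, which plays the role of Oracle's NULL default.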

180

answered Oct 12 '22 13:10

unutbu

For a lead operation in pandas, just use shift(-1) instead of shift(1):

df['Data_lead'] = df.groupby(['Group'])['Data'].shift(-1)
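The other two arguments of Oracle's LAG(expr, offset, default) can be approximated as well. The sketch below assumes a pandas version in which the grouped `shift` accepts `fill_value`; the sample frame and the column name `Data_lag2` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2014-05-14 09:10", periods=9, freq="10min"),
    "Group": list("AAAAABBBB"),
    "Data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
})

grouped = df.sort_values("Date").groupby("Group")["Data"]

# LEAD(Data, 1): the next value within the same group, NaN at the group's end.
df["Data_lead"] = grouped.shift(-1)

# Roughly LAG(Data, 2, 0): look two rows back and use 0 where no row exists.
df["Data_lag2"] = grouped.shift(2, fill_value=0)

print(df)
```

Without `fill_value`, missing positions are NaN and an integer column is converted to float.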

answered Oct 12 '22 13:10

Rahul Mehta
