Pandas DataFrame: How to calculate the difference between the first row and last row in each group?

Tags: python, pandas, dataframe

Here is my pandas DataFrame:

import pandas as pd
import numpy as np

data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
         "column2": [np.nan, 1, 1, 1, 1, np.nan, np.nan, 2, 2, np.nan, np.nan, 3, 3, 3, 3, 3, np.nan, np.nan], ...}

df = pd.DataFrame(data)
df
>>>      column1  column2
0        338      NaN
1        519      1.0
2        871      1.0
3       1731      1.0
4       2693      1.0
5       2963      NaN
6       3379      NaN
7       3789      2.0
8       3910      2.0
9       4109      NaN
10      4307      NaN
11      4800      3.0
12      4912      3.0
13      5111      3.0
14      5341      3.0
15      5820      3.0
16      6003      NaN
17      ....      ....

The integers in column2 denote "groups" in column1, e.g. rows 1-4 are group "1", rows 7-8 are group "2", rows 11-15 are group "3", etc.

I would like to calculate the difference between the first row and last row in each group. The resulting dataframe would look like this:

df
>>>      column1  column2  column3
0        338      NaN      NaN
1        519      1.0      2174
2        871      1.0      2174
3       1731      1.0      2174
4       2693      1.0      2174
5       2963      NaN      NaN
6       3379      NaN      NaN
7       3789      2.0      121
8       3910      2.0      121
9       4109      NaN      NaN
10      4307      NaN      NaN
11      4800      3.0      1020
12      4912      3.0      1020
13      5111      3.0      1020
14      5341      3.0      1020
15      5820      3.0      1020
16      6003      NaN      NaN
17      ....      ....     ...

because:

2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020

What is the "pandas way" to calculate column3? Somehow, one must iterate through column2, looking for consecutive runs of rows where df.column2 is not NaN.

EDIT: I realized my example may lead readers to assume the values in column1 are only increasing. Actually, the data is split into intervals, given by an interval column:

df = pd.DataFrame(data)
df
>>>    interval      column1  column2
0      interval1     338      NaN
1      interval1     519      1.0
2      interval1     871      1.0
3      interval1     1731     1.0
4      interval1     2693     1.0
5      interval1     2963     NaN
6      interval1     3379     NaN
7      interval1     3789     2.0
8      interval1     3910     2.0
9      interval1     4109     NaN
10     interval1     4307     NaN
11     interval1     4800     3.0
12     interval1     4912     3.0
13     interval1     5111     3.0
14     interval1     5341     3.0
15     interval1     5820     3.0
16     interval1     6003     NaN
17     ....          ....     ....
18     interval2     12       13.0
19     interval2     115      13.0
20     interval2     275      NaN
....

asked Apr 20 '17 11:04

ShanZhengYang

1 Answer

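You can filter out the NaN rows first, then compute the difference between the last and first values of each group with transform:

# Keep only labelled rows, then broadcast (last - first) to every row of its group.
# The outer parentheses are needed so the chained calls can span several lines.
df['col3'] = (df[df.column2.notnull()]
                  .groupby('column2')['column1']
                  .transform(lambda x: x.iat[-1] - x.iat[0]))
print(df)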
    column1  column2    col3
0       338      NaN     NaN
1       519      1.0  2174.0
2       871      1.0  2174.0
3      1731      1.0  2174.0
4      2693      1.0  2174.0
5      2963      NaN     NaN
6      3379      NaN     NaN
7      3789      2.0   121.0
8      3910      2.0   121.0
9      4109      NaN     NaN
10     4307      NaN     NaN
11     4800      3.0  1020.0
12     4912      3.0  1020.0
13     5111      3.0  1020.0
14     5341      3.0  1020.0
15     5820      3.0  1020.0
16     6003      NaN     NaN
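
For readers who want to run this end to end, here is a minimal self-contained sketch of the same approach. It assumes only the 17 rows shown in the question (the elided values are left out), and the intermediate name `labelled` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rows 0-16 copied from the question; the elided rows are not reproduced.
df = pd.DataFrame({
    "column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910,
                4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003],
    "column2": [np.nan, 1, 1, 1, 1, np.nan, np.nan, 2, 2,
                np.nan, np.nan, 3, 3, 3, 3, 3, np.nan],
})

# Drop the unlabelled (NaN) rows so they never form a group.
labelled = df[df["column2"].notnull()]

# transform returns one value per labelled row, so the result aligns
# back to df by index; unlabelled rows are left as NaN.
df["col3"] = labelled.groupby("column2")["column1"].transform(
    lambda x: x.iat[-1] - x.iat[0]
)
print(df)
```

Because the assignment aligns on the index, col3 becomes a float column: the rows that were filtered out have no value and are filled with NaN.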

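EDIT1: With your new data, the same code works:

# Same approach as above; the interval column is not used for grouping here.
df['col3'] = (df[df.column2.notnull()]
                  .groupby('column2')['column1']
                  .transform(lambda x: x.iat[-1] - x.iat[0]))
print(df)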
     interval  column1  column2    col3
0   interval1      338      NaN     NaN
1   interval1      519      1.0  2174.0
2   interval1      871      1.0  2174.0
3   interval1     1731      1.0  2174.0
4   interval1     2693      1.0  2174.0
5   interval1     2963      NaN     NaN
6   interval1     3379      NaN     NaN
7   interval1     3789      2.0   121.0
8   interval1     3910      2.0   121.0
9   interval1     4109      NaN     NaN
10  interval1     4307      NaN     NaN
11  interval1     4800      3.0  1020.0
12  interval1     4912      3.0  1020.0
13  interval1     5111      3.0  1020.0
14  interval1     5341      3.0  1020.0
15  interval1     5820      3.0  1020.0
16  interval1     6003      NaN     NaN
18  interval2       12     13.0   103.0
19  interval2      115     13.0   103.0
20  interval2      275      NaN     NaN
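
The code above groups on column2 alone, which is fine as long as each group number appears in only one interval. The question does not say whether that holds. If a number could be reused across intervals, one possible variation (a sketch, not part of the original answer) is to group on both columns. The small frame below is hypothetical: it reuses a few values from the question but deliberately repeats label 1 in interval2 to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label 1 appears in both intervals on purpose.
df = pd.DataFrame({
    "interval": ["interval1"] * 4 + ["interval2"] * 3,
    "column1": [338, 519, 871, 2963, 12, 115, 275],
    "column2": [np.nan, 1, 1, np.nan, 1, 1, np.nan],
})

labelled = df[df["column2"].notnull()]

# Grouping on (interval, column2) keeps the two "1" groups separate.
df["col3"] = labelled.groupby(["interval", "column2"])["column1"].transform(
    lambda x: x.iat[-1] - x.iat[0]
)
print(df)
```

With grouping on column2 only, both "1" blocks in this hypothetical frame would be merged into a single group and would share one difference.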

answered Oct 04 '22 23:10

jezrael

