Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame: How to calculate the difference by first row and last row in group?

Here is my pandas DataFrame:

import pandas as pd
import numpy as np

data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
         "column2": [NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, NaN, NaN, 3, 3, 3, 3, 3, NaN, NaN], ...}

df = pd.DataFrame(data)
df
>>>      column1  column2
0        338      NaN
1        519      1.0
2        871      1.0
3       1731      1.0
4       2693      1.0
5       2963      NaN
6       3379      NaN
7       3789      2.0
8       3910      2.0
9       4109      NaN
10      4307      NaN
11      4800      3.0
12      4912      3.0
13      5111      3.0
14      5341      3.0
15      5820      3.0
16      6003      NaN
17      ....      ....

The integers in column2 denote "groups" in column1, e.g. rows 1-4 is group "1", rows 7-8 is group "2", rows 11-15 is group "3", etc.

I would like to calculate the difference between the first row and last row in each group. The resulting dataframe would look like this:

df
>>>      column1  column2  column3
0        338      NaN      NaN
1        519      1.0      2174
2        871      1.0      2174
3       1731      1.0      2174
4       2693      1.0      2174
5       2963      NaN      NaN
6       3379      NaN      NaN
7       3789      2.0      121
8       3910      2.0      121
9       4109      NaN      NaN
10      4307      NaN      NaN
11      4800      3.0      1020
12      4912      3.0      1020
13      5111      3.0      1020
14      5341      3.0      1020
15      5820      3.0      1020
16      6003      NaN      NaN
17      ....      ....     ...

because:

2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020

What is the "pandas way" to calculate column3? Somehow, one must iterate through column3, looking for consecutive groups of values such that df.column2 != "NaN".

EDIT: I realized my example may lead readers to assume the values in column1 are only increasing. Actually, there are intervals, column intervals

df = pd.DataFrame(data)
df
>>>    interval      column1  column2
0      interval1     338      NaN
1      interval1     519      1.0
2      interval1     871      1.0
3      interval1     1731      1.0
4      interval1     2693      1.0
5      interval1     2963      NaN
6      interval1     3379      NaN
7      interval1     3789      2.0
8      interval1     3910      2.0
9      interval1     4109      NaN
10     interval1     4307      NaN
11     interval1     4800      3.0
12     interval1     4912      3.0
13     interval1     5111      3.0
14     interval1     5341      3.0
15     interval1     5820      3.0
16     interval1     6003      NaN
17      ....      ....
18     interval2     12        13
19     interval2     115       13
20     interval2     275       NaN
....
like image 420
ShanZhengYang Avatar asked Apr 20 '17 11:04

ShanZhengYang


People also ask

How do you find the difference between two rows in pandas?

You can use the DataFrame. diff() function to find the difference between two rows in a pandas DataFrame. where: periods: The number of previous rows for calculating the difference.

How do you find the last and first row of a DataFrame?

You can get the first row with iloc[0] and the last row with iloc[-1] . If you want to get the value of the element, you can do with iloc[0]['column_name'] , iloc[-1]['column_name'] .

How do I select the second last row in pandas?

iloc[-2] will get you the penultimate row info for all columns. Where df. shape[0] gets your row count, and -2 removes 2 from it to give you the index number for your penultimate row. Then you give it the required column name to only return that value.


1 Answers

You can filter first and then get difference first and last value in transform:

df['col3'] = df[df.column2.notnull()]
                  .groupby('column2')['column1']
                  .transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
    column1  column2    col3
0       338      NaN     NaN
1       519      1.0  2174.0
2       871      1.0  2174.0
3      1731      1.0  2174.0
4      2693      1.0  2174.0
5      2963      NaN     NaN
6      3379      NaN     NaN
7      3789      2.0   121.0
8      3910      2.0   121.0
9      4109      NaN     NaN
10     4307      NaN     NaN
11     4800      3.0  1020.0
12     4912      3.0  1020.0
13     5111      3.0  1020.0
14     5341      3.0  1020.0
15     5820      3.0  1020.0
16     6003      NaN     NaN

EDIT1 by your new data:

df['col3'] = df[df.column2.notnull()]
                  .groupby('column2')['column1']
                  .transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
     interval  column1  column2    col3
0   interval1      338      NaN     NaN
1   interval1      519      1.0  2174.0
2   interval1      871      1.0  2174.0
3   interval1     1731      1.0  2174.0
4   interval1     2693      1.0  2174.0
5   interval1     2963      NaN     NaN
6   interval1     3379      NaN     NaN
7   interval1     3789      2.0   121.0
8   interval1     3910      2.0   121.0
9   interval1     4109      NaN     NaN
10  interval1     4307      NaN     NaN
11  interval1     4800      3.0  1020.0
12  interval1     4912      3.0  1020.0
13  interval1     5111      3.0  1020.0
14  interval1     5341      3.0  1020.0
15  interval1     5820      3.0  1020.0
16  interval1     6003      NaN     NaN
18  interval2       12     13.0   103.0
19  interval2      115     13.0   103.0
20  interval2      275      NaN     NaN
like image 59
jezrael Avatar answered Oct 04 '22 23:10

jezrael