Here is my pandas DataFrame:
import pandas as pd
import numpy as np
data = {"column1": [338, 519, 871, 1731, 2693, 2963, 3379, 3789, 3910, 4109, 4307, 4800, 4912, 5111, 5341, 5820, 6003, ...],
"column2": [NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, NaN, NaN, 3, 3, 3, 3, 3, NaN, NaN], ...}
df = pd.DataFrame(data)
df
>>> column1 column2
0 338 NaN
1 519 1.0
2 871 1.0
3 1731 1.0
4 2693 1.0
5 2963 NaN
6 3379 NaN
7 3789 2.0
8 3910 2.0
9 4109 NaN
10 4307 NaN
11 4800 3.0
12 4912 3.0
13 5111 3.0
14 5341 3.0
15 5820 3.0
16 6003 NaN
17 .... ....
The integers in column2
denote "groups" in column1
, e.g. rows 1-4 is group "1", rows 7-8 is group "2", rows 11-15 is group "3", etc.
I would like to calculate the difference between the first row and last row in each group. The resulting dataframe would look like this:
df
>>> column1 column2 column3
0 338 NaN NaN
1 519 1.0 2174
2 871 1.0 2174
3 1731 1.0 2174
4 2693 1.0 2174
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121
8 3910 2.0 121
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020
12 4912 3.0 1020
13 5111 3.0 1020
14 5341 3.0 1020
15 5820 3.0 1020
16 6003 NaN NaN
17 .... .... ...
because:
2693-519 = 2174
3910-3789 = 121
5820-4800 = 1020
What is the "pandas way" to calculate column3
? Somehow, one must iterate through column3
, looking for consecutive groups of values such that df.column2 != "NaN"
.
EDIT: I realized my example may lead readers to assume the values in column1
are only increasing. Actually, there are intervals, column intervals
df = pd.DataFrame(data)
df
>>> interval column1 column2
0 interval1 338 NaN
1 interval1 519 1.0
2 interval1 871 1.0
3 interval1 1731 1.0
4 interval1 2693 1.0
5 interval1 2963 NaN
6 interval1 3379 NaN
7 interval1 3789 2.0
8 interval1 3910 2.0
9 interval1 4109 NaN
10 interval1 4307 NaN
11 interval1 4800 3.0
12 interval1 4912 3.0
13 interval1 5111 3.0
14 interval1 5341 3.0
15 interval1 5820 3.0
16 interval1 6003 NaN
17 .... ....
18 interval2 12 13
19 interval2 115 13
20 interval2 275 NaN
....
You can use the DataFrame. diff() function to find the difference between two rows in a pandas DataFrame. where: periods: The number of previous rows for calculating the difference.
You can get the first row with iloc[0] and the last row with iloc[-1] . If you want to get the value of the element, you can do with iloc[0]['column_name'] , iloc[-1]['column_name'] .
iloc[-2] will get you the penultimate row info for all columns. Where df. shape[0] gets your row count, and -2 removes 2 from it to give you the index number for your penultimate row. Then you give it the required column name to only return that value.
You can filter first and then get difference first and last value in transform
:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
column1 column2 col3
0 338 NaN NaN
1 519 1.0 2174.0
2 871 1.0 2174.0
3 1731 1.0 2174.0
4 2693 1.0 2174.0
5 2963 NaN NaN
6 3379 NaN NaN
7 3789 2.0 121.0
8 3910 2.0 121.0
9 4109 NaN NaN
10 4307 NaN NaN
11 4800 3.0 1020.0
12 4912 3.0 1020.0
13 5111 3.0 1020.0
14 5341 3.0 1020.0
15 5820 3.0 1020.0
16 6003 NaN NaN
EDIT1 by your new data:
df['col3'] = df[df.column2.notnull()]
.groupby('column2')['column1']
.transform(lambda x: x.iat[-1] - x.iat[0])
print (df)
interval column1 column2 col3
0 interval1 338 NaN NaN
1 interval1 519 1.0 2174.0
2 interval1 871 1.0 2174.0
3 interval1 1731 1.0 2174.0
4 interval1 2693 1.0 2174.0
5 interval1 2963 NaN NaN
6 interval1 3379 NaN NaN
7 interval1 3789 2.0 121.0
8 interval1 3910 2.0 121.0
9 interval1 4109 NaN NaN
10 interval1 4307 NaN NaN
11 interval1 4800 3.0 1020.0
12 interval1 4912 3.0 1020.0
13 interval1 5111 3.0 1020.0
14 interval1 5341 3.0 1020.0
15 interval1 5820 3.0 1020.0
16 interval1 6003 NaN NaN
18 interval2 12 13.0 103.0
19 interval2 115 13.0 103.0
20 interval2 275 NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With