I have a dataframe like this:
index value idxmin idxmax
0 300 nan nan
1 200 nan nan
2 100 nan nan
3 200 0 2
4 300 1 2
5 400 1 3
6 500 2 5
7 600 4 5
8 700 4 7
9 800 5 8
10 900 5 8
11 800 7 9
12 700 8 10
13 600 10 12
14 500 12 13
15 400 12 14
16 500 12 15
17 400 13 15
18 500 13 16
19 600 15 17
20 700 15 19
I want to create a new column (maxvalue) that holds the maximum of the "value" column over the range of rows from idxmin to idxmax (both inclusive). Example: for row 9, the max of "value" over rows 5 to 8 is 700.
I wrote this code, which runs but is not efficient:
df['maxvalue'] = df.apply(lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)
Is there a more efficient way to do this?
The result I expect (last column):
index value idxmin idxmax maxvalue
0 300 nan nan nan
1 200 nan nan nan
2 100 nan nan nan
3 200 0 2 300
4 300 1 2 200
5 400 1 3 200
6 500 2 5 400
7 600 4 5 400
8 700 4 7 600
9 800 5 8 700
10 900 5 8 700
11 800 7 9 800
12 700 8 10 900
13 600 10 12 900
14 500 12 13 700
15 400 12 14 700
16 500 12 15 700
17 400 13 15 600
18 500 13 16 600
19 600 15 17 500
20 700 15 19 600
Many thanks for your help !!
This operation is inherently difficult to vectorize because the array is not sorted, and the indices do not seem to represent equally sized ranges. I can suggest turning this into a list comprehension to circumvent the overhead from apply, but you're on your own after that.
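import numpy as np
import pandas as pd
# .loc slices include the end label, so the positional slice must stop at int(e) + 1.
df['maxvalue'] = [
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]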
df.head()
index value idxmin idxmax maxvalue
0 0 300 NaN NaN NaN
1 1 200 NaN NaN NaN
2 2 100 NaN NaN NaN
3 3 200 0.0 2.0 300.0
4 4 300 1.0 2.0 200.0
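Since head() only shows the first five rows, here is a self-contained sketch that checks the list comprehension against the full expected column from the question. It assumes idxmax is inclusive (as with .loc), so the positional slice ends at int(e) + 1. The sample data is copied from the question, and the names vals, expected and matches_expected are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
values = [300, 200, 100, 200, 300, 400, 500, 600, 700, 800, 900,
          800, 700, 600, 500, 400, 500, 400, 500, 600, 700]
idxmin = [np.nan] * 3 + [0, 1, 1, 2, 4, 4, 5, 5, 7, 8, 10,
                         12, 12, 12, 13, 13, 15, 15]
idxmax = [np.nan] * 3 + [2, 2, 3, 5, 5, 7, 8, 8, 9, 10, 12,
                         13, 14, 15, 15, 16, 17, 19]
# Expected "maxvalue" column from the question.
expected = [np.nan] * 3 + [300, 200, 200, 400, 400, 600, 700, 700, 800,
                           900, 900, 700, 700, 700, 600, 600, 500, 600]

df = pd.DataFrame({"value": values, "idxmin": idxmin, "idxmax": idxmax})

# Pull the NumPy array out once instead of on every iteration.
vals = df["value"].to_numpy()

# Assumption: idxmax is inclusive, hence the "+ 1" on the slice end.
df["maxvalue"] = [
    vals[int(s):int(e) + 1].max() if pd.notna(s) and pd.notna(e) else np.nan
    for s, e in zip(df["idxmin"], df["idxmax"])
]

# True when every row agrees with the expected column (NaN == NaN allowed).
matches_expected = np.allclose(df["maxvalue"].to_numpy(), expected, equal_nan=True)
print(matches_expected)
```

Extracting the array once before the loop avoids repeated attribute lookups on the DataFrame, which matters more as the number of rows grows.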
To get the most out of this, move as much of the heavy lifting as possible from pandas to NumPy. I see a speedup of more than 15x on my machine on a DataFrame built by repeating the sample 1000 times (21,000 rows).
df_ = df
df = pd.concat([df_] * 1000, ignore_index=True)
%timeit df.apply(
lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)
%%timeit
[
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]
4.79 s ± 68.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
268 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
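If the Python-level loop itself becomes the bottleneck, one possible way to move it into NumPy is np.maximum.reduceat. This is a different technique from the list comprehension, shown here only as a sketch. It assumes inclusive bounds with idxmin <= idxmax on every non-NaN row, and the helper name range_max_reduceat is hypothetical. I have not timed it, so measure it on your own data.

```python
import numpy as np

def range_max_reduceat(values, starts, ends):
    """Max of values[start:end + 1] per row; NaN where a bound is missing.

    Hypothetical helper. Assumes start <= end for every non-NaN pair.
    """
    values = np.asarray(values, dtype=float)
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)

    out = np.full(len(starts), np.nan)
    valid = ~(np.isnan(starts) | np.isnan(ends))
    if not valid.any():
        return out

    s = starts[valid].astype(np.intp)
    e = ends[valid].astype(np.intp) + 1  # convert inclusive end to exclusive

    # reduceat rejects indices equal to len(array), so pad with one element
    # that can never win a max.
    padded = np.append(values, -np.inf)

    # Interleave as [s0, e0, s1, e1, ...]. reduceat reduces each slice
    # padded[idx[i]:idx[i + 1]], so the even positions hold max(values[s:e]);
    # the odd positions are throwaway values.
    pairs = np.column_stack([s, e]).ravel()
    out[valid] = np.maximum.reduceat(padded, pairs)[::2]
    return out

# Sample data copied from the question.
values = [300, 200, 100, 200, 300, 400, 500, 600, 700, 800, 900,
          800, 700, 600, 500, 400, 500, 400, 500, 600, 700]
idxmin = [np.nan] * 3 + [0, 1, 1, 2, 4, 4, 5, 5, 7, 8, 10,
                         12, 12, 12, 13, 13, 15, 15]
idxmax = [np.nan] * 3 + [2, 2, 3, 5, 5, 7, 8, 8, 9, 10, 12,
                         13, 14, 15, 15, 16, 17, 19]

result = range_max_reduceat(values, idxmin, idxmax)
print(result)
```

With a DataFrame this would be called as df['maxvalue'] = range_max_reduceat(df['value'], df['idxmin'], df['idxmax']). Note that when a start is greater than its end, reduceat returns the single element at the start index rather than raising, so validate the bounds first if that can happen in your data.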