Slice DataFrame using indices from other columns

I have a DataFrame like this:

index   value   idxmin  idxmax
0       300     nan     nan
1       200     nan     nan
2       100     nan     nan
3       200     0       2
4       300     1       2
5       400     1       3
6       500     2       5
7       600     4       5
8       700     4       7
9       800     5       8
10      900     5       8
11      800     7       9
12      700     8       10
13      600     10      12
14      500     12      13
15      400     12      14
16      500     12      15
17      400     13      15
18      500     13      16
19      600     15      17
20      700     15      19

I want to create a new column (maxvalue) that returns the maximum of the "value" column over the row range given by idxmin and idxmax, both ends included. Example: for row 9, the max of "value" over rows 5 to 8 is 700.

I wrote this code, which works but is not efficient:

df['maxvalue'] = df.apply(lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)

Is there a more efficient way to do this?

The result I expect (last column) :

index   value   idxmin  idxmax  maxvalue
0       300     nan     nan     nan
1       200     nan     nan     nan
2       100     nan     nan     nan
3       200     0       2       300
4       300     1       2       200
5       400     1       3       200
6       500     2       5       400
7       600     4       5       400
8       700     4       7       600
9       800     5       8       700
10      900     5       8       700
11      800     7       9       800
12      700     8       10      900
13      600     10      12      900
14      500     12      13      700
15      400     12      14      700
16      500     12      15      700
17      400     13      15      600
18      500     13      16      600
19      600     15      17      500
20      700     15      19      600

Many thanks for your help !!

asked Feb 06 '26 04:02 by bobo


1 Answer

This operation is inherently difficult to vectorize because the array is not sorted, and the indices do not seem to represent equally sized ranges. I can suggest turning this into a list comprehension to circumvent the overhead from apply, but you're on your own after that.

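import numpy as np
import pandas as pd

# .loc slices include the end label, so the positional slice needs e + 1
df['maxvalue'] = [
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]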

df.head()
    index  value  idxmin  idxmax  maxvalue
0       0    300     NaN     NaN       NaN
1       1    200     NaN     NaN       NaN
2       2    100     NaN     NaN       NaN
3       3    200     0.0     2.0     300.0
4       4    300     1.0     2.0     200.0
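One detail worth checking when moving from `.loc` to NumPy slicing: label-based `.loc[a:b]` includes row `b`, while positional slicing `values[a:b]` stops before it. The sketch below rebuilds the sample data from the question (the `expected` list is copied from the question's last column), pulls the columns out as arrays once, and uses an end bound of `e + 1` so that both ends are included. The helper name `inclusive_range_max` is made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "value": [300, 200, 100, 200, 300, 400, 500, 600, 700, 800, 900,
              800, 700, 600, 500, 400, 500, 400, 500, 600, 700],
    "idxmin": [nan, nan, nan, 0, 1, 1, 2, 4, 4, 5, 5,
               7, 8, 10, 12, 12, 12, 13, 13, 15, 15],
    "idxmax": [nan, nan, nan, 2, 2, 3, 5, 5, 7, 8, 8,
               9, 10, 12, 13, 14, 15, 15, 16, 17, 19],
})
# Last column of the question's expected result.
expected = [nan, nan, nan, 300, 200, 200, 400, 400, 600, 700, 700,
            800, 900, 900, 700, 700, 700, 600, 600, 500, 600]


def inclusive_range_max(values, starts, ends):
    """Max of values[s..e] (both ends included) per row; NaN if a bound is NaN."""
    return [
        values[int(s):int(e) + 1].max()
        if not (np.isnan(s) or np.isnan(e)) else np.nan
        for s, e in zip(starts, ends)
    ]


# Extract the NumPy arrays once, outside the loop.
values = df["value"].to_numpy()
df["maxvalue"] = inclusive_range_max(
    values, df["idxmin"].to_numpy(), df["idxmax"].to_numpy()
)

# Row 6 covers rows 2..5: the inclusive slice reaches 400,
# while the exclusive slice stops one row early at 300.
print(values[2:5 + 1].max(), values[2:5].max())
```

The two slicing conventions happen to agree on the first five rows shown by `df.head()`, so the difference only shows up from row 6 onward.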

To get the most out of this, transfer as much of the heavy lifting as possible from pandas to NumPy. I see a speedup of more than 15x on my machine with the sample DataFrame repeated 1000 times (21,000 rows).

df_ = df
df = pd.concat([df_] * 1000, ignore_index=True)

%timeit df.apply(
    lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)

%%timeit
[
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]

4.79 s ± 68.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
268 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
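
If the Python-level loop is still too slow, one option that may be worth trying is `np.maximum.reduceat`, which reduces over the slices defined by consecutive index pairs. What follows is a minimal sketch under assumptions not stated in the question: every non-NaN pair satisfies `idxmin <= idxmax`, the bounds are valid row positions, and the end bound is inclusive (as with `.loc`). The function name `range_max_reduceat` is illustrative, and it has not been benchmarked against the timings above.

```python
import numpy as np


def range_max_reduceat(values, starts, ends):
    """Max of values[s..e] (inclusive) for each pair; NaN where a bound is NaN."""
    values = np.asarray(values, dtype=float)
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)

    out = np.full(starts.shape, np.nan)
    valid = ~(np.isnan(starts) | np.isnan(ends))
    if not valid.any():
        return out

    # Pad with -inf so an exclusive end equal to len(values) is a legal index.
    padded = np.append(values, -np.inf)

    # Interleave the bounds as [s0, e0 + 1, s1, e1 + 1, ...].
    idx = np.empty(2 * int(valid.sum()), dtype=np.intp)
    idx[0::2] = starts[valid].astype(np.intp)
    idx[1::2] = ends[valid].astype(np.intp) + 1

    # Even positions hold max(padded[s:e + 1]); odd positions are discarded.
    out[valid] = np.maximum.reduceat(padded, idx)[0::2]
    return out


# Small demonstration on the first nine values from the question.
values = [300, 200, 100, 200, 300, 400, 500, 600, 700]
starts = [np.nan, 0, 1, 2, 4, 5]
ends = [np.nan, 2, 3, 5, 7, 8]
result = range_max_reduceat(values, starts, ends)
print(result)
```

On the question's frame this would be called as `df['maxvalue'] = range_max_reduceat(df['value'], df['idxmin'], df['idxmax'])`. The total work still grows with the combined length of all ranges, so this only moves the loop into compiled code rather than changing the algorithm.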
answered Feb 12 '26 18:02 by cs95


