Slice DataFrame using indices from other columns

Question

I have a DataFrame like this:

index   value   idxmin  idxmax
0       300     nan     nan
1       200     nan     nan
2       100     nan     nan
3       200     0       2
4       300     1       2
5       400     1       3
6       500     2       5
7       600     4       5
8       700     4       7
9       800     5       8
10      900     5       8
11      800     7       9
12      700     8       10
13      600     10      12
14      500     12      13
15      400     12      14
16      500     12      15
17      400     13      15
18      500     13      16
19      600     15      17
20      700     15      19
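
For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is a sketch: it assumes the "index" column is just the default integer index, and it stores idxmin and idxmax as floats because they contain NaN.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Sample data copied from the table above; the row index is the default 0..20.
df = pd.DataFrame({
    "value": [300, 200, 100, 200, 300, 400, 500, 600, 700, 800, 900,
              800, 700, 600, 500, 400, 500, 400, 500, 600, 700],
    "idxmin": [nan, nan, nan, 0, 1, 1, 2, 4, 4, 5, 5,
               7, 8, 10, 12, 12, 12, 13, 13, 15, 15],
    "idxmax": [nan, nan, nan, 2, 2, 3, 5, 5, 7, 8, 8,
               9, 10, 12, 13, 14, 15, 15, 16, 17, 19],
})

print(df.head())
```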

I want to create a new column (maxvalue) that returns the maximum of the "value" column over the row range given by idxmin and idxmax. Example: for row 9, the max of "value" over rows 5 to 8 is 700.

I wrote this code, which runs but is not efficient:

df['maxvalue'] = df.apply(lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)

Is there a more efficient way to do this?

The result I expect (last column):

index   value   idxmin  idxmax  maxvalue
0       300     nan     nan     nan
1       200     nan     nan     nan
2       100     nan     nan     nan
3       200     0       2       300
4       300     1       2       200
5       400     1       3       200
6       500     2       5       400
7       600     4       5       400
8       700     4       7       600
9       800     5       8       700
10      900     5       8       700
11      800     7       9       800
12      700     8       10      900
13      600     10      12      900
14      500     12      13      700
15      400     12      14      700
16      500     12      15      700
17      400     13      15      600
18      500     13      16      600
19      600     15      17      500
20      700     15      19      600

Many thanks for your help!

cs95 · Accepted Answer

This operation is inherently difficult to vectorize because the values are not sorted and the indices do not represent equally sized ranges. I can suggest turning this into a list comprehension to avoid the overhead of apply, but you're on your own after that.

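import numpy as np
import pandas as pd

# .loc slices include the end label, so the positional slice must end at e + 1.
df['maxvalue'] = [
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]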

df.head()
    index  value  idxmin  idxmax  maxvalue
0       0    300     NaN     NaN       NaN
1       1    200     NaN     NaN       NaN
2       2    100     NaN     NaN       NaN
3       3    200     0.0     2.0     300.0
4       4    300     1.0     2.0     200.0
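
One detail worth checking when moving from .loc to positional slicing: .loc slices are label-based and include the end label, while slicing a NumPy array excludes the end position. The small sketch below uses made-up variable names and the first six values from the question to show the difference:

```python
import pandas as pd

s = pd.Series([300, 200, 100, 200, 300, 400])

# Label-based slice: rows 2, 3, 4 and 5 are all included.
loc_max = s.loc[2:5].max()

# Positional slices on the underlying array: the end position is excluded.
arr = s.to_numpy()
exclusive_max = arr[2:5].max()      # covers positions 2, 3, 4
inclusive_max = arr[2:5 + 1].max()  # covers positions 2, 3, 4, 5

print(loc_max, exclusive_max, inclusive_max)  # 400 300 400
```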

To get the most out of this, move as much of the heavy lifting as possible from pandas to NumPy. I see a speedup of more than 15x on my machine on a fairly small DataFrame (the 21-row sample repeated 1000 times).

df_ = df
df = pd.concat([df_] * 1000, ignore_index=True)

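# Cell 1: the original apply-based approach
%timeit df.apply(
    lambda x: df['value'].loc[x['idxmin']:x['idxmax']].max(), axis=1)

# Cell 2: the list comprehension (run %%timeit as the first line of its own cell)
%%timeit
[
    df['value'].values[int(s):int(e) + 1].max() if pd.notna([s, e]).all()
    else np.nan for s, e in zip(df['idxmin'], df['idxmax'])
]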

4.79 s ± 68.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
268 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
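
If the list comprehension is still too slow, one possible next step is to push the loop itself into NumPy with np.maximum.reduceat. This is not part of the answer above, only a sketch: the helper name range_max is made up for illustration, and it assumes every non-NaN pair satisfies idxmin <= idxmax and that both bounds are valid positions in the array.

```python
import numpy as np

def range_max(values, starts, ends):
    """Max of values[s:e + 1] per (s, e) pair; NaN where a bound is missing."""
    values = np.asarray(values, dtype=float)
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    valid = ~(np.isnan(starts) | np.isnan(ends))

    # Interleave the bounds as [s0, e0 + 1, s1, e1 + 1, ...]. reduceat reduces
    # values[idx[i]:idx[i + 1]] when idx[i] < idx[i + 1], so every even slot
    # of its output holds the max of one requested range.
    idx = np.empty(2 * int(valid.sum()), dtype=np.intp)
    idx[0::2] = starts[valid].astype(np.intp)
    idx[1::2] = ends[valid].astype(np.intp) + 1

    # Append one sentinel so that e + 1 == len(values) is still a legal index.
    padded = np.append(values, -np.inf)

    out = np.full(len(starts), np.nan)
    if idx.size:
        out[valid] = np.maximum.reduceat(padded, idx)[0::2]
    return out

values = np.array([300, 200, 100, 200, 300, 400])
starts = np.array([np.nan, 0, 1, 2])
ends = np.array([np.nan, 2, 3, 5])
result = range_max(values, starts, ends)
print(result)  # nan, 300, 200, 400
```

With the DataFrame from the question, this would be called as df['maxvalue'] = range_max(df['value'], df['idxmin'], df['idxmax']). Whether it actually beats the list comprehension depends on the data size, so it is worth timing on your own data.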

Tags: python, pandas, dataframe, python-2.7