I'd like to know if there's a way to find the location (column and row index) of the highest value in a dataframe. So if for example my dataframe looks like this:
A B C D E
0 100 9 1 12 6
1 80 10 67 15 91
2 20 67 1 56 23
3 12 51 5 10 58
4 73 28 72 25 1
How do I get a result that looks like this: [0, 'A']
using Pandas?
Pandas DataFrame idxmax() Method The idxmax() method returns a Series with the index of the maximum value for each column. By specifying the column axis ( axis='columns' ), the idxmax() method returns a Series with the index of the maximum value for each row.
To get the position of the maximum value in a range (i.e. a list, table, or row), you can use the MAX function together with the MATCH function. Which returns the number 4, representing the position in this list of the the most expensive property. The MAX function first extracts the maximum value from the range C3:C11.
To find maximum value of every row in DataFrame just call the max() member function with DataFrame object with argument axis=1 i.e. It returned a series with row index label and maximum value of each row.
np.argmax
NumPy's argmax
can be helpful:
>>> df.stack().index[np.argmax(df.values)]
(0, 'A')
df.values
is a two-dimensional NumPy array:
>>> df.values
array([[100, 9, 1, 12, 6],
[ 80, 10, 67, 15, 91],
[ 20, 67, 1, 56, 23],
[ 12, 51, 5, 10, 58],
[ 73, 28, 72, 25, 1]])
argmax
gives you the index for the maximum value for the "flattened" array:
>>> np.argmax(df.values)
0
Now, you can use this index to find the row-column location on the stacked dataframe:
>>> df.stack().index[0]
(0, 'A')
If you need it fast, do as few steps as possible.
Working only on the NumPy array to find the indices np.argmax
seems best:
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
Result:
[0, 'A']
Timing works best for lareg data frames:
df = pd.DataFrame(data=np.arange(int(1e6)).reshape(-1,5), columns=list('ABCDE'))
Sorted slowest to fastest:
%timeit df.mask(~(df==df.max().max())).stack().index.tolist()
33.4 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(df.stack().idxmax())
17.1 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.stack().index[np.argmax(df.values)]
14.8 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
4.45 ms ± 84.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
499 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
d = {'name': ['Mask', 'Stack-idmax', 'Stack-argmax', 'Where', 'Argmax-unravel_index'],
'time': [33.4, 17.1, 14.8, 4.45, 499],
'unit': ['ms', 'ms', 'ms', 'ms', 'µs']}
timings = pd.DataFrame(d)
timings['seconds'] = timings.time * timings.unit.map({'ms': 1e-3, 'µs': 1e-6})
timings['factor slower'] = timings.seconds / timings.seconds.min()
timings.sort_values('factor slower')
Output:
name time unit seconds factor slower
4 Argmax-unravel_index 499.00 µs 0.000499 1.000000
3 Where 4.45 ms 0.004450 8.917836
2 Stack-argmax 14.80 ms 0.014800 29.659319
1 Stack-idmax 17.10 ms 0.017100 34.268537
0 Mask 33.40 ms 0.033400 66.933868
So the "Argmax-unravel_index" version seems to be one to nearly two orders of magnitude faster for large data frames, i.e. where often speeds matters most.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With