Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Obtain `min` and `idxmin` (or `max` and `idxmax`) at the same time ("simultaneously")?

I was wondering if there is a possibility of calling idxmin and min at the same time (in the same call/loop).

Assuming the following dataframe:

    id  option_1    option_2    option_3    option_4
0   0   10.0        NaN         NaN         110.0
1   1   NaN         20.0        200.0       NaN
2   2   NaN         300.0       30.0        NaN
3   3   400.0       NaN         NaN         40.0
4   4   600.0       700.0       50.0        50.0

I would like to calculate the minimum value (min) and the column that contains it (idxmin) of the option_ series:

    id  option_1    option_2    option_3    option_4    min_column  min_value
0   0   10.0        NaN         NaN         110.0       option_1        10.0
1   1   NaN         20.0        200.0       NaN         option_2        20.0
2   2   NaN         300.0       30.0        NaN         option_3        30.0
3   3   400.0       NaN         NaN         40.0        option_4        40.0
4   4   600.0       700.0       50.0        50.0        option_3        50.0

Obviously, I can call idxmin and min separatedly (one after the other, see example below), but is there a way of making this more efficient without searching the matrix twice (one for the value and another for the index)?


An example calling min and idxmin

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [0,1,2,3,4], 
    'option_1': [10,     np.nan, np.nan, 400,    600], 
    'option_2': [np.nan, 20,     300,    np.nan, 700], 
    'option_3': [np.nan, 200,    30,     np.nan, 50],
    'option_4': [110,    np.nan, np.nan, 40,     50], 
})

df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)

(I expected this would be suboptimal as the search is performed twice.)

like image 921
toto_tico Avatar asked Aug 20 '18 14:08

toto_tico


3 Answers

Google Colab
GitHub

transpose then agg

df.set_index('id').T.agg(['min', 'idxmin']).T

  min    idxmin
0  10  option_1
1  20  option_2
2  30  option_3
3  40  option_4
4  50  option_3

Numpy v1

d_ = df.set_index('id')
v = d_.values
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    Idxmin=d_.columns[np.nanargmin(v, axis=1)]
), d_.index)

      Idxmin   Min
id                
0   option_1  10.0
1   option_2  20.0
2   option_3  30.0
3   option_4  40.0
4   option_3  50.0

Numpy v2

col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    IdxMin=options[np.nanargmin(v, axis=1)]
))

Full Simulation

Conclusion

The Numpy solutions are fastest.

Results

10 columns

         pir_agg_1  pir_agg_2  pir_agg_3  wen_agg_1  tot_agg_1  tot_agg_2
10       12.465358   1.272584        1.0   5.978435   2.168994   2.164858
30       26.538924   1.305721        1.0   5.331755   2.121342   2.193279
100      80.304708   1.277684        1.0   7.221127   2.215901   2.365835
300     230.009000   1.338177        1.0   5.869560   2.505447   2.576457
1000    661.432965   1.249847        1.0   8.931438   2.940030   3.002684
3000   1757.339186   1.349861        1.0  12.541915   4.656864   4.961188
10000  3342.701758   1.724972        1.0  15.287138   6.589233   6.782102

enter image description here

100 columns

        pir_agg_1  pir_agg_2  pir_agg_3  wen_agg_1  tot_agg_1  tot_agg_2
10       8.008895   1.000000   1.977989   5.612195   1.727308   1.769866
30      18.798077   1.000000   1.855291   4.350982   1.618649   1.699162
100     56.725786   1.000000   1.877474   6.749006   1.780816   1.850991
300    132.306699   1.000000   1.535976   7.779359   1.707254   1.721859
1000   253.771648   1.000000   1.232238  12.224478   1.855549   1.639081
3000   346.999495   2.246106   1.000000  21.114310   1.893144   1.626650
10000  431.135940   2.095874   1.000000  32.588886   2.203617   1.793076

enter image description here

Functions

def pir_agg_1(df):
  return df.set_index('id').T.agg(['min', 'idxmin']).T

def pir_agg_2(df):
  d_ = df.set_index('id')
  v = d_.values
  return pd.DataFrame(dict(
      Min=np.nanmin(v, axis=1),
      IdxMin=d_.columns[np.nanargmin(v, axis=1)]
  ))

def pir_agg_3(df):
  col_mask = df.columns.str.startswith('option')
  options = df.columns[col_mask]
  v = np.column_stack([*map(df.get, options)])
  return pd.DataFrame(dict(
      Min=np.nanmin(v, axis=1),
      IdxMin=options[np.nanargmin(v, axis=1)]
  ))

def wen_agg_1(df):
  v = df.filter(like='option')
  d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
  d.columns = ['IdxMin', 'Min']
  return d

def tot_agg_1(df):
  """I combined toto_tico's 2 filter calls into one"""
  d = df.filter(like='option')
  return df.assign(
      IdxMin=d.idxmin(1),
      Min=d.min(1)
  )

def tot_agg_2(df):
  d = df.filter(like='option')
  idxmin = d.idxmin(1)
  return df.assign(
      IdxMin=idxmin,
      Min=d.lookup(d.index, idxmin)
  )

Sim setup

def sim_df(n, m):
  return pd.DataFrame(
      np.random.randint(m, size=(n, m))
  ).rename_axis('id').add_prefix('option').reset_index()


fs = 'pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2'.split()
ix = [10, 30, 100, 300, 1000, 3000, 10000]

res_small_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
res_large_col = pd.DataFrame(index=ix, columns=fs, dtype=float)

for i in ix:
  df = sim_df(i, 10)
  for j in fs:
    stmt = f"{j}(df)"
    setp = f"from __main__ import {j}, df"
    res_small_col.at[i, j] = timeit(stmt, setp, number=10)

for i in ix:
  df = sim_df(i, 100)
  for j in fs:
    stmt = f"{j}(df)"
    setp = f"from __main__ import {j}, df"
    res_large_col.at[i, j] = timeit(stmt, setp, number=10)
like image 80
piRSquared Avatar answered Oct 11 '22 04:10

piRSquared


Maybe using stack with groupby

v=df.filter(like='option')
v.stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1)
Out[313]:
    level_1     0
0  option_1  10.0
1  option_2  20.0
2  option_3  30.0
3  option_4  40.0
4  option_3  50.0
like image 31
BENY Avatar answered Oct 11 '22 04:10

BENY


UPDATE 2:

The numpy solution of @piRSquared is the winner for what I would consider the most common cases. Here is his answers with a minimum modification to assign the columns to the original dataframe (which I did in all my tests, in order to be consistent with the example of the original question)

col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
          min_column = options[np.nanargmin(v, axis=1)])

You should be careful if you have a lot of columns (more than 10000), since in these extreme cases results could start changing significatively.

UPDATE 1:

According to my tests calling min and idxmin separatedly is the fastest you can do based on all the proposed answers.


Although it is not at the same time(see direct answer below), you should be better of using DataFrame.lookup on the column indexes (min_column colum), in order to avoid the search for values (min_values).

So, instead of traversing the entire matrix - which is O(n*m), you would only traverse the resulting min_column series - which is O(n):

df = pd.DataFrame({
    'id': [0,1,2,3,4], 
    'option_1': [10,     np.nan, np.nan, 400,    600], 
    'option_2': [np.nan, 20,     300,    np.nan, 700], 
    'option_3': [np.nan, 200,    30,     np.nan, 50],
    'option_4': [110,    np.nan, np.nan, 40,     50], 
})

df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])

Direct answer (not as efficient)

Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:

def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
    lambda x: min_idxmin(x), axis=1))

To be clear, although here the 2nd search is removed (replaced by a direct acccess in x[_idx]), this will highly likely take much longer because you are not exploitng the vectorizing properties of pandas/numpy.

Bottom line is pandas/numpy vectorized operations are very fast.



Summary of the summary:

There doesn't seem to be any advantage in using df.lookup, calling min and idxmin separatedly is better, than using the lookup which is mind blowing and deserves a question in itself.

Summary of the timings:

I tested a dataframe with 10000 rows and 10 columns (option_ sequence in the initial example). Since, I got a couple of unexpected result, I then also tested with 1000x1000, and 100x10000. According to the results:

  1. Using numpy as @piRSquared (test8) suggested is the clear winner, only start perfoming worse when there is a lot of columns (100, 10000, but does not justify the general use of it). The test9 modifies trying to using index in numpy, but it generally speaking performs worse.

  2. Calling min and idxmin separatedly was the best for the 10000x10 case, even better than the Dataframe.lookup (although, the Dataframe.lookup result performed better in the 100x10000 case). Although the shape of the data influence the results, I would argue that having 10000 columns is a bit unrealistic.

  3. The solution provided by @Wen followed in performance, though it was not better than calling idxmin and min separatedly, or using Dataframe.lookup. I did an extra test (see test7()) because I felt that the the addition of operation (reset_index and zip might be disturbing the result. It was still worse than test1 and test2, even though it does not do the assigment (I couldn't figure out how to make the assigment using the head(1)). @Wen, would you mind giving me a hand?

  4. @Wen solution underperfoms when there are more columns (1000x1000, or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.

  5. Any other solution with lambda expression, or that uses the transpose (T) falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 secs using the transpose T suggested by @piRSquared and @RafaelC.

TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):

Using the following dataframe of 10000 rows and 10 columns:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
  1. Calling the two columns twice separatedly:

    def test1():
        df['min_column'] = df.filter(like='option').idxmin(1)
        df['min_value'] = df.filter(like='option').min(1)
    %timeit -n 100 test1()
    13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  2. Calling the lookup (it is slower for this case!):

    def test2():
        df['min_column'] = df.filter(like='option').idxmin(1)
        df['min_value'] = df.lookup(df.index, df['min_column'])    
    %timeit -n 100 test2()
    # 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  3. Using apply and min_idxmin(x):

    def min_idxmin(x):
        _idx = x.idxmin()
        return _idx, x[_idx]
    
    def test3():
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
            lambda x: min_idxmin(x), axis=1))
    %timeit -n 10 test3()
    # 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
  4. Using agg['min', 'idxmin'] by @piRSquared:

    def test4():
        df['min_column'], df['min_value'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
    
    %timeit -n 1 test4()
    # 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  5. Using agg['min', 'idxmin'] by @RafaelC:

    def test5():
    
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)
        %timeit -n 1 test5()
        # 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  6. Sorting values by @Wen:

    def test6():
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
    
    %timeit -n 100 test6()
    # 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  7. Sorting values by @Wen modified by me to make the comparison fairer due to overload of assigment operation (I explained why in the summary at the beginning):

    def test7():
        df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
    
    %timeit -n 100 test7()
    # 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  8. Using numpy:

    def test8():
        col_mask = df.columns.str.startswith('option')
        options = df.columns[col_mask]
        v = np.column_stack([*map(df.get, options)])
        df.assign(min_value = np.nanmin(v, axis=1),
                  min_column = options[np.nanargmin(v, axis=1)])
    
    %timeit -n 100 test8()
    # 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  9. Using numpy but avoid the search (indexing instead):

    def test9():
        col_mask = df.columns.str.startswith('option')
        options = df.columns[col_mask]
        v = np.column_stack([*map(df.get, options)])
        idxmin = np.nanargmin(v, axis=1)
        # instead of looking for the answer, indexes are used
        df.assign(min_value = v[range(v.shape[0]), idxmin],
                  min_column = options[idxmin])
    
    %timeit -n 100 test9()
    # 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

TimeIt results with 1000 rows x 1000 columns:

I perform more test with a 1000x1000 shape:

df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()

Although the results change:

test1    ~27.6ms
test2    ~29.4ms
test3    ~135ms
test4    ~1.18s
test5    ~1.29s
test6    ~287ms
test7    ~290ms
test8    ~25.7
test9    ~26.1

TimeIt results with 100 rows x 10000 columns:

I perform more test with a 100x10000 shape:

df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()

Although the results change:

test1    ~46.8ms
test2    ~25.6ms
test3    ~101ms
test4    ~289ms
test5    ~276ms
test6    ~349ms
test7    ~301ms
test8    ~121ms
test9    ~122ms
like image 39
toto_tico Avatar answered Oct 11 '22 05:10

toto_tico