Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find index of a value before the maximum for each column in python dataframe

i have a dataframe as below.

test = pd.DataFrame({'col1':[0,0,1,0,0,0,1,2,0], 'col2': [0,0,1,2,3,0,0,0,0]})
   col1  col2
0     0     0
1     0     0
2     1     1
3     0     2
4     0     3
5     0     0
6     1     0
7     2     0
8     0     0

For each column, i want to find the index of value 1 before the maximum of each column. For example, for the first column, the max is 2, the index of value 1 before 2 is 6. for the second column, the max is 3, the index of value 1 before the value 3 is 2.

In summary, I am looking to get [6, 2] as the output for this test DataFrame. Is there a quick way to achieve this?

like image 245
AAA Avatar asked Jun 13 '19 16:06

AAA


People also ask

How do you find the index with the maximum value of a DataFrame?

The idxmax() method returns a Series with the index of the maximum value for each column. By specifying the column axis ( axis='columns' ), the idxmax() method returns a Series with the index of the maximum value for each row.

How do you find the max value in a column in Python?

To find the maximum value of a column and to return its corresponding row values in Pandas, we can use df. loc[df[col]. idxmax()].


4 Answers

Use Series.mask to hide elements that aren't 1, then apply Series.last_valid_index to each column.

m = test.eq(test.max()).cumsum().gt(0) | test.ne(1) 
test.mask(m).apply(pd.Series.last_valid_index)

col1    6
col2    2
dtype: int64

Using numpy to vectorize, you can use numpy.cumsum and argmax:

idx = ((test.eq(1) & test.eq(test.max()).cumsum().eq(0))
            .values
            .cumsum(axis=0)
            .argmax(axis=0))
idx
# array([6, 2])

pd.Series(idx, index=[*test])

col1    6
col2    2
dtype: int64
like image 174
cs95 Avatar answered Sep 29 '22 01:09

cs95


Using @cs95 idea of last_valid_index:

test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())

Output:

col1    6
col2    2
dtype: int64

Expained:

Using index slicing to cut each column to start to max value, then look for the values that are equal to one and find the index of the last true value.

Or as @QuangHoang suggests:

test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax()) 
like image 40
Scott Boston Avatar answered Sep 29 '22 01:09

Scott Boston


Overkill with Numpy

t = test.to_numpy()
a = t.argmax(0)

i, j = np.where(t == 1)
mask = i <= a[j]
i = i[mask]
j = j[mask]

b = np.empty_like(a)
b.fill(-1)

np.maximum.at(b, j, i)

pd.Series(b, test.columns)

col1    6
col2    2
dtype: int64

apply

test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))

col1    6
col2    2
dtype: int64

cummax

test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()

col1    6
col2    2
dtype: int64

Timing

I just wanted to use a new tool and do some bechmarking see this post

Results

r.to_pandas_dataframe().T

         10        31        100       316       1000      3162      10000
al_0  0.003696  0.003718  0.005512  0.006210  0.010973  0.007764  0.012008
wb_0  0.003348  0.003334  0.003913  0.003935  0.004583  0.004757  0.006096
qh_0  0.002279  0.002265  0.002571  0.002643  0.002927  0.003070  0.003987
sb_0  0.002235  0.002246  0.003072  0.003357  0.004136  0.004083  0.005286
sb_1  0.001771  0.001779  0.002331  0.002353  0.002914  0.002936  0.003619
cs_0  0.005742  0.005751  0.006748  0.006808  0.007845  0.008088  0.009898
cs_1  0.004034  0.004045  0.004871  0.004898  0.005769  0.005997  0.007338
pr_0  0.002484  0.006142  0.027101  0.085944  0.374629  1.292556  6.220875
pr_1  0.003388  0.003414  0.003981  0.004027  0.004658  0.004929  0.006390
pr_2  0.000087  0.000088  0.000089  0.000093  0.000107  0.000145  0.000300

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot()
r.plot(ax=ax)

enter image description here

Setup

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

def al_0(test): return test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
def wb_0(df): return (df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
def qh_0(test): return (test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
def sb_0(test): return test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
def sb_1(test): return test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
def cs_0(test): return (lambda m: test.mask(m).apply(pd.Series.last_valid_index))(test.eq(test.max()).cumsum().gt(0) | test.ne(1))
def cs_1(test): return pd.Series((test.eq(1) & test.eq(test.max()).cumsum().eq(0)).values.cumsum(axis=0).argmax(axis=0), test.columns)
def pr_0(test): return test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))
def pr_1(test): return test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
def pr_2(test):
    t = test.to_numpy()
    a = t.argmax(0)

    i, j = np.where(t == 1)
    mask = i <= a[j]
    i = i[mask]
    j = j[mask]

    b = np.empty_like(a)
    b.fill(-1)

    np.maximum.at(b, j, i)

    return pd.Series(b, test.columns)

import math

def gen_test(n):
    a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
    idx = a.argmax(0)
    while (idx == 0).any():
        a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
        idx = a.argmax(0)        

    for j, i in enumerate(idx):
        a[np.random.randint(i), j] = 1

    return pd.DataFrame(a).add_prefix('col')

@b.add_arguments('DataFrame Size')
def argument_provider():
    for exponent in np.linspace(1, 3, 5):
        size = int(10 ** exponent)
        yield size, gen_test(size)

b.add_functions([al_0, wb_0, qh_0, sb_0, sb_1, cs_0, cs_1, pr_0, pr_1, pr_2])

r = b.run()
like image 28
piRSquared Avatar answered Sep 29 '22 01:09

piRSquared


A little bit logic here

(df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
Out[187]: 
col1    6
col2    2
dtype: int64
like image 21
BENY Avatar answered Sep 29 '22 00:09

BENY