
Setting highest value in row to 1 and rest to 0 in pandas

My original dataframe looks like this :

A       B       C
0.10    0.83    0.07
0.40    0.30    0.30
0.70    0.17    0.13    
0.72    0.04    0.24    
0.15    0.07    0.78    

I would like to binarize each row: the column with the highest value gets 1 and the rest are set to 0, so the dataframe above would become:

A   B   C
0   1   0
1   0   0
1   0   0   
1   0   0   
0   0   1   

How can this be done ?
Thanks.

EDIT: I understand that a specific case made my question ambiguous. I should have said that if all 3 columns are equal in a given row, I'd still want a [1 0 0] vector for that row, not [1 1 1].

mlx asked Dec 13 '22 16:12


2 Answers

Using numpy with argmax

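import numpy as np
import pandas as pd

# All-zero array shaped like df, then a 1 at each row's argmax position
m = np.zeros_like(df.values)
m[np.arange(len(df)), df.values.argmax(1)] = 1

df1 = pd.DataFrame(m, columns = df.columns).astype(int)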

# Result


   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
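
To check the tie case from the question's edit, here is a minimal self-contained sketch of the same argmax idea. The sixth, all-equal row is made up for illustration and is not in the original data. Since `np.argmax` returns the first occurrence of the maximum, a tied row should get its 1 in the leftmost tied column.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one hypothetical all-equal row (not in the original).
df = pd.DataFrame({
    "A": [0.10, 0.40, 0.70, 0.72, 0.15, 0.50],
    "B": [0.83, 0.30, 0.17, 0.04, 0.07, 0.50],
    "C": [0.07, 0.30, 0.13, 0.24, 0.78, 0.50],
})

values = df.to_numpy()
# Start from an all-zero integer matrix of the same shape.
onehot = np.zeros(values.shape, dtype=int)
# For each row i, set the column at argmax(row i) to 1.
# argmax picks the first maximum, so a tied row gets exactly one 1.
onehot[np.arange(len(df)), values.argmax(axis=1)] = 1

result = pd.DataFrame(onehot, columns=df.columns, index=df.index)
print(result)
```

Every row of `result` should sum to 1, including the tied one.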

Timings

df_test = pd.concat([df] * 1000)

def chris_z(df):
    m = np.zeros_like(df.values)
    m[np.arange(len(df)), df.values.argmax(1)] = 1
    return pd.DataFrame(m, columns = df.columns).astype(int)

def haleemur(df):
    return df.apply(lambda x: x == x.max(), axis=1).astype(int)

def haleemur_2(df):
    return pd.DataFrame((df.T == df.T.max()).T.astype(int), columns=df.columns)

def sacul(df):
    return pd.DataFrame(np.where(df.T == df.T.max(), 1, 0),index=df.columns).T

Results

In [320]: %timeit chris_z(df_test)
358 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [321]: %timeit haleemur(df_test)
1.14 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [329]: %timeit haleemur_2(df_test)
972 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [333]: %timeit sacul(df_test)
1.01 ms ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
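
For anyone who wants to rerun a comparison like this outside IPython, the following is a rough sketch using the standard-library `timeit` module instead of `%timeit`. The function names `argmax_onehot` and `rowmax_compare` are hypothetical stand-ins, not the functions timed above, and `ignore_index=True` is an assumption added so the test frame has unique row labels. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.10, 0.40, 0.70, 0.72, 0.15],
    "B": [0.83, 0.30, 0.17, 0.04, 0.07],
    "C": [0.07, 0.30, 0.13, 0.24, 0.78],
})
# Assumption: ignore_index=True so the 5000-row test frame has unique row labels.
df_test = pd.concat([df] * 1000, ignore_index=True)


def argmax_onehot(frame):
    """Put a 1 at the position of the first maximum of each row."""
    values = frame.to_numpy()
    out = np.zeros(values.shape, dtype=int)
    out[np.arange(len(frame)), values.argmax(axis=1)] = 1
    return pd.DataFrame(out, columns=frame.columns, index=frame.index)


def rowmax_compare(frame):
    """Compare every value with the maximum of its row."""
    return frame.eq(frame.max(axis=1), axis=0).astype(int)


# The sample rows contain no ties, so both functions should agree here.
same = argmax_onehot(df_test).equals(rowmax_compare(df_test))

# Average seconds per call over a small number of runs.
runs = 20
timings = {}
for fn in (argmax_onehot, rowmax_compare):
    timings[fn.__name__] = timeit.timeit(lambda: fn(df_test), number=runs) / runs

print(same)
print(timings)
```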
user3483203 answered Dec 29 '22 12:12


df.apply(lambda x: x == x.max(), axis=1).astype(int)

should do it. This works by checking whether each value equals the maximum of its row, and then casting to integer (True -> 1, False -> 0). Note that if several columns tie for the row maximum, each of them gets a 1.

Instead of applying a lambda row-wise, it is also possible to transpose the dataframe, compare it to its column-wise max, and then transpose back:

(df.T == df.T.max()).T.astype(int)
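
The double transpose can probably be avoided with `DataFrame.eq`, whose `axis` argument controls how a Series is aligned against the frame. This is a minimal sketch on the question's data, not part of the original answer:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [0.10, 0.40, 0.70, 0.72, 0.15],
    "B": [0.83, 0.30, 0.17, 0.04, 0.07],
    "C": [0.07, 0.30, 0.13, 0.24, 0.78],
})

# df.max(axis=1) holds one maximum per row; axis=0 aligns it on the row index,
# so every value is compared with the maximum of its own row.
binarized = df.eq(df.max(axis=1), axis=0).astype(int)
print(binarized)
```

Like the transpose version, this marks every column that equals the row maximum, so tied rows get more than one 1.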

And lastly, a very fast numpy based solution:

pd.DataFrame((df.T.values == np.amax(df.values, 1)).T*1, columns = df.columns)

The output is in all cases:

   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
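
One caveat relative to the question's edit: the equality-based expressions mark every column equal to the row maximum, so a fully tied row comes out as [1 1 1]. Below is a pandas-only sketch that keeps just the first maximum by using `idxmax`, which returns the label of the first occurrence of the maximum. The tied third row is made up for illustration.

```python
import pandas as pd

# Two rows from the question plus one hypothetical all-equal row.
df = pd.DataFrame({
    "A": [0.10, 0.40, 0.50],
    "B": [0.83, 0.30, 0.50],
    "C": [0.07, 0.30, 0.50],
})

# Equality with the row maximum marks every tied column.
by_equality = (df.T == df.T.max()).T.astype(int)

# idxmax gives, per row, the label of the first column holding the maximum.
first_max = df.idxmax(axis=1)
# Compare each winning label (as a column vector) with all column labels (as a row vector).
by_idxmax = pd.DataFrame(
    first_max.to_numpy()[:, None] == df.columns.to_numpy()[None, :],
    index=df.index,
    columns=df.columns,
).astype(int)

print(by_equality)
print(by_idxmax)
```

On rows without ties the two results agree; they differ only on the tied row.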
Haleemur Ali answered Dec 29 '22 12:12