My original dataframe looks like this :
A B C
0.10 0.83 0.07
0.40 0.30 0.30
0.70 0.17 0.13
0.72 0.04 0.24
0.15 0.07 0.78
I would like to binarize each row: the column with the highest value gets 1 and the rest are set to 0, so the previous dataframe would become:
A B C
0 1 0
1 0 0
1 0 0
1 0 0
0 0 1
How can this be done ?
Thanks.
EDIT : I understand that a specific case made my question ambiguous. I should've said that in case 3 columns are equal for a given row, I'd still want to get a [1 0 0] vector and not [1 1 1] for that row.
Using numpy with argmax:
import numpy as np
import pandas as pd
m = np.zeros_like(df.values)
m[np.arange(len(df)), df.values.argmax(1)] = 1
df1 = pd.DataFrame(m, columns = df.columns).astype(int)
# Result
   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
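For anyone who wants to run this from scratch, here is a minimal self-contained sketch. Building `df` from a dict is an assumption about how the question's frame was created; the three lines in the middle are the answer's code unchanged.

```python
import numpy as np
import pandas as pd

# Values copied from the question; the construction itself is an assumption.
df = pd.DataFrame(
    {
        "A": [0.10, 0.40, 0.70, 0.72, 0.15],
        "B": [0.83, 0.30, 0.17, 0.04, 0.07],
        "C": [0.07, 0.30, 0.13, 0.24, 0.78],
    }
)

# Start from an all-zero array with the same shape as the data.
m = np.zeros_like(df.values)
# For row i, write a 1 into the column where that row's maximum sits.
m[np.arange(len(df)), df.values.argmax(1)] = 1
df1 = pd.DataFrame(m, columns=df.columns).astype(int)

print(df1)
```

Note that the result gets a fresh 0..n-1 index, so pass `index=df.index` to the `pd.DataFrame` call if the original row labels matter.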
Timings
df_test = pd.concat([df] * 1000)

def chris_z(df):
    m = np.zeros_like(df.values)
    m[np.arange(len(df)), df.values.argmax(1)] = 1
    return pd.DataFrame(m, columns = df.columns).astype(int)

def haleemur(df):
    return df.apply(lambda x: x == x.max(), axis=1).astype(int)

def haleemur_2(df):
    return pd.DataFrame((df.T == df.T.max()).T.astype(int), columns=df.columns)

def sacul(df):
    return pd.DataFrame(np.where(df.T == df.T.max(), 1, 0),index=df.columns).T
Results
In [320]: %timeit chris_z(df_test)
358 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit haleemur(df_test)
1.14 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [329]: %timeit haleemur_2(df_test)
972 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [333]: %timeit sacul(df_test)
1.01 ms ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
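The `%timeit` lines above need IPython. A rough way to repeat the comparison in a plain script is the standard-library `timeit` module. This is only a sketch: `ignore_index=True`, the run count of 20, and timing just two of the functions are assumptions made to keep it short, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [0.10, 0.40, 0.70, 0.72, 0.15],
        "B": [0.83, 0.30, 0.17, 0.04, 0.07],
        "C": [0.07, 0.30, 0.13, 0.24, 0.78],
    }
)
# Assumption: ignore_index=True gives the 5000-row test frame a unique index.
df_test = pd.concat([df] * 1000, ignore_index=True)


def chris_z(df):
    m = np.zeros_like(df.values)
    m[np.arange(len(df)), df.values.argmax(1)] = 1
    return pd.DataFrame(m, columns=df.columns).astype(int)


def haleemur_2(df):
    return pd.DataFrame((df.T == df.T.max()).T.astype(int), columns=df.columns)


# Average seconds per call over a small, arbitrary number of runs.
runs = 20
timings = {
    func.__name__: timeit.timeit(lambda: func(df_test), number=runs) / runs
    for func in (chris_z, haleemur_2)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```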
df.apply(lambda x: x == x.max(), axis=1).astype(int)
should do it. This works by checking whether each value equals the maximum of its row (axis=1 applies the lambda row by row), and then casting to integer (True -> 1, False -> 0).
Instead of apply-ing a lambda row-wise, it is also possible to transpose the dataframe, compare to max, and then transpose back:
(df.T == df.T.max()).T.astype(int)
And lastly, a very fast numpy based solution:
pd.DataFrame((df.T.values == np.amax(df.values, 1)).T*1, columns = df.columns)
The output is in all cases:
   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
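One caveat relating to the EDIT in the question: comparing against the row maximum marks every column that equals it, so a row with tied values gets more than one 1. The sketch below uses a hypothetical two-row frame (not the question's data) to contrast the transpose-and-compare expression with an `argmax`-based variant, which keeps only the first maximum. The names `by_equality` and `by_argmax` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has all three values tied.
df = pd.DataFrame({"A": [0.10, 0.5], "B": [0.83, 0.5], "C": [0.07, 0.5]})

# Equality with the row maximum: every tied column is flagged.
by_equality = (df.T == df.T.max()).T.astype(int)

# argmax returns the first position of the maximum, so exactly one 1 per row.
first_max = df.to_numpy().argmax(axis=1)
# Row k of the identity matrix is the one-hot vector for column k.
one_hot = np.eye(df.shape[1], dtype=int)[first_max]
by_argmax = pd.DataFrame(one_hot, index=df.index, columns=df.columns)

print(by_equality)
print(by_argmax)
```

With this input the tied row comes out as [1 1 1] from the equality version and as [1 0 0] from the `argmax` version, so the latter is the one that matches the EDIT.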