I am trying to compare the three floats in each row of a dataframe of shape (500000, 3). I expect the three values to be the same, or at least two of them. I want to select the value that occurs most often, on the presumption that they are not all different. My current attempt, with a toy example, looks like this:
mydf
a b c
0 1 1 2
1 3 3 3
2 1 3 3
3 4 5 4
3 4 5 5
mydft = mydf.transpose()
counts=[]
for col in mydft:
    counts.append(mydft[col].value_counts())
I am then thinking of looping over counts and selecting the top value for each, but this is very slow and feels anti-pandas. I have also tried this:
truth = mydf['a'] == mydf['b']
with the intention of keeping the rows that evaluate to True and doing something with those that do not, but I have thousands of NaN values in the real data, and apparently NaN == NaN is False. Any suggestions?
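The comparison idea can be made to work without a loop. The following is a minimal sketch, assuming that at least two of the three values agree in every row; the NaN placed in the last row and the names `a_in_majority` and `majority` are illustrative additions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Toy frame from the question as floats, with one NaN added for illustration (assumption).
mydf = pd.DataFrame({
    "a": [1.0, 3.0, 1.0, 4.0, np.nan],
    "b": [1.0, 3.0, 3.0, 5.0, 5.0],
    "c": [2.0, 3.0, 3.0, 4.0, 5.0],
})

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)

# If a matches b or c, a is the majority value.
# Otherwise b and c must agree (by assumption), so take b.
a_in_majority = (mydf["a"] == mydf["b"]) | (mydf["a"] == mydf["c"])
majority = pd.Series(
    np.where(a_in_majority, mydf["a"], mydf["b"]),
    index=mydf.index,
)
print(majority)
```

Because a NaN in `a` makes both comparisons False, such a row falls through to `b`, which is the agreed value whenever `b` and `c` match. Rows where all three values differ would silently return `b`, so this only fits data where the stated presumption holds.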
We can use mode from scipy.stats:
...
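from scipy import stats
# df is the question's mydf. On SciPy >= 1.9, pass keepdims=True to get the 2-D shape shown below.
value, count = stats.mode(df.values, axis=1)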
value
Out[180]:
array([[1],
[3],
[3],
[4],
[5]], dtype=int64)
count
Out[181]:
array([[2],
[3],
[2],
[2],
[2]])
Then assign it back:
df['new']=value
df
Out[183]:
a b c new
0 1 1 2 1
1 3 3 3 3
2 1 3 3 3
3 4 5 4 4
3 4 5 5 5
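Since the real data contains NaN values, a pandas-only variant may also be worth trying. This is a sketch rather than a tested replacement for the SciPy call: it relies on `DataFrame.mode(axis=1)`, which drops NaN by default (`dropna=True`), and the NaN in the last row is added here purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame as floats, with one NaN added for illustration (assumption).
df = pd.DataFrame({
    "a": [1.0, 3.0, 1.0, 4.0, np.nan],
    "b": [1.0, 3.0, 3.0, 5.0, 5.0],
    "c": [2.0, 3.0, 3.0, 4.0, 5.0],
})

# Row-wise mode; NaN is ignored because dropna defaults to True.
# The result has one column per tied mode, labelled 0, 1, ...
row_modes = df[["a", "b", "c"]].mode(axis=1)

# Column 0 holds the most frequent value (the smallest one if there is a tie).
df["new"] = row_modes[0]
print(df)
```

On a frame with 500000 rows this may be noticeably slower than the array-based call, so it is best timed on a sample first.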