
Count occurrences in each DataFrame row, then create a column with the most frequent value

I am trying to compare the three floats in each row of a DataFrame of shape (500000, 3). I expect all three values to be the same, or at least two of them. I want to select the value that occurs most often, on the presumption that the three are not all different. My current attempt, with a toy example, looks like this:

mydf
   a  b  c
0  1  1  2
1  3  3  3
2  1  3  3
3  4  5  4
3  4  5  5



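# Transpose so that each original row becomes a column
mydft = mydf.transpose()
counts = []
for col in mydft:
    # value_counts() sorts by frequency, most common value first
    counts.append(mydft[col].value_counts())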

I am then thinking of looping over counts and selecting the top value for each column, but this is very slow and feels un-pandas-like. I have also tried this:

truth = mydf['a'] == mydf['b']

with the intention of keeping the rows that evaluate to True and doing something else with those that do not, but the real data contains thousands of NaN values, and apparently NaN == NaN evaluates to False. Any suggestions?

seanysull asked Mar 08 '23 08:03


1 Answer

We can use scipy.stats.mode along axis=1 to get the most frequent value in each row, together with its count:

from scipy import stats

# axis=1 computes the mode across the columns of each row
value, count = stats.mode(df.values, axis=1)
value
Out[180]: 
array([[1],
       [3],
       [3],
       [4],
       [5]], dtype=int64)


count
Out[181]: 
array([[2],
       [3],
       [2],
       [2],
       [2]])

Then assign the result back to the DataFrame as a new column:

df['new']=value
df
Out[183]: 
   a  b  c  new
0  1  1  2    1
1  3  3  3    3
2  1  3  3    3
3  4  5  4    4
3  4  5  5    5
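
Since the question mentions NaN values, here is a minimal sketch of a pandas-only alternative using DataFrame.mode(axis=1), which ignores NaN by default (dropna=True). The NaN placed in column a below is an assumed value added only for illustration and is not part of the original toy data. This is not the approach used above, and on 500,000 rows it may well be slower than the SciPy call, so treat it as one option to benchmark rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy data from the question, stored as floats, with one NaN added
# for illustration (the NaN is an assumption, not original data).
df = pd.DataFrame(
    {
        "a": [1.0, 3.0, 1.0, 4.0, np.nan],
        "b": [1.0, 3.0, 3.0, 5.0, 5.0],
        "c": [2.0, 3.0, 3.0, 4.0, 5.0],
    }
)

# mode(axis=1) returns the mode(s) of each row, skipping NaN by default.
# When a row has several modes they are spread over columns 0, 1, ...
# in sorted order, so column 0 always holds the smallest one.
row_modes = df[["a", "b", "c"]].mode(axis=1)

# Keep only the first mode of each row as the new column.
df["new"] = row_modes[0]
print(df)
```

If all three values in a row differ, every value is a mode and column 0 simply holds the smallest, so such rows would need separate handling.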
BENY answered Apr 30 '23 18:04