Count occurrences in each DataFrame row, then create a column with the most frequent value

Question

I am trying to compare the three floats in each row of a DataFrame of shape (500000, 3). I expect all three values to be the same, or at least two of them. I want to select the value that occurs most often, on the assumption that the three are never all different. My current attempt, with a toy example, looks like this:

mydf
   a  b  c
0  1  1  2
1  3  3  3
2  1  3  3
3  4  5  4
3  4  5  5



mydft = mydf.transpose()
counts = []
# After transposing, each column of mydft is one row of mydf
for col in mydft:
    counts.append(mydft[col].value_counts())

I am then thinking of looping over counts and selecting the top value for each, but this is very slow and does not feel like idiomatic pandas. I have also tried this:

truth = mydf['a'] == mydf['b']

with the intention of keeping the rows that evaluate to True and handling the rest separately, but the real data has thousands of NaN values, and NaN == NaN evaluates to False. Any suggestions?
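For illustration, the pairwise-comparison idea above can be extended to all three columns in a vectorized way. This is only a sketch under assumptions: the frame below is hypothetical (it mirrors the toy example with NaNs added), and a row with no two matching values is assigned NaN, which is a choice made here and not something the question specifies. Because NaN == NaN is False, a NaN simply never counts as agreeing with another value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in the spirit of the question, with NaNs added.
mydf = pd.DataFrame({
    "a": [1.0, 3.0, 1.0, np.nan, 7.0],
    "b": [1.0, 3.0, 3.0, 5.0, 8.0],
    "c": [2.0, 3.0, 3.0, 5.0, 9.0],
})

a, b, c = mydf["a"], mydf["b"], mydf["c"]

# Element-wise comparisons; any comparison involving NaN is False.
a_wins = (a == b) | (a == c)
b_wins = b == c

# Take a where it matches another column, else b where b matches c, else NaN.
majority = pd.Series(
    np.where(a_wins, a, np.where(b_wins, b, np.nan)),
    index=mydf.index,
)
print(majority)
```

With exactly three columns, three comparisons are enough to find a majority value, so no per-row loop is needed.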

BENY · Accepted Answer

We can use scipy.stats.mode, which computes the most frequent value along an axis:

from scipy import stats

# Row-wise mode; newer SciPy versions return 1-D arrays unless keepdims=True is passed
value, count = stats.mode(df.values, axis=1)
value
Out[180]: 
array([[1],
       [3],
       [3],
       [4],
       [5]], dtype=int64)


count
Out[181]: 
array([[2],
       [3],
       [2],
       [2],
       [2]])

Then assign it back:

df['new']=value
df
Out[183]: 
   a  b  c  new
0  1  1  2    1
1  3  3  3    3
2  1  3  3    3
3  4  5  4    4
3  4  5  5    5
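Since the question mentions NaN values, here is a pandas-only sketch of the same idea using DataFrame.mode with axis=1, which skips NaN by default (dropna=True). The frame below is an assumption for illustration: it follows the toy example with one NaN added. When a row has several equally frequent values, mode returns them in sorted order across extra columns, so taking column 0 picks the smallest one. This is likely slower than the NumPy-based approach on a 500000-row frame, so treat it as a convenience and not a performance recommendation.

```python
import numpy as np
import pandas as pd

# Toy frame modelled on the example above, with one NaN added for illustration.
df = pd.DataFrame({
    "a": [1.0, 3.0, 1.0, 4.0, np.nan],
    "b": [1.0, 3.0, 3.0, 5.0, 5.0],
    "c": [2.0, 3.0, 3.0, 4.0, 5.0],
})

# mode(axis=1) yields one row of modes per input row; column 0 is the first mode.
# NaNs are ignored because dropna defaults to True.
df["new"] = df[["a", "b", "c"]].mode(axis=1)[0]
print(df)
```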
