How to get the most frequent row in a table

How to get the most frequent row in a DataFrame? For example, if I have the following table:

   col_1  col_2 col_3
0      1      1     A
1      1      0     A
2      0      1     A
3      1      1     A
4      1      0     B
5      1      0     C

Expected result:

   col_1  col_2 col_3
0      1      1     A

EDIT: I need the most frequent row (as one unit) and not the most frequent column value that can be calculated with the mode() method.

Mykola Zotko asked Sep 28 '20 14:09


4 Answers

Use groupby over all columns and take the largest group:

df.groupby(df.columns.tolist()).size().sort_values().tail(1).reset_index().drop(columns=0)
   col_1  col_2 col_3  
0      1      1     A  
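
For anyone who wants to run this end to end, here is a minimal sketch of the same groupby idea. The DataFrame is rebuilt from the question's table, and the names counts and most_frequent, as well as the explicit "count" column label, are my own choices for illustration rather than part of the answer above.

```python
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# Group on every column so each distinct row becomes one group,
# then store the group sizes in an explicitly named column
counts = df.groupby(df.columns.tolist()).size().reset_index(name="count")

# Select the row with the largest count and drop the helper column
most_frequent = (
    counts.loc[[counts["count"].idxmax()]]
    .drop(columns="count")
    .reset_index(drop=True)
)
print(most_frequent)
```

Naming the count column avoids depending on the default label 0 that reset_index() assigns to an unnamed Series. If several rows tie for the highest count, idxmax() returns only the first of them.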
BENY answered Oct 05 '22 03:10


With NumPy's np.unique -

In [92]: u,idx,c = np.unique(df.values.astype(str), axis=0, return_index=True, return_counts=True)

In [99]: df.iloc[[idx[c.argmax()]]]
Out[99]: 
   col_1  col_2 col_3
0      1      1     A

If you are looking for performance, convert the string column to numeric and then use np.unique -

a = np.c_[df.col_1, df.col_2, pd.factorize(df.col_3)[0]]
u,idx,c = np.unique(a, axis=0, return_index=True, return_counts=True)
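
The snippet above stops after computing the counts. A sketch of the full pipeline, assuming the question's sample table and using the same final selection step as in the string-based version, might look like this (the names codes and most_frequent are illustrative):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# Encode the string column as integer codes so the array is purely numeric
codes = pd.factorize(df["col_3"])[0]
a = np.c_[df["col_1"].to_numpy(), df["col_2"].to_numpy(), codes]

# Unique rows, the index of each row's first occurrence, and its count
u, idx, c = np.unique(a, axis=0, return_index=True, return_counts=True)

# Map the most frequent unique row back to the original DataFrame
most_frequent = df.iloc[[idx[c.argmax()]]]
print(most_frequent)
```

Because idx points back into the original frame, the result keeps the original column dtypes and index label instead of the encoded values.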
Divakar answered Oct 05 '22 03:10


You can do this with groupby and size:

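# Count occurrences of each distinct row; the counts land in a "size" column
counts = df.groupby(df.columns.tolist(), as_index=False).size()
# Keep the row with the highest count and drop the helper column
result = counts.loc[[counts["size"].idxmax()]].drop(columns="size")
result = result.reset_index(drop=True)  # this is just to reset the index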
DDD1 answered Oct 05 '22 02:10


The numpy_indexed library handles groupby-style problems with less code and performance similar to NumPy. Here is an alternative that closely resembles @Divakar's np.unique()-based solution:

import numpy_indexed as npi

arr = df.values.astype(str)
counts = npi.multiplicity(arr)       # how often each row occurs
output = df.iloc[[counts.argmax()]]  # first row with the highest count
mathfux answered Oct 05 '22 04:10