Python: select most frequent using group by [duplicate]

Question

How do I get the most frequently occurring category (mode) for each tag in Python?

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+-----------------------------+

expected output would be like

    tag     | category 
------------+-----------
 ba         |        8      
 automotive |        8      
 bananatree |        8        
 bath       |        9    
 bamboo     |        8

I have borrowed table from Stephen J. Fuhry and edited output from David Fuhry on MySQL SELECT most frequent by group due to confidentiality of my data set.

Karl D. · Accepted Answer

In the comments you note you're using pandas. You can do something like the following:

>>> df

           tag  category
0    automotive         8
1            ba         8
2        bamboo         8
3        bamboo         8
4        bamboo         8
5        bamboo         8
6        bamboo         8
7        bamboo        10
8        bamboo         8
9        bamboo         9
10       bamboo         8
11       bamboo        10
12       bamboo         8
13       bamboo         9
14       bamboo         8
15  banana tree         8
16  banana tree         8
17  banana tree         8
18  banana tree         8
19         bath         9

Do a groupby on 'tag' for the 'category' column and then within each group use the mode method. However, we have to make it a conditional because pandas doesn't return a number for the mode if the number of observations is less than 3 (we can just return the group itself in the special cases of 1 or 2 observations in a group). We can use the aggregate/agg method with a lambda function to do this:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

Note, when the mode is multi-modal you will get a array (numpy). For example, suppose there were two entries for bath (all the other data is the same):

tag|category
bath|9
bath|10

In that case the output would be:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive           8
ba                   8
bamboo               8
banana tree          8
bath           [9, 10]

You can also use the value_counts method instead of mode. Once again, do a groupby on 'tag' for the 'category' column and then within each group use the value_counts method. value_counts arranges in descending order so you want to grab the index of the first row:

>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

However, this won't return an array in multi-modal situations. It will just return the first mode.

Python: select most frequent using group by [duplicate]

Tags:

python

pandas

bensw

1 Answers

Karl D.

Recent Activity

Donate For Us

Python: select most frequent using group by [duplicate]

Tags:

python

pandas

bensw

1 Answers

Karl D.

Related questions

Recent Activity

Donate For Us