Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: select most frequent using group by [duplicate]

Tags:

python

pandas

How do I get the most frequently occurring category (mode) for each tag in Python?

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+-----------------------------+

expected output would be like

    tag     | category 
------------+-----------
 ba         |        8      
 automotive |        8      
 bananatree |        8        
 bath       |        9    
 bamboo     |        8 

I have borrowed table from Stephen J. Fuhry and edited output from David Fuhry on MySQL SELECT most frequent by group due to confidentiality of my data set.

like image 722
bensw Avatar asked May 16 '14 04:05

bensw


1 Answers

In the comments you note you're using pandas. You can do something like the following:

>>> df

           tag  category
0    automotive         8
1            ba         8
2        bamboo         8
3        bamboo         8
4        bamboo         8
5        bamboo         8
6        bamboo         8
7        bamboo        10
8        bamboo         8
9        bamboo         9
10       bamboo         8
11       bamboo        10
12       bamboo         8
13       bamboo         9
14       bamboo         8
15  banana tree         8
16  banana tree         8
17  banana tree         8
18  banana tree         8
19         bath         9

Do a groupby on 'tag' for the 'category' column and then within each group use the mode method. However, we have to make it a conditional because pandas doesn't return a number for the mode if the number of observations is less than 3 (we can just return the group itself in the special cases of 1 or 2 observations in a group). We can use the aggregate/agg method with a lambda function to do this:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

Note, when the mode is multi-modal you will get a array (numpy). For example, suppose there were two entries for bath (all the other data is the same):

tag|category
bath|9
bath|10

In that case the output would be:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive           8
ba                   8
bamboo               8
banana tree          8
bath           [9, 10]

You can also use the value_counts method instead of mode. Once again, do a groupby on 'tag' for the 'category' column and then within each group use the value_counts method. value_counts arranges in descending order so you want to grab the index of the first row:

>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

However, this won't return an array in multi-modal situations. It will just return the first mode.

like image 62
Karl D. Avatar answered Nov 11 '22 18:11

Karl D.