How do I get the most frequently occurring category (mode) for each tag in Python?
+------------------+----------+
| tag | category |
+------------------+----------+
| automotive | 8 |
| ba | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 10 |
| bamboo | 8 |
| bamboo | 9 |
| bamboo | 8 |
| bamboo | 10 |
| bamboo | 8 |
| bamboo | 9 |
| bamboo | 8 |
| banana tree | 8 |
| banana tree | 8 |
| banana tree | 8 |
| banana tree | 8 |
| bath | 9 |
+-----------------------------+
expected output would be like
tag | category
------------+-----------
ba | 8
automotive | 8
bananatree | 8
bath | 9
bamboo | 8
I have borrowed table from Stephen J. Fuhry and edited output from David Fuhry on MySQL SELECT most frequent by group due to confidentiality of my data set.
In the comments you note you're using pandas
. You can do something like the following:
>>> df
tag category
0 automotive 8
1 ba 8
2 bamboo 8
3 bamboo 8
4 bamboo 8
5 bamboo 8
6 bamboo 8
7 bamboo 10
8 bamboo 8
9 bamboo 9
10 bamboo 8
11 bamboo 10
12 bamboo 8
13 bamboo 9
14 bamboo 8
15 banana tree 8
16 banana tree 8
17 banana tree 8
18 banana tree 8
19 bath 9
Do a groupby
on 'tag' for the 'category' column and then within each group use the mode
method. However, we have to make it a conditional because pandas
doesn't return a number for the mode
if the number of observations is less than 3 (we can just return the group itself in the special cases of 1 or 2 observations in a group). We can use the aggregate/agg
method with a lambda function to do this:
>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)
tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath 9
Note, when the mode is multi-modal you will get a array (numpy). For example, suppose there were two entries for bath (all the other data is the same):
tag|category
bath|9
bath|10
In that case the output would be:
>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)
tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath [9, 10]
You can also use the value_counts
method instead of mode
. Once again, do a groupby
on 'tag' for the 'category' column and then within each group use the value_counts
method. value_counts
arranges in descending order so you want to grab the index of the first row:
>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])
tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath 9
However, this won't return an array in multi-modal situations. It will just return the first mode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With