Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding max occurrence of a column's value, after group-by on another column

I have a pandas data-frame:

        id                city
 [email protected]   Bangalore
 [email protected]       Mumbai
[email protected]      Jamshedpur
[email protected]      Jamshedpur
[email protected]    Bangalore
  [email protected]      Mumbai
  [email protected]      Vijayawada
  [email protected]      Vijayawada
  [email protected]      Vijayawada

I want to find id-wise the maximum occurring city name. So that for a given id I can tell that - this is his favorite city:

         id             city
[email protected]   Bangalore
[email protected]       Vijayawada
[email protected]     Jamshedpur

Using groupby id and city gives:

         id                   city       count
0  [email protected]       Bangalore    2
1      [email protected]        Mumbai      2
2      [email protected]      Vijayawada    3
3    [email protected]      Jamshedpur    2

How to proceed further? I believe some group-by apply will do that but unaware of what exactly will do the trick. So please suggest.

If some id has same count for two or three cities I am ok with returning any of those cities.

like image 369
Satya Avatar asked Mar 23 '16 09:03

Satya


2 Answers

You can try double groupby with size and idxmax. Output is list of tuples (because MultiIndex), so use apply:

df = df.groupby(['id','city']).size().groupby(level=0).idxmax()
                              .apply(lambda x: x[1]).reset_index(name='city')

Another solutions:

s = df.groupby(['id','city']).size()
df = s.loc[s.groupby(level=0).idxmax()].reset_index().drop(0,axis=1)

Or:

df = df.groupby(['id'])['city'].apply(lambda x: x.value_counts().index[0]).reset_index()

print (df)
                     id        city
0  [email protected]   Bangalore
1      [email protected]  Vijayawada
2    [email protected]  Jamshedpur
like image 133
jezrael Avatar answered Nov 14 '22 21:11

jezrael


The recommended approach is groupby('id').apply(your_custom_function), where your_custom_function aggregates by 'city' and returns the max value (or as you mentioned, multiple max values). We don't even have to use .agg('city')

import pandas as pd

def get_top_city(g):
    return g['city'].value_counts().idxmax()    

df = pd.DataFrame.from_records(
         [('[email protected]', 'Bangalore'), ('[email protected]',     'Mumbai'),
         ('[email protected]',   'Jamshedpur'),('[email protected]',   'Jamshedpur'),
         ('[email protected]', 'Bangalore'), ('[email protected]',     'Mumbai'),
         ('[email protected]',     'Vijayawada'),('[email protected]',     'Vijayawada'),
         ('[email protected]',     'Vijayawada')],
         columns=['id','city'],
         index=None
     )

topdf = df.groupby('id').apply(get_top_city)

id
[email protected]     Bangalore
[email protected]        Vijayawada
[email protected]      Jamshedpur

# or topdf.items()/iteritems() if you want as list of (id,city) tuples

[('[email protected]', 'Bangalore'), ('[email protected]', 'Vijayawada'), ('[email protected]', 'Jamshedpur')]
like image 44
smci Avatar answered Nov 14 '22 21:11

smci