I have a pandas data-frame:
id city
[email protected] Bangalore
[email protected] Mumbai
[email protected] Jamshedpur
[email protected] Jamshedpur
[email protected] Bangalore
[email protected] Mumbai
[email protected] Vijayawada
[email protected] Vijayawada
[email protected] Vijayawada
I want to find id-wise the maximum occurring city name. So that for a given id I can tell that - this is his favorite city:
id city
[email protected] Bangalore
[email protected] Vijayawada
[email protected] Jamshedpur
Using groupby id and city gives:
id city count
0 [email protected] Bangalore 2
1 [email protected] Mumbai 2
2 [email protected] Vijayawada 3
3 [email protected] Jamshedpur 2
How to proceed further? I believe some group-by apply will do that but unaware of what exactly will do the trick. So please suggest.
If some id has same count for two or three cities I am ok with returning any of those cities.
You can try double groupby
with size
and idxmax
. Output is list of tuples (because MultiIndex
), so use apply
:
df = df.groupby(['id','city']).size().groupby(level=0).idxmax()
.apply(lambda x: x[1]).reset_index(name='city')
Another solutions:
s = df.groupby(['id','city']).size()
df = s.loc[s.groupby(level=0).idxmax()].reset_index().drop(0,axis=1)
Or:
df = df.groupby(['id'])['city'].apply(lambda x: x.value_counts().index[0]).reset_index()
print (df)
id city
0 [email protected] Bangalore
1 [email protected] Vijayawada
2 [email protected] Jamshedpur
The recommended approach is groupby('id').apply(your_custom_function)
, where your_custom_function aggregates by 'city' and returns the max value (or as you mentioned, multiple max values). We don't even have to use .agg('city')
import pandas as pd
def get_top_city(g):
return g['city'].value_counts().idxmax()
df = pd.DataFrame.from_records(
[('[email protected]', 'Bangalore'), ('[email protected]', 'Mumbai'),
('[email protected]', 'Jamshedpur'),('[email protected]', 'Jamshedpur'),
('[email protected]', 'Bangalore'), ('[email protected]', 'Mumbai'),
('[email protected]', 'Vijayawada'),('[email protected]', 'Vijayawada'),
('[email protected]', 'Vijayawada')],
columns=['id','city'],
index=None
)
topdf = df.groupby('id').apply(get_top_city)
id
[email protected] Bangalore
[email protected] Vijayawada
[email protected] Jamshedpur
# or topdf.items()/iteritems() if you want as list of (id,city) tuples
[('[email protected]', 'Bangalore'), ('[email protected]', 'Vijayawada'), ('[email protected]', 'Jamshedpur')]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With