Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby category, rating, get top value from each category?

First question on SO, very new to pandas and still a little shaky on the terminology: I'm trying to figure out the proper syntax/sequence of operations on a dataframe to be able to group by column B, find the max (or min) corresponding value for each group in column C, and retrieve the corresponding value for that in column A.

Suppose this is my dataframe:

name     type      votes     
bob       dog        10
pete      cat         8
fluffy    dog         5
max       cat         9

Using df.groupby('type').votes.agg('max') returns:

dog     10
cat      9

So far, so good. However, I'd like to figure out how to return this:

dog    10    bob
cat     9    max 

I've gotten as far as df.groupby(['type', 'votes']).name.agg('max'), though that returns

dog   5    fluffy
      10   bob
cat   8    pete
      9    max

... which is fine for this pretend dataframe, but doesn't quite help when working with a much larger one.

Thanks very much!

like image 364
user3817675 Avatar asked Jun 17 '15 00:06

user3817675


1 Answers

If df has an index with no duplicate values, then you can use idxmax to return the index of the maximum row for each group. Then use df.loc to select the entire row:

In [322]: df.loc[df.groupby('type').votes.agg('idxmax')]
Out[322]: 
  name type  votes
3  max  cat      9
0  bob  dog     10

If df.index has duplicate values, i.e. is not a unique index, then make the index unique first:

df = df.reset_index()

then use idxmax:

result = df.loc[df.groupby('type').votes.agg('idxmax')]

If you really need to, you can return df to its original state:

df = df.set_index(['index'], drop=True)

but in general life is much better with a unique index.


Here is an example showing what goes wrong when df does not have a unique index. Suppose the index is AABB:

import pandas as pd
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
                   'type': ['dog', 'cat', 'dog', 'cat'],
                   'votes': [10, 8, 5, 9]}, 
                  index=list('AABB'))
print(df)
#      name type  votes
# A     bob  dog     10
# A    pete  cat      8
# B  fluffy  dog      5
# B     max  cat      9

idxmax returns the index values A and B:

print(df.groupby('type').votes.agg('idxmax'))
type
cat    B
dog    A
Name: votes, dtype: object

But A and B do not uniquely specify the desired rows. df.loc[...] returns all rows whose index value is A or B:

print(df.loc[df.groupby('type').votes.agg('idxmax')])
#      name type  votes
# B  fluffy  dog      5
# B     max  cat      9
# A     bob  dog     10
# A    pete  cat      8

In contrast, if we reset the index:

df = df.reset_index()
#   index    name type  votes
# 0     A     bob  dog     10
# 1     A    pete  cat      8
# 2     B  fluffy  dog      5
# 3     B     max  cat      9

then df.loc can be used to select the desired rows:

print(df.groupby('type').votes.agg('idxmax'))
# type
# cat    3
# dog    0
# Name: votes, dtype: int64

print(df.loc[df.groupby('type').votes.agg('idxmax')])
#   index name type  votes
# 3     B  max  cat      9
# 0     A  bob  dog     10
like image 171
unutbu Avatar answered Nov 19 '22 01:11

unutbu