Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get the row(s) which have the max value in groups using groupby

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?

Example 1: the following DataFrame, which I group by ['Sp','Mt']:

   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**

Expected output is to get the result rows whose count is max in each group, like this:

0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10** 
8  MM4  S2   uyi    **7**

Example 2: this DataFrame, which I group by ['Sp','Mt']:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

Expected output is to get all the rows where count equals max in each group like this:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
7  MM4  S2   cb     8
8  MM4  S2   uyi    8
like image 352
jojo12 Avatar asked Mar 29 '13 14:03

jojo12


People also ask

How do you get the maximum values of each group in a pandas?

To get the maximum value of each group, you can directly apply the pandas max() function to the selected column(s) from the result of pandas groupby.

What does Groupby Max do?

max. Compute max of group values. Include only float, int, boolean columns.

How do I count the number of rows in each group of a Groupby object?

You can use pandas DataFrame. groupby(). count() to group columns and compute the count or size aggregate, this calculates a rows count for each group combination.


9 Answers

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

To get the indices of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
like image 180
Zelazny7 Avatar answered Sep 29 '22 10:09

Zelazny7


You can sort the dataFrame by count and then remove duplicates. I think it's easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
like image 37
Rani Avatar answered Sep 29 '22 10:09

Rani


Realizing that "applying" "nlargest" to groupby object works just as fine:

Additional advantage - also can fetch top n values if required:

In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
    ...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    ...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    ...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    ...: 'count' : [3,2,5,8,10,1,2,2,7]
    ...: })

## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
   count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi
like image 32
Surya Avatar answered Sep 29 '22 10:09

Surya


You may not need to do groupby(), but use both sort_values + drop_duplicates

df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

Also almost same logic by using tail

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
like image 30
BENY Avatar answered Oct 02 '22 10:10

BENY


Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })

df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})

df_grouped = df_grouped.reset_index()

df_grouped = df_grouped.rename(columns={'count':'count_max'})

df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

df = df[df['count'] == df['count_max']]
like image 22
landewednack Avatar answered Oct 03 '22 10:10

landewednack


Use groupby and idxmax methods:

  1. transfer col date to datetime:

    df['date']=pd.to_datetime(df['date'])
    
  2. get the index of max of column date, after groupyby ad_id:

    idx=df.groupby(by='ad_id')['date'].idxmax()
    
  3. get the wanted data:

    df_max=df.loc[idx,]
    

Out[54]:

ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22
like image 36
blueear Avatar answered Oct 03 '22 10:10

blueear


For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :

df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
like image 43
PAC Avatar answered Oct 03 '22 10:10

PAC


Summarizing, there are many ways, but which one is faster?

import pandas as pd
import numpy as np
import time

df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))

start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))

And the winner is...

  • --1 ) 0.03337574005126953 seconds ---
  • --2 ) 0.1346898078918457 seconds ---
  • --3 ) 0.10243558883666992 seconds ---
  • --3b) 0.1004343032836914 seconds ---
  • --4 ) 0.028397560119628906 seconds ---
  • --5 ) 0.07552886009216309 seconds ---
like image 26
Mauro Mascia Avatar answered Oct 03 '22 10:10

Mauro Mascia


Try using "nlargest" on the groupby object. The advantage of using nlargest is that it returns the index of the rows where "the nlargest item(s)" were fetched from. Note: we slice the second(1) element of our index since our index in this case consist of tuples(eg.(s1, 0)).

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max

df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension

enter image description here

like image 38
Kweweli Avatar answered Oct 02 '22 10:10

Kweweli