How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp','Mt'] columns?
Example 1: the following DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the rows whose count is the max in each group, like this:
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2: this DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output is to get all the rows where count equals the max in each group, like this:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
To get the maximum value of each group, select the column of interest from the result of groupby and apply max() to it:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1 3
S3 8
S4 10
S2 7
Name: count
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
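As a quick check (a minimal sketch, not part of the original answer), the same transform-based filter applied to the data of Example 2 keeps both rows tied at count == 8 in group (MM4, S2):
import pandas as pd

# Data of Example 2 from the question
df2 = pd.DataFrame({
    'Sp':    ['MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt':    ['S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [10, 1, 2, 8, 8]
})

# Boolean mask: True where a row's count equals its group maximum
idx2 = df2.groupby(['Sp', 'Mt'])['count'].transform('max') == df2['count']
df2[idx2]   # returns bg (10), cb (8) and uyi (8)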
Update
On a hail mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 8
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the DataFrame by count and then remove the duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
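One caveat, not mentioned in the original answer: drop_duplicates keeps a single row per ['Sp','Mt'] group, so tied maxima (as in Example 2) are reduced to one row. Reusing the df2 sketch from above:
# Only one of the two tied (MM4, S2) rows with count == 8 survives
df2.sort_values('count', ascending=False).drop_duplicates(['Sp', 'Mt'])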
Applying nlargest to the groupby object works just as well. An additional advantage: it can also fetch the top n rows per group if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to get the max row per group; nlargest(n) gives the top n rows per group:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
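To fetch the top 2 rows per group instead of only the max, the same call with nlargest(2) works (a small follow-up sketch, not in the original answer):
# Keep up to the two largest 'count' rows per (sp, mt) group
df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(2, "count")).reset_index(drop=True)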
You may not need groupby() at all; instead use sort_values + drop_duplicates:
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Almost the same logic, using tail:
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
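If the helper column is not wanted in the result, it can be dropped afterwards (a small follow-up, not part of the original snippet):
# Drop the helper column once the filtering is done
df = df.drop(columns=['count_max'])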
Use the groupby and idxmax methods. (This answer uses a different example, with ad_id, price and date columns.)
Convert the date column to datetime:
df['date'] = pd.to_datetime(df['date'])
Get the index of the max of column date, after grouping by ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
Get the wanted data:
df_max = df.loc[idx]
Out[54]:
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
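Applied to the question's own data, the same idxmax pattern would look like the sketch below (not part of this answer). Note that idxmax returns a single index label per group, so tied maxima are reduced to one row:
import pandas as pd

# Example 1 data from the question
df1 = pd.DataFrame({
    'Sp':    ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt':    ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

# One index label per ('Sp', 'Mt') group, pointing at the row with the largest count
idx = df1.groupby(['Sp', 'Mt'])['count'].idxmax()
df1.loc[idx]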
For me, the easiest solution is to keep the rows where count equals its group maximum. Therefore, the following one-line command is enough:
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
Try using "nlargest" on the groupby object. The advantage of using nlargest is that it returns the index of the rows where "the nlargest item(s)" were fetched from. Note: we slice the second(1) element of our index since our index in this case consist of tuples(eg.(s1, 0)).
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max
df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension