How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp','Mt'] columns?
Example 1: the following DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the rows whose count is the max in each group, like this:
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2: this DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output is to get all the rows where count equals the max in each group, like this:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
To get the maximum value of each group, select the column of interest from the result of groupby and apply max() to it:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1 3
S3 8
S4 10
S2 7
Name: count
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
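As a quick check (a minimal sketch, not part of the original answer), the same transform-based filter applied to the data of Example 2 keeps both rows tied at count == 8 in group (MM4, S2):
import pandas as pd

# Data of Example 2 from the question
df2 = pd.DataFrame({
    'Sp':    ['MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt':    ['S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [10, 1, 2, 8, 8]
})

# Boolean mask: True where a row's count equals its group maximum
idx2 = df2.groupby(['Sp', 'Mt'])['count'].transform('max') == df2['count']
df2[idx2]   # returns bg (10), cb (8) and uyi (8)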
Update
On a hail mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 8
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the DataFrame by count and then remove the duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
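One caveat, not mentioned in the original answer: drop_duplicates keeps a single row per ['Sp','Mt'] group, so tied maxima (as in Example 2) are reduced to one row. Reusing the df2 sketch from above:
# Only one of the two tied (MM4, S2) rows with count == 8 survives
df2.sort_values('count', ascending=False).drop_duplicates(['Sp', 'Mt'])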
Applying nlargest to the groupby object works just as well. An additional advantage: it can also fetch the top n rows per group if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to get the max row per group; nlargest(n) gives the top n rows per group:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
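To fetch the top 2 rows per group instead of only the max, the same call with nlargest(2) works (a small follow-up sketch, not in the original answer):
# Keep up to the two largest 'count' rows per (sp, mt) group
df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(2, "count")).reset_index(drop=True)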
You may not need groupby() at all; instead use sort_values + drop_duplicates:
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Almost the same logic, using tail:
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
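If the helper column is not wanted in the result, it can be dropped afterwards (a small follow-up, not part of the original snippet):
# Drop the helper column once the filtering is done
df = df.drop(columns=['count_max'])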
Use the groupby and idxmax methods. (This answer uses a different example, with ad_id, price and date columns.)
Convert the date column to datetime:
df['date'] = pd.to_datetime(df['date'])
Get the index of the max of column date, after grouping by ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
Get the wanted data:
df_max = df.loc[idx]
Out[54]:
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
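Applied to the question's own data, the same idxmax pattern would look like the sketch below (not part of this answer). Note that idxmax returns a single index label per group, so tied maxima are reduced to one row:
import pandas as pd

# Example 1 data from the question
df1 = pd.DataFrame({
    'Sp':    ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt':    ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

# One index label per ('Sp', 'Mt') group, pointing at the row with the largest count
idx = df1.groupby(['Sp', 'Mt'])['count'].idxmax()
df1.loc[idx]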
For me, the easiest solution is to keep the rows where count equals its group maximum. Therefore, the following one-line command is enough:
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
Try using "nlargest" on the groupby object. The advantage of using nlargest is that it returns the index of the rows where "the nlargest item(s)" were fetched from. Note: we slice the second(1) element of our index since our index in this case consist of tuples(eg.(s1, 0)).
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max
df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension