I want to apply a sort and a limit within the groups returned by groupby
, as in this question. However, I have multiple aggregates and I want all of the aggregates to be retained in the result.
Here is a simple example:
products = ["A", "B", "C", "D"]
stores = ["foo", "bar", "baz"]
n = 30
product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000
df = pd.DataFrame(
{'store': store_list,
'product': product_list,
'sales': sales_list,
'rating': rating_list})
df = df[['store', 'product', 'sales', 'rating']]
df[:5]
I want to group by store
and product
, and both sum
and count
the sales
, while taking the mean
of the rating
.
This is straightforward:
dfg = df.groupby(['store', 'product']).agg({'sales': ['sum', 'count'],
'rating': 'mean'})
Now, I want to only keep the top two rated rows in each group. I can get this as follows (using a multi-level extension of the somewhat unintuitive [to me] incantations from 1):
g = dfg[('rating', 'mean')].groupby(
level=0, group_keys=False).apply(
lambda x: x.sort_values(ascending=False).head(2))
g
This returns the following Series
:
store product
bar B 3.601135
A 1.867449
baz B 2.984196
D 2.780500
foo B 3.767912
D 3.129346
Name: (rating, mean), dtype: float64
But I've lost the ('sales', 'sum')
and ('sales', 'count')
columns.
I suspect I need to extract g.index
and use it in some way but have not been able to get this sorted (pun intended).
EDIT: The answers below manage to give the groups that I'm looking for, but what I'd really like is a stable sort where not only do I get the top N mean ratings within each group, but the groups themselves are sorted so that the first group has the highest rating, etc. To some extent this is just icing on the cake as I have the values I want now and would just like the report to be prettier.
I've sorted it. Instead of indexing the grouped table and doing the subsequent groupby
and sort_values
as above, I needed to apply the sort_values
to the un-indexed DataFrame
, specifying the column to sort on explicitly:
g = dfg.groupby(level=0, group_keys=False).apply(
lambda x: x.sort_values(('rating', 'mean'), ascending=False).head(2))
Giving me the desired result:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With