Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby sort within groups retaining multiple aggregates

I want to apply a sort and a limit within the groups returned by groupby, as in this question. However, I have multiple aggregates and I want all of the aggregates to be retained in the result.

Here is a simple example:

products = ["A", "B", "C", "D"]
stores = ["foo", "bar", "baz"]
n = 30

product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000

df = pd.DataFrame(
    {'store': store_list, 
     'product': product_list, 
     'sales': sales_list, 
     'rating': rating_list})

df = df[['store', 'product', 'sales', 'rating']]

df[:5]

ungrouped dataframe

I want to group by store and product, and both sum and count the sales, while taking the mean of the rating.

This is straightforward:

dfg = df.groupby(['store', 'product']).agg({'sales': ['sum', 'count'], 
                                            'rating': 'mean'})

grouped

Now, I want to only keep the top two rated rows in each group. I can get this as follows (using a multi-level extension of the somewhat unintuitive [to me] incantations from 1):

g = dfg[('rating', 'mean')].groupby(
      level=0, group_keys=False).apply(
        lambda x: x.sort_values(ascending=False).head(2))
g

This returns the following Series:

store  product
bar    B          3.601135
       A          1.867449
baz    B          2.984196
       D          2.780500
foo    B          3.767912
       D          3.129346
Name: (rating, mean), dtype: float64

But I've lost the ('sales', 'sum') and ('sales', 'count') columns.

I suspect I need to extract g.index and use it in some way but have not been able to get this sorted (pun intended).

EDIT: The answers below manage to give the groups that I'm looking for, but what I'd really like is a stable sort where not only do I get the top N mean ratings within each group, but the groups themselves are sorted so that the first group has the highest rating, etc. To some extent this is just icing on the cake as I have the values I want now and would just like the report to be prettier.

like image 515
sfjac Avatar asked Jan 23 '18 02:01

sfjac


1 Answers

I've sorted it. Instead of indexing the grouped table and doing the subsequent groupby and sort_values as above, I needed to apply the sort_values to the un-indexed DataFrame, specifying the column to sort on explicitly:

g = dfg.groupby(level=0, group_keys=False).apply(
      lambda x: x.sort_values(('rating', 'mean'), ascending=False).head(2))

Giving me the desired result:

enter image description here

like image 119
sfjac Avatar answered Oct 09 '22 08:10

sfjac