In Python, I have a pandas DataFrame similar to the following:
Item | shop1 | shop2 | shop3 | Category ------------------------------------ Shoes| 45 | 50 | 53 | Clothes TV | 200 | 300 | 250 | Technology Book | 20 | 17 | 21 | Books phone| 300 | 350 | 400 | Technology
Where shop1, shop2 and shop3 are the costs of every item in different shops. Now, I need to return a DataFrame, after some data cleaning, like this one:
Category (index)| size| sum| mean | std ----------------------------------------
where size is the number of items in each Category and sum, mean and std are related to the same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply,...) ?
Can someone help me out? I'm going crazy with this one...thank you!
What is the difference between the pivot_table and the groupby? The groupby method is generally enough for two-dimensional operations, but pivot_table is used for multi-dimensional grouping operations.
How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.
Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.
When as_index=True the key(s) you use in groupby() will become an index in the new dataframe. The benefits you get when you set the column as index are: Speed. When you filter values based on the index column eg.
Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.
We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std') df.set_index(['Category', 'Item']).stack().groupby('Category') \ .agg(rnm_cols.keys()).rename(columns=rnm_cols) Size Sum Mean Std Category Books 3 58 19.333333 2.081666 Clothes 3 148 49.333333 4.041452 Technology 6 1800 300.000000 70.710678
option 1
use agg
← link to docs
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std') df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs) Std Sum Mean Size Category Books 2.081666 58 19.333333 3 Clothes 4.041452 148 49.333333 3 Technology 70.710678 1800 300.000000 6
option 2
more for less
use describe
← link to docs
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack() count mean std min 25% 50% 75% max Category Books 3.0 19.333333 2.081666 17.0 18.5 20.0 20.5 21.0 Clothes 3.0 49.333333 4.041452 45.0 47.5 50.0 51.5 53.0 Technology 6.0 300.000000 70.710678 200.0 262.5 300.0 337.5 400.0
df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})
Or if you want it across all shops then:
df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'}) df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With