I have a dataframe like
  ID_0 ID_1  ID_2
0    a    b     1
1    a    c     1
2    a    b     0
3    d    c     0
4    a    c     0
5    a    c     1
I would like to group by ['ID_0', 'ID_1'] and produce a new dataframe that contains, for each group, the sum of the ID_2 values divided by the number of rows in that group.
import numpy as np
grouped = df.groupby(['ID_0', 'ID_1'])
print(grouped.agg({'ID_2': np.sum}), "\n", grouped.size())
gives
           ID_2
ID_0 ID_1
a    b        1
     c        2
d    c        0
ID_0  ID_1
a     b       2
      c       3
d     c       1
dtype: int64
How can I get the new dataframe with the np.sum values divided by the size() values?
Use DataFrame.groupby().sum() to group rows by one or more columns and sum the values in each group. groupby() returns a DataFrameGroupBy object, whose sum() method computes the per-group sum of each column (or of a single selected column).
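As a minimal sketch of that call, here is the per-group sum on the data from the question. The variable name `sums` is only illustrative and does not appear in the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'a', 'd', 'a', 'a'],
    'ID_1': ['b', 'c', 'b', 'c', 'c', 'c'],
    'ID_2': [1, 1, 0, 0, 0, 1],
})

# Select ID_2 and sum it within each (ID_0, ID_1) group.
# The result is a Series indexed by the two grouping columns.
sums = df.groupby(['ID_0', 'ID_1'])['ID_2'].sum()
print(sums)
```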
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
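The three steps can be sketched roughly as follows on the question's data. The names `grouped`, `summary` and `result` are chosen here for illustration, and passing the strings 'sum' and 'size' to agg() is just one of several ways to write the apply step.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'a', 'd', 'a', 'a'],
    'ID_1': ['b', 'c', 'b', 'c', 'c', 'c'],
    'ID_2': [1, 1, 0, 0, 0, 1],
})

# Step 1 (split): build the groupby object; nothing is computed yet.
grouped = df.groupby(['ID_0', 'ID_1'])

# Step 2 (apply): compute two summary statistics of ID_2 per group.
summary = grouped['ID_2'].agg(['sum', 'size'])

# Step 3 (combine): turn the group keys back into ordinary columns,
# giving a flat DataFrame with one row per group.
result = summary.reset_index()
print(result)
```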
The pandas DataFrame.groupby() function is one of the most useful functions in the library: it splits the data into groups based on columns or conditions and then applies an operation to each group, e.g. size(), which counts the number of rows in each group.
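For the data in the question, size() could be used like this. It is a small sketch, and the name `sizes` is an illustrative choice.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'a', 'd', 'a', 'a'],
    'ID_1': ['b', 'c', 'b', 'c', 'c', 'c'],
    'ID_2': [1, 1, 0, 0, 0, 1],
})

# size() counts rows per group, independent of any particular column.
sizes = df.groupby(['ID_0', 'ID_1']).size()
print(sizes)
```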
The division operator (/) is the simplest way to divide one pandas column or Series by another: the first is divided element-wise by the second, with values matched up by index label.
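Applied to the question, one possible approach is to divide the per-group sums by the per-group sizes directly, since both Series share the same (ID_0, ID_1) index. The names `sums`, `sizes`, `ratio` and the output column name 'ratio' are assumptions made for this sketch.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'a', 'd', 'a', 'a'],
    'ID_1': ['b', 'c', 'b', 'c', 'c', 'c'],
    'ID_2': [1, 1, 0, 0, 0, 1],
})

grouped = df.groupby(['ID_0', 'ID_1'])
sums = grouped['ID_2'].sum()   # sum of ID_2 per group
sizes = grouped.size()         # number of rows per group

# Element-wise division; pandas aligns the two Series on their index.
ratio = sums / sizes

# Convert the Series into a flat DataFrame with a named value column.
ratio_df = ratio.reset_index(name='ratio')
print(ratio_df)
```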
Use groupby.apply instead:
df.groupby(['ID_0', 'ID_1']).apply(lambda x: x['ID_2'].sum()/len(x))
ID_0  ID_1
a     b       0.500000
      c       0.666667
d     c       0.000000
dtype: float64
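Since a sum divided by the number of rows is just the arithmetic mean, the same numbers can probably be obtained without apply, as long as ID_2 has no missing values (mean() skips NaN, whereas len(x) counts every row). This is an alternative sketch rather than part of the answer above; the names `means` and `means_df` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID_0': ['a', 'a', 'a', 'd', 'a', 'a'],
    'ID_1': ['b', 'c', 'b', 'c', 'c', 'c'],
    'ID_2': [1, 1, 0, 0, 0, 1],
})

# Per-group mean of ID_2, returned as a Series with a two-level index.
means = df.groupby(['ID_0', 'ID_1'])['ID_2'].mean()
print(means)

# as_index=False keeps the group keys as regular columns,
# so the result is a flat DataFrame.
means_df = df.groupby(['ID_0', 'ID_1'], as_index=False)['ID_2'].mean()
print(means_df)
```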