Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Aggregate of DataFrameGroupby

I am working with this data set census.csv

Code:

df = pd.read_csv('Data/census.csv')
df = df[df['SUMLEV']==50]

print(df.head())

This is how my data looks like:

   SUMLEV  REGION  DIVISION  ...  RNETMIG2013  RNETMIG2014 RNETMIG2015
1      50       3         6  ...    -2.722002     2.592270   -2.187333
2      50       3         6  ...    22.727626    20.317142   18.293499
3      50       3         6  ...    -7.167664    -3.978583  -10.543299
4      50       3         6  ...    -5.403729     0.754533    1.107861
5      50       3         6  ...    -1.402476    -1.577232   -0.884411

I wanted to take the aggregate of two columns after grouping by 'STNAME':

(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))

Error:

----> 3 (df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))

f:\software_installations\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
    922         elif func is None:
    923             # nicer error message
--> 924             raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).")
    925 
    926         func = _maybe_mangle_lambdas(func)

TypeError: Must provide 'func' or tuples of '(column, aggfunc).
like image 593
DevPy Avatar asked Dec 30 '22 23:12

DevPy


2 Answers

While others were fast to give the one-liner code snippet, I try to explain a bit more about kind of options you have, and what is the syntax understood by the pandas agg() function.

What are we dealing with?

The type of the object you are dealing with is

type(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']])
# pandas.core.groupby.generic.DataFrameGroupBy

Therefore, it might be a good start to time look at the documentation of the pandas.DataFrameGroupby.agg

What do the docs say?

The argument given to agg can be

(1) string (function name)
(2) function
(3) list of functions
(4) dict of column  names -> functions (or list of functions).

(1) string (function name)

If you give function name string as the argument, it must be a "function name that pandas understands". Understood function names are at least: 'sum','mean','std'. Example:

In [24]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg('mean')
Out[24]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
Arizona                 427213.866667    431248.800000
...

Since you want two things to be calculated: mean and sum, you need two function calls. One with 'mean' and one with 'sum'.

(2) function

You can also give any function as an argument. The function should take array-like data (pd.Series) as input, and produce scalar value out of it. Example:

In [25]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg(np.mean)
Out[25]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
...

Since you want two things to be calculated: mean and sum, you need two function calls. One with np.mean and one with np.sum.

(3) list of functions

You may also give a list of functions for the argument for agg(). Example:

In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg([np.mean, np.sum])
Out[27]:
                     POPESTIMATE2010           POPESTIMATE2011
                                mean       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...

This has the upside that you need only one function call. Use this, if you want the same operations to be calculated with multiple columns

(4) Dict to column names -> functions

If you give a dictionary as the argument to agg(), then the keys must represent the column names in the dataframe, and the values should be either functions or lists of functions. Example:

In [30]: In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({"POPESTIMATE2010": [np.mean, np.sum], "POPESTIMATE2011": [np.mean, np.sum]})
Out[30]:
                     POPESTIMATE2010           POPESTIMATE2011
                             average       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...

This has the upside that you need only one function call. Use this, if you want different operations to be calculated with different columns

like image 143
np8 Avatar answered Jan 06 '23 01:01

np8


How about:

df.groupby('STNAME')[['POPESTIMATE2010','POPESTIMATE2011']].agg(['mean', 'sum'])

Note that you need double square bracket after groupby in this case.

like image 41
Quang Hoang Avatar answered Jan 06 '23 02:01

Quang Hoang