I am working with this data set census.csv
Code:
df = pd.read_csv('Data/census.csv')
df = df[df['SUMLEV']==50]
print(df.head())
This is how my data looks like:
   SUMLEV  REGION  DIVISION  ...  RNETMIG2013  RNETMIG2014 RNETMIG2015
1      50       3         6  ...    -2.722002     2.592270   -2.187333
2      50       3         6  ...    22.727626    20.317142   18.293499
3      50       3         6  ...    -7.167664    -3.978583  -10.543299
4      50       3         6  ...    -5.403729     0.754533    1.107861
5      50       3         6  ...    -1.402476    -1.577232   -0.884411
I wanted to take the aggregate of two columns after grouping by 'STNAME':
(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))
Error:
----> 3 (df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))
f:\software_installations\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
    922         elif func is None:
    923             # nicer error message
--> 924             raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).")
    925 
    926         func = _maybe_mangle_lambdas(func)
TypeError: Must provide 'func' or tuples of '(column, aggfunc).
                While others were fast to give the one-liner code snippet, I try to explain a bit more about kind of options you have, and what is the syntax understood by the pandas agg() function.
The type of the object you are dealing with is
type(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']])
# pandas.core.groupby.generic.DataFrameGroupBy
Therefore, it might be a good start to time look at the documentation of the pandas.DataFrameGroupby.agg
The argument given to agg can be
(1) string (function name)
(2) function
(3) list of functions
(4) dict of column  names -> functions (or list of functions).
If you give function name string as the argument, it must be a "function name that pandas understands". Understood function names are at least: 'sum','mean','std'. Example:
In [24]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg('mean')
Out[24]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
Arizona                 427213.866667    431248.800000
...
Since you want two things to be calculated: mean and sum, you need two function calls. One with 'mean' and one with 'sum'.
You can also give any function as an argument. The function should take array-like data (pd.Series) as input, and produce scalar value out of it. Example:
In [25]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg(np.mean)
Out[25]:
                      POPESTIMATE2010  POPESTIMATE2011
STNAME
Alabama                  71420.313433     71658.328358
Alaska                   24621.413793     24921.379310
...
Since you want two things to be calculated: mean and sum, you need two function calls. One with np.mean and one with np.sum.
You may also give a list of functions for the argument for agg(). Example:
In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg([np.mean, np.sum])
Out[27]:
                     POPESTIMATE2010           POPESTIMATE2011
                                mean       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...
This has the upside that you need only one function call. Use this, if you want the same operations to be calculated with multiple columns
If you give a dictionary as the argument to agg(), then the keys must represent the column names in the dataframe, and the values should be either functions or lists of functions. Example:
In [30]: In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({"POPESTIMATE2010": [np.mean, np.sum], "POPESTIMATE2011": [np.mean, np.sum]})
Out[30]:
                     POPESTIMATE2010           POPESTIMATE2011
                             average       sum            mean       sum
STNAME
Alabama                 71420.313433   4785161    71658.328358   4801108
Alaska                  24621.413793    714021    24921.379310    722720
Arizona                427213.866667   6408208   431248.800000   6468732
...
This has the upside that you need only one function call. Use this, if you want different operations to be calculated with different columns
How about:
df.groupby('STNAME')[['POPESTIMATE2010','POPESTIMATE2011']].agg(['mean', 'sum'])
Note that you need double square bracket after groupby in this case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With