I am working with this data set census.csv
Code:
df = pd.read_csv('Data/census.csv')
df = df[df['SUMLEV']==50]
print(df.head())
This is how my data looks like:
SUMLEV REGION DIVISION ... RNETMIG2013 RNETMIG2014 RNETMIG2015
1 50 3 6 ... -2.722002 2.592270 -2.187333
2 50 3 6 ... 22.727626 20.317142 18.293499
3 50 3 6 ... -7.167664 -3.978583 -10.543299
4 50 3 6 ... -5.403729 0.754533 1.107861
5 50 3 6 ... -1.402476 -1.577232 -0.884411
I wanted to take the aggregate of two columns after grouping by 'STNAME':
(df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))
Error:
----> 3 (df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010','POPESTIMATE2011'].agg(Avg= np.average, Sum = np.sum))
f:\software_installations\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
922 elif func is None:
923 # nicer error message
--> 924 raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).")
925
926 func = _maybe_mangle_lambdas(func)
TypeError: Must provide 'func' or tuples of '(column, aggfunc).
While others were fast to give the one-liner code snippet, I try to explain a bit more about kind of options you have, and what is the syntax understood by the pandas agg()
function.
The type of the object you are dealing with is
type(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']])
# pandas.core.groupby.generic.DataFrameGroupBy
Therefore, it might be a good start to time look at the documentation of the pandas.DataFrameGroupby.agg
The argument given to agg
can be
(1) string (function name)
(2) function
(3) list of functions
(4) dict of column names -> functions (or list of functions).
If you give function name string as the argument, it must be a "function name that pandas understands". Understood function names are at least: 'sum','mean','std'
. Example:
In [24]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg('mean')
Out[24]:
POPESTIMATE2010 POPESTIMATE2011
STNAME
Alabama 71420.313433 71658.328358
Alaska 24621.413793 24921.379310
Arizona 427213.866667 431248.800000
...
Since you want two things to be calculated: mean and sum, you need two function calls. One with 'mean' and one with 'sum'.
You can also give any function as an argument. The function should take array-like data (pd.Series) as input, and produce scalar value out of it. Example:
In [25]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg(np.mean)
Out[25]:
POPESTIMATE2010 POPESTIMATE2011
STNAME
Alabama 71420.313433 71658.328358
Alaska 24621.413793 24921.379310
...
Since you want two things to be calculated: mean and sum, you need two function calls. One with np.mean and one with np.sum.
You may also give a list of functions for the argument for agg()
. Example:
In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg([np.mean, np.sum])
Out[27]:
POPESTIMATE2010 POPESTIMATE2011
mean sum mean sum
STNAME
Alabama 71420.313433 4785161 71658.328358 4801108
Alaska 24621.413793 714021 24921.379310 722720
Arizona 427213.866667 6408208 431248.800000 6468732
...
This has the upside that you need only one function call. Use this, if you want the same operations to be calculated with multiple columns
If you give a dictionary as the argument to agg()
, then the keys must represent the column names in the dataframe, and the values should be either functions or lists of functions. Example:
In [30]: In [27]: df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg({"POPESTIMATE2010": [np.mean, np.sum], "POPESTIMATE2011": [np.mean, np.sum]})
Out[30]:
POPESTIMATE2010 POPESTIMATE2011
average sum mean sum
STNAME
Alabama 71420.313433 4785161 71658.328358 4801108
Alaska 24621.413793 714021 24921.379310 722720
Arizona 427213.866667 6408208 431248.800000 6468732
...
This has the upside that you need only one function call. Use this, if you want different operations to be calculated with different columns
How about:
df.groupby('STNAME')[['POPESTIMATE2010','POPESTIMATE2011']].agg(['mean', 'sum'])
Note that you need double square bracket after groupby in this case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With