I am trying to understand what the equivalent of this simple SQL statement would be:
select mykey, sum(Field1) as sum_of_field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey
I understand I can pass a dictionary to the agg() function:
f = {'Field1':'sum',
'Field2':['max','mean'],
'Field3':['min','mean','count'],
'Field4':'count'
}
grouped = df.groupby('mykey').agg(f)
However, the resulting column names seem to be chosen by pandas automatically, e.g. ('Field1', 'sum') and so on.
Is there a way to pass strings for the column names, so that the resulting field is not ('Field1', 'sum') but something I can choose, like sum_of_field1?
Thanks. I looked at the docs here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but couldn't quite find an answer.
To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column results in a MultiIndex in the columns. Working with multi-indexed columns is a pain, so I'd recommend flattening them after aggregating by renaming the new columns, as in the sketch below.
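A minimal sketch of that flattening step, assuming a toy df and f in the spirit of the question (all names here are illustrative):

import pandas as pd

# toy frame standing in for the question's df
df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})

f = {'Field1': ['sum', 'mean'], 'Field2': ['min']}
grouped = df.groupby('mykey').agg(f)

# columns are now a MultiIndex like ('Field1', 'sum'); join the levels to flatten
grouped.columns = ['_'.join(col) for col in grouped.columns]

# then rename to whatever you prefer, e.g. the SQL-style names from the question
grouped = grouped.rename(columns={'Field1_sum': 'sum_of_field1',
                                  'Field1_mean': 'avg_field1',
                                  'Field2_min': 'min_field2'})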
As of pandas 0.25, this is possible with a "Named aggregation".
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
....: 'height': [9.1, 6.0, 9.5, 34.0],
....: 'weight': [7.9, 7.5, 9.9, 198.0]})
....:
In [80]: animals
Out[80]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [82]: animals.groupby("kind").agg(
....: min_height=('height', 'min'),
....: max_height=('height', 'max'),
....: average_weight=('weight', np.mean),
....: )
....:
Out[82]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
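Applied to the query in the question, a sketch (assuming df has the mykey, Field1 and Field2 columns described there) would be:

import pandas as pd

# stand-in for the question's df; column names follow the SQL in the question
df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1.0, 2.0, 3.0],
                   'Field2': [10.0, 20.0, 30.0]})

# named aggregation (pandas >= 0.25): the keyword is the output column name,
# the value is a (column, aggfunc) tuple
result = df.groupby('mykey').agg(
    sum_of_field1=('Field1', 'sum'),
    avg_field1=('Field1', 'mean'),
    min_field2=('Field2', 'min'),
)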
The previously deprecated version follows: you can pass a dictionary of dictionaries to .agg, mapping {column: {name: aggfunc}} (this nested renaming was deprecated in pandas 0.20 and removed in pandas 1.0), for example
In [46]: df.head()
Out[46]:
Year qtr realgdp realcons realinvs realgovt realdpi cpi_u M1 \
0 1950 1 1610.5 1058.9 198.1 361.0 1186.1 70.6 110.20
1 1950 2 1658.8 1075.9 220.4 366.4 1178.1 71.4 111.75
2 1950 3 1723.0 1131.0 239.7 359.6 1196.5 73.2 112.95
3 1950 4 1753.9 1097.6 271.8 382.5 1210.0 74.9 113.93
4 1951 1 1773.5 1122.8 242.9 421.9 1207.9 77.3 115.08
tbilrate unemp pop infl realint
0 1.12 6.4 149.461 0.0000 0.0000
1 1.17 5.6 150.260 4.5071 -3.3404
2 1.23 4.6 151.064 9.9590 -8.7290
3 1.35 4.2 151.871 9.1834 -7.8301
4 1.40 3.5 152.393 12.6160 -11.2160
In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
"unemp": {"mean_unemp": "mean"}})
Out[47]:
realgdp unemp
mean_gdp std_gdp mean_unemp
qtr
1 4506.439216 2104.195963 5.694118
2 4546.043137 2121.824090 5.686275
3 4580.507843 2132.897955 5.662745
4 4617.592157 2158.132698 5.654902
The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0).
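A minimal sketch of that droplevel step (using a list of aggregation functions, since the dict-of-dicts form above no longer runs on recent pandas; the data is illustrative):

import pandas as pd

df = pd.DataFrame({'qtr': [1, 1, 2], 'realgdp': [1610.5, 1658.8, 1723.0]})
res = df.groupby('qtr').agg({'realgdp': ['mean', 'std']})

# drop the outer level ('realgdp') so only the chosen inner names remain
res.columns = res.columns.droplevel(0)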
I agree this is a bit frustrating, but I found that chaining with a rename method served my purpose. Also, when it gets really complex, I just reset the column names outright, as sketched below. The result's columns are a MultiIndex, which is immutable, so you should get comfortable dealing with levels.
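A sketch of what I mean by resetting the column names, using an illustrative aggregation like the one above:

import pandas as pd

df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})
grouped = df.groupby('mykey').agg({'Field1': ['sum', 'mean'], 'Field2': ['min']})

# the MultiIndex itself is immutable, but the whole .columns attribute can be
# replaced with a flat list of names of your choosing (order must match)
grouped.columns = ['sum_of_field1', 'avg_field1', 'min_field2']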
Based on the pandas documentation, the resulting aggregations are named for the functions themselves. If you need to rename them, you can add a chained rename operation for a Series, like this:
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[67]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
When the same function is used multiple times and you want to name each result differently, the approach from this related question of dropping the outer level and joining the remaining levels with an underscore will help.
If you do find the SQL syntax cleaner, there is a library called pandasql that gives you this flexibility, as in the sketch below.
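A sketch of how the question's query looks with pandasql (assuming the library is installed; the toy df mirrors the question):

import pandas as pd
from pandasql import sqldf  # pip install pandasql

df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})

query = """
select mykey,
       sum(Field1) as sum_of_field1,
       avg(Field1) as avg_field1,
       min(Field2) as min_field2
from df
group by mykey
"""

# sqldf runs the SQL against any DataFrames found in the given namespace
result = sqldf(query, locals())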