I am trying to understand what the equivalent of this simple SQL statement would be:
select mykey, sum(Field1) as sum_of_field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey
I understand I can passa a dictionary to the agg() function:
  f = {'Field1':'sum',
         'Field2':['max','mean'],
         'Field3':['min','mean','count'],
         'Field4':'count'
         }
    grouped = df.groupby('mykey').agg(f)
However, the resulting column names seem to be chosen by pandas automatically: ('Field1','sum') etc.
Is there a way to pass strings for column names, so that the field is not ('Field1','sum') but something I can choose, like sum_of_field1 ? 
Thanks. I looked at the docs here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but couldn't quite find an answer.
To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I'd recommend flattening this after aggregating by renaming the new columns.
groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.
Using apply() method If you need to apply a method over an existing column in order to compute some values that will eventually be added as a new column in the existing DataFrame, then pandas. DataFrame. apply() method should do the trick.
As of pandas 0.25, this is possible with a "Named aggregation".
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ....:                         'height': [9.1, 6.0, 9.5, 34.0],
   ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ....: 
In [80]: animals
Out[80]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
In [82]: animals.groupby("kind").agg(
   ....:     min_height=('height', 'min'),
   ....:     max_height=('height', 'max'),
   ....:     average_weight=('weight', np.mean),
   ....: )
   ....: 
Out[82]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
The previously deprecated version follows:
You can pass a dictionary of dictionaries to .agg mapping {column: {name: aggfunc}}, for example
In [46]: df.head()
Out[46]:
   Year  qtr  realgdp  realcons  realinvs  realgovt  realdpi  cpi_u      M1  \
0  1950    1   1610.5    1058.9     198.1     361.0   1186.1   70.6  110.20
1  1950    2   1658.8    1075.9     220.4     366.4   1178.1   71.4  111.75
2  1950    3   1723.0    1131.0     239.7     359.6   1196.5   73.2  112.95
3  1950    4   1753.9    1097.6     271.8     382.5   1210.0   74.9  113.93
4  1951    1   1773.5    1122.8     242.9     421.9   1207.9   77.3  115.08
   tbilrate  unemp      pop     infl  realint
0      1.12    6.4  149.461   0.0000   0.0000
1      1.17    5.6  150.260   4.5071  -3.3404
2      1.23    4.6  151.064   9.9590  -8.7290
3      1.35    4.2  151.871   9.1834  -7.8301
4      1.40    3.5  152.393  12.6160 -11.2160
In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})
Out[47]:
         realgdp                   unemp
        mean_gdp      std_gdp mean_unemp
qtr
1    4506.439216  2104.195963   5.694118
2    4546.043137  2121.824090   5.686275
3    4580.507843  2132.897955   5.662745
4    4617.592157  2158.132698   5.654902
The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0).
I agree this is a bit frustrating butI do find chaining with a rename method served my purpose. Also, when it gets really complex, I will just reset the column names. It is a MultiIndex so it is immutable, and you should feel comfortable dealing with levels. 
Based on the pandas documentation
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
   ....:              .rename(columns={'sum': 'foo',
   ....:                               'mean': 'bar',
   ....:                               'std': 'baz'})
   ....: )
   ....: 
Out[67]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
When there are multiples uses of one function and you want to name it differently, this question of dropping the level and joining the different levels by underscore will help.
If you do find the sql syntax cleaner, there is a library called pandasql that give you this flexibility.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With