I am trying to understand what the equivalent of this simple SQL statement would be:
select mykey, sum(Field1) as sum_of_field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey
I understand I can pass a dictionary to the agg() function:
f = {'Field1':'sum',
'Field2':['max','mean'],
'Field3':['min','mean','count'],
'Field4':'count'
}
grouped = df.groupby('mykey').agg(f)
However, the resulting column names seem to be chosen by pandas automatically, e.g. ('Field1', 'sum') and so on.
Is there a way to pass strings for the column names, so that the resulting field is not ('Field1', 'sum') but something I can choose, like sum_of_field1?
Thanks. I looked at the docs here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but couldn't quite find an answer.
To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column results in a MultiIndex in the columns. Working with multi-indexed columns is a pain, so I'd recommend flattening them after aggregating by renaming the new columns, as in the sketch below.
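A minimal sketch of that flattening step, assuming a toy df and f in the spirit of the question (all names here are illustrative):

import pandas as pd

# toy frame standing in for the question's df
df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})

f = {'Field1': ['sum', 'mean'], 'Field2': ['min']}
grouped = df.groupby('mykey').agg(f)

# columns are now a MultiIndex like ('Field1', 'sum'); join the levels to flatten
grouped.columns = ['_'.join(col) for col in grouped.columns]

# then rename to whatever you prefer, e.g. the SQL-style names from the question
grouped = grouped.rename(columns={'Field1_sum': 'sum_of_field1',
                                  'Field1_mean': 'avg_field1',
                                  'Field2_min': 'min_field2'})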
As of pandas 0.25, this is possible with a "Named aggregation".
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
....: 'height': [9.1, 6.0, 9.5, 34.0],
....: 'weight': [7.9, 7.5, 9.9, 198.0]})
....:
In [80]: animals
Out[80]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [82]: animals.groupby("kind").agg(
....: min_height=('height', 'min'),
....: max_height=('height', 'max'),
....: average_weight=('weight', np.mean),
....: )
....:
Out[82]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
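Applied to the query in the question, a sketch (assuming df has the mykey, Field1 and Field2 columns described there) would be:

import pandas as pd

# stand-in for the question's df; column names follow the SQL in the question
df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1.0, 2.0, 3.0],
                   'Field2': [10.0, 20.0, 30.0]})

# named aggregation (pandas >= 0.25): the keyword is the output column name,
# the value is a (column, aggfunc) tuple
result = df.groupby('mykey').agg(
    sum_of_field1=('Field1', 'sum'),
    avg_field1=('Field1', 'mean'),
    min_field2=('Field2', 'min'),
)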
The previously deprecated version follows: you can pass a dictionary of dictionaries to .agg, mapping {column: {name: aggfunc}} (this nested renaming was deprecated in pandas 0.20 and removed in pandas 1.0), for example
In [46]: df.head()
Out[46]:
Year qtr realgdp realcons realinvs realgovt realdpi cpi_u M1 \
0 1950 1 1610.5 1058.9 198.1 361.0 1186.1 70.6 110.20
1 1950 2 1658.8 1075.9 220.4 366.4 1178.1 71.4 111.75
2 1950 3 1723.0 1131.0 239.7 359.6 1196.5 73.2 112.95
3 1950 4 1753.9 1097.6 271.8 382.5 1210.0 74.9 113.93
4 1951 1 1773.5 1122.8 242.9 421.9 1207.9 77.3 115.08
tbilrate unemp pop infl realint
0 1.12 6.4 149.461 0.0000 0.0000
1 1.17 5.6 150.260 4.5071 -3.3404
2 1.23 4.6 151.064 9.9590 -8.7290
3 1.35 4.2 151.871 9.1834 -7.8301
4 1.40 3.5 152.393 12.6160 -11.2160
In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
"unemp": {"mean_unemp": "mean"}})
Out[47]:
realgdp unemp
mean_gdp std_gdp mean_unemp
qtr
1 4506.439216 2104.195963 5.694118
2 4546.043137 2121.824090 5.686275
3 4580.507843 2132.897955 5.662745
4 4617.592157 2158.132698 5.654902
The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0).
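A minimal sketch of that droplevel step (using a list of aggregation functions, since the dict-of-dicts form above no longer runs on recent pandas; the data is illustrative):

import pandas as pd

df = pd.DataFrame({'qtr': [1, 1, 2], 'realgdp': [1610.5, 1658.8, 1723.0]})
res = df.groupby('qtr').agg({'realgdp': ['mean', 'std']})

# drop the outer level ('realgdp') so only the chosen inner names remain
res.columns = res.columns.droplevel(0)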
I agree this is a bit frustrating, but I found that chaining with a rename method served my purpose. Also, when it gets really complex, I just reset the column names outright, as sketched below. The result's columns are a MultiIndex, which is immutable, so you should get comfortable dealing with levels.
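A sketch of what I mean by resetting the column names, using an illustrative aggregation like the one above:

import pandas as pd

df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})
grouped = df.groupby('mykey').agg({'Field1': ['sum', 'mean'], 'Field2': ['min']})

# the MultiIndex itself is immutable, but the whole .columns attribute can be
# replaced with a flat list of names of your choosing (order must match)
grouped.columns = ['sum_of_field1', 'avg_field1', 'min_field2']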
Based on the pandas documentation, the resulting aggregations are named for the functions themselves. If you need to rename them, you can add a chained rename operation for a Series, like this:
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[67]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
When the same function is used multiple times and you want to name each result differently, the approach from this related question of dropping the outer level and joining the remaining levels with an underscore will help.
If you do find the SQL syntax cleaner, there is a library called pandasql that gives you this flexibility, as in the sketch below.
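A sketch of how the question's query looks with pandasql (assuming the library is installed; the toy df mirrors the question):

import pandas as pd
from pandasql import sqldf  # pip install pandasql

df = pd.DataFrame({'mykey': ['a', 'a', 'b'],
                   'Field1': [1, 2, 3],
                   'Field2': [10, 20, 30]})

query = """
select mykey,
       sum(Field1) as sum_of_field1,
       avg(Field1) as avg_field1,
       min(Field2) as min_field2
from df
group by mykey
"""

# sqldf runs the SQL against any DataFrames found in the given namespace
result = sqldf(query, locals())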