
Pandas difference between apply() and aggregate() functions

Tags: python, pandas

Is there any difference in the type of the return value between DataFrame.aggregate() and DataFrame.apply() if I just pass a function like

func=lambda x: x**2

The return values seem to be pretty much the same, and the documentation only says:

apply() --> applied : Series or DataFrame

aggregate() --> aggregated : DataFrame

asked Jul 01 '17 19:07 by 2Obe


1 Answer

There are two versions of agg (short for aggregate) and apply: one pair is defined on groupby objects and the other is defined on DataFrames.

If you consider groupby.agg and groupby.apply, the main difference is that apply is more flexible. From the docs:

Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases.

Note: apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it. Depending on the path taken and exactly what you are grouping, the grouped column(s) may be included in the output and may also be set as the index.

For an illustration of how the return type changes automatically, see "Python Pandas : How to return grouped lists in a column as a dict".
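As a rough illustration of that flexibility, the sketch below passes two different lambdas to the same groupby.apply call. The frame and its column names (key, val) are made up for this example, and the exact index of the transformed result can vary between pandas versions, so only the values are relied on here.

```python
import pandas as pd

# Hypothetical toy data, not from the original question.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")["val"]

# Reducer: the function returns one scalar per group,
# so the result has one entry per group key.
reduced = grouped.apply(lambda s: s.sum())

# Transformer: the function returns a Series as long as the group,
# so the result has one entry per original row.
transformed = grouped.apply(lambda s: s - s.mean())

print(reduced)
print(transformed)
```

The same method call yields a per-group summary in the first case and a per-row result in the second, which is the inference the quoted docs describe.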

groupby.agg, on the other hand, is very good for applying Cython-optimized functions (e.g. it can calculate 'sum', 'mean', 'std' etc. very fast). It also allows calculating multiple different functions on different columns. For example,

df.groupby('some_column').agg({'first_column': ['mean', 'std'],
                               'second_column': ['sum', 'sem']})

calculates, for each group, the mean and the standard deviation of first_column and the sum and the standard error of the mean of second_column. See "dplyr summarize equivalent in pandas" for more examples.
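To make the call above runnable, here is a minimal sketch on a small invented frame. Only the column names come from the snippet; the values are assumptions chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data; only the column names match the snippet above.
df = pd.DataFrame({
    "some_column": ["x", "x", "y", "y"],
    "first_column": [1.0, 3.0, 5.0, 9.0],
    "second_column": [10, 20, 30, 50],
})

# Different lists of aggregations for different columns.
summary = df.groupby("some_column").agg({
    "first_column": ["mean", "std"],
    "second_column": ["sum", "sem"],
})

# The result has one row per group and a two-level column index
# of (column name, function name).
print(summary)
```

Individual cells can then be read with a tuple key, for example summary.loc["x", ("first_column", "mean")].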

These differences are also summarized in "What is the difference between pandas agg and apply function?", but that question focuses on the differences between groupby.agg and groupby.apply.

DataFrame.agg is new in version 0.20. Before that, applying multiple different functions to different columns was only possible with groupby objects. Now you can summarize a DataFrame by calculating multiple different functions on its columns. Example from "Is there a pandas equivalent of dplyr::summarise?":

iris.agg({'sepal_width': 'min', 'petal_width': 'max'})

petal_width    2.5
sepal_width    2.0
dtype: float64

iris.agg({'sepal_width': ['min', 'median'], 'sepal_length': ['min', 'mean']})

        sepal_length  sepal_width
mean        5.843333          NaN
median           NaN          3.0
min         4.300000          2.0

This is not possible with DataFrame.apply, which goes either column by column or row by row and executes the same function on each column/row. For a single function like lambda x: x**2 the two methods produce the same result, but their intended usage is very different.
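A small check of that last point, as a sketch on an invented two-column frame (the names a and b are assumptions): with the lambda from the question, both methods return a DataFrame with the same values, while a per-column dict works only with agg.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
func = lambda x: x**2

applied = df.apply(func)     # func is called on each column
aggregated = df.agg(func)    # a plain callable is handled the same way here

# Both are DataFrames holding the squared values.
same = applied.equals(aggregated)
print(type(applied).__name__, type(aggregated).__name__, same)

# Different functions per column: supported by agg through a dict.
per_column = df.agg({"a": "min", "b": "max"})
print(per_column)
```

So for an elementwise function the return type does not differ; the difference shows up only once you ask for reductions or per-column functions.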

answered Sep 20 '22 21:09 by ayhan