I have a data frame and I would like to group it by a particular column (or, in other words, by values from a particular column). I can do it in the following way: grouped = df.groupby(['ColumnName']).
I imagine the result of this operation as a table in which some cells can contain sets of values instead of single values. To get a usual table (i.e. a table in which every cell contains only a single value), I need to indicate what function I want to use to transform the sets of values in the cells into single values.
For example I can replace sets of values by their sum, or by their minimal or maximal value. I can do it in the following way: grouped.sum() or grouped.min() and so on.
Now I want to use different functions for different columns. I figured out that I can do it in the following way: grouped.agg({'ColumnName1':sum, 'ColumnName2':min}).
However, for some reason I cannot use first. In more detail, grouped.first() works, but grouped.agg({'ColumnName1':first, 'ColumnName2':first}) does not. I get a NameError: name 'first' is not defined. So, my question is: why does this happen, and how can I resolve it?
ADDED
Here I found the following example:
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})
Maybe I also need to use np? But in my case Python does not recognize "np". Should I import it?
Pandas comes with a whole host of SQL-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.
A groupby operation involves three steps: (1) splitting the data into groups, (2) applying a function to each group independently, and (3) combining the results into a data structure. Of these, pandas groupby() is widely used for the split step, which is the most straightforward.
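The three steps can be spelled out by hand to see what groupby does in one call. This is only a minimal sketch: the frame and the column names "key" and "value" are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct value of "key".
groups = {name: part for name, part in df.groupby("key")}

# Apply: reduce each group's "value" column independently.
sums = {name: int(part["value"].sum()) for name, part in groups.items()}

# Combine: gather the per-group results into a single Series.
combined = pd.Series(sums, name="value")

# The usual one-liner performs all three steps at once.
direct = df.groupby("key")["value"].sum()
```

Both combined and direct should hold one sum per key; the manual version is only there to make the split, apply and combine stages visible.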
I think the issue is that there are two different first methods which share a name but act differently, one is for groupby objects and another for a Series/DataFrame (to do with timeseries).
To replicate the behaviour of the groupby first method over a DataFrame using agg, you could use iloc[0], which gets the first row of each group (DataFrame/Series) by position:
grouped.agg(lambda x: x.iloc[0])
For example:
In [1]: df = pd.DataFrame([[1, 2], [3, 4]])

In [2]: g = df.groupby(0)

In [3]: g.first()
Out[3]:
   1
0
1  2
3  4

In [4]: g.agg(lambda x: x.iloc[0])
Out[4]:
   1
0
1  2
3  4

Analogously you can replicate last using iloc[-1].
Note: This also works column-wise:
g.agg({1: lambda x: x.iloc[0]})
In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.
A couple of updated notes:
This is better done using the nth groupby method, which is much faster in pandas >= 0.13:
g.nth(0) # first
g.nth(-1) # last
You have to take a little care, as the default behaviour for first and last ignores NaN rows... and IIRC for DataFrame groupbys it was broken pre-0.13... there's a dropna option for nth.
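To make the NaN caveat concrete, here is a small sketch comparing the three approaches on a group whose first row is missing. The frame and the column names "key" and "val" are invented for the example. Note also that the index nth returns has, as far as I know, changed between pandas versions, so only the values are inspected here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" starts with a missing value.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [np.nan, 1.0, 2.0]})
g = df.groupby("key")

# first() skips NaN and returns the first non-null value per group.
first_vals = g["val"].first()

# iloc[0] takes the first row of each group as-is, NaN included.
iloc_vals = g["val"].agg(lambda x: x.iloc[0])

# nth(0) also takes the first row as-is (no NaN skipping by default).
nth_vals = g["val"].nth(0)
```

So for group "a", first() gives 1.0 while the iloc[0] and nth(0) versions give NaN; pick whichever matches what "first" should mean for your data.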
You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):
grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
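Applied to the original NameError: sum and min are Python built-ins, but there is no built-in called first, so the bare name fails, whereas the string "first" is looked up on the groupby object. The sketch below uses made-up data with the column names from the question. It also shows named aggregation (keyword arguments to agg, available since pandas 0.25), because in recent pandas versions (1.0 and later, as far as I know) passing a dict of output names on a single column raises an error instead of renaming.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "ColumnName": ["x", "x", "y"],
    "ColumnName1": [10, 20, 30],
    "ColumnName2": [1.5, 2.5, 3.5],
    "D": [1, 2, 3],
})
grouped = df.groupby("ColumnName")

# Strings name groupby methods, so no Python name `first` is required.
per_column = grouped.agg({"ColumnName1": "first", "ColumnName2": "min"})

# Named aggregation: keyword = output column, value = aggregation name.
renamed = grouped["D"].agg(result1="sum", result2="mean")
```

The string form also sidesteps the np question: no NumPy import is needed just to name an aggregation.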