How to apply "first" and "last" functions to columns while using group by in pandas?

I have a data frame and I would like to group it by a particular column (or, in other words, by values from a particular column). I can do it in the following way: grouped = df.groupby(['ColumnName']).

I imagine the result of this operation as a table in which some cells can contain sets of values instead of single values. To get a usual table (i.e. a table in which every cell contains only a single value) I need to indicate which function I want to use to transform the sets of values in the cells into single values.

For example I can replace sets of values by their sum, or by their minimal or maximal value. I can do it in the following way: grouped.sum() or grouped.min() and so on.

Now I want to use different functions for different columns. I figured out that I can do it in the following way: grouped.agg({'ColumnName1':sum, 'ColumnName2':min}).

However, for some reason I cannot use first. In more detail, grouped.first() works, but grouped.agg({'ColumnName1':first, 'ColumnName2':first}) does not. I get a NameError: name 'first' is not defined. So, my question is: why does this happen and how can I resolve it?

ADDED

Here I found the following example:

grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})

Maybe I also need to use np? But in my case Python does not recognize "np". Should I import it?

793

asked Feb 21 '13 11:02

Roman

1 Answer

I think the issue is that there are two different first methods which share a name but act differently: one is for groupby objects and the other is for a Series/DataFrame (to do with timeseries).
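As a side note on the NameError itself, here is a minimal sketch (plain Python, no pandas needed) of why sum and min can be passed bare while first cannot: sum and min are Python builtins, whereas first exists only as a method on pandas objects, so the bare name is never defined. The variable name message is just for illustration.

```python
import builtins

# sum and min are builtin functions, so the bare names resolve.
print(hasattr(builtins, "sum"), hasattr(builtins, "min"))  # True True

# There is no builtin called first, so the bare name does not resolve.
print(hasattr(builtins, "first"))  # False

try:
    first  # evaluating the bare name is enough to trigger the error
except NameError as exc:
    message = str(exc)

print(message)  # name 'first' is not defined
```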

To replicate the behaviour of the groupby first method over a DataFrame using agg you could use iloc[0] (which gets the first row in each group (DataFrame/Series) by index):

grouped.agg(lambda x: x.iloc[0])

For example:

In [1]: df = pd.DataFrame([[1, 2], [3, 4]])

In [2]: g = df.groupby(0)

In [3]: g.first()
Out[3]:
   1
0
1  2
3  4

In [4]: g.agg(lambda x: x.iloc[0])
Out[4]:
   1
0
1  2
3  4

Analogously you can replicate last using iloc[-1].

Note: This also works column-wise:

g.agg({1: lambda x: x.iloc[0]})
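To make the column-wise form concrete, here is a small sketch that takes the first row of one column and the last row of another in a single agg call. The frame and the column names key, x and y are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups with two rows each.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})
g = df.groupby("key")

# iloc[0] picks the first row of each group, iloc[-1] the last.
result = g.agg({
    "x": lambda s: s.iloc[0],
    "y": lambda s: s.iloc[-1],
})
print(result)
```

Each lambda receives the column values of one group as a Series, so any positional indexing works the same way.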

In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.

A couple of updated notes:

This is better done using the nth groupby method, which is much faster in pandas >= 0.13:

g.nth(0)  # first
g.nth(-1)  # last

You have to take a little care here, as by default first and last skip NaN values, and IIRC this was broken for DataFrame groupbys pre-0.13. There's a dropna option for nth.
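A short sketch of the NaN difference, using made-up data in which group a starts with a missing value. It compares first() with the positional approaches. As far as I know the index of the nth result differs between pandas versions, so only its values are inspected here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row of group "a" is NaN.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [np.nan, 1.0, 2.0]})
g = df.groupby("key")["val"]

# first() returns the first non-NaN value in each group.
skips_nan = g.first()

# iloc[0] returns whatever sits in the first row, NaN included.
positional = g.agg(lambda s: s.iloc[0])

# nth(0) is positional as well; only the values are compared here.
nth_values = g.nth(0).tolist()

print(skips_nan["a"], positional["a"], nth_values)
```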

You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):

grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
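Applied to the original question, the string form might look like the sketch below, where "first" and "last" are passed per column. The data and column names are invented for illustration. One caveat, stated as my understanding rather than as part of the answer above: recent pandas releases (1.0 and later) reject a renaming dict on a single column, and keyword-style named aggregation (available since 0.25) is the replacement, so the second call uses that form.

```python
import pandas as pd

# Hypothetical data: two groups with two rows each.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})
grouped = df.groupby("key")

# Strings are looked up as groupby methods, so no bare name is needed.
per_column = grouped.agg({"x": "first", "y": "last"})

# Named aggregation: output_name="function" on a single column.
renamed = grouped["x"].agg(result1="sum", result2="mean")

print(per_column)
print(renamed)
```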

114

answered Sep 18 '22 13:09

Andy Hayden

Tags: python, pandas, group-by
