
Pandas groupby nlargest sum

I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I'm having trouble making them work.

    State    County    Population
    Alabama  a         100
    Alabama  b         50
    Alabama  c         40
    Alabama  d         5
    Alabama  e         1
    ...
    Wyoming  a.51      180
    Wyoming  b.51      150
    Wyoming  c.51      56
    Wyoming  d.51      5

I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.

In the end, I'll have a list that contains each state and the combined population of its top 2 counties.

I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.

The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

user7102752 asked Nov 02 '16 22:11




2 Answers

You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum()) 

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a plain Series (indexed by state and original row label) rather than a grouped object, so you can no longer do group-level operations on it. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.

The resulting output:

    State
    Alabama    150
    Wyoming    330
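As a minimal, self-contained sketch of the apply approach: the frame below is built only from the Alabama and Wyoming rows shown in the question (the elided rows are left out), so it is illustrative sample data rather than the asker's full dataset.

```python
import pandas as pd

# Sample data: only the rows visible in the question; the elided states are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# For each state, keep the two largest county populations and add them up.
top2_sum = df.groupby("State")["Population"].apply(lambda grp: grp.nlargest(2).sum())
print(top2_sum)
```

The result is a Series indexed by State, so an individual state can be looked up with top2_sum["Alabama"].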

EDIT

A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:

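df.groupby('State')['Population'].nlargest(2).sum(level=0)

Note that the level argument of sum was deprecated in pandas 1.3 and removed in pandas 2.0. On those versions, the equivalent is:

df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()

The sum(level=0) approach is slightly slower than using apply on larger DataFrames, though.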

Using the following setup:

    import numpy as np
    import pandas as pd
    from string import ascii_letters

    n = 10**6
    df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                       'B': np.random.randint(10**7, size=n)})

I get the following timings:

    In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
    103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
    147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
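For readers on a recent pandas release, where sum no longer accepts a level argument, the two-step variant can be sketched with an explicit second groupby on the first index level. The sample frame is an assumption built from the rows visible in the question, and the timings above were not re-measured for this form.

```python
import pandas as pd

# Sample data: only the rows visible in the question; the elided states are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest on the grouped column returns a Series with a two-level index:
# level 0 is the State, level 1 is the original row label.
top2 = df.groupby("State")["Population"].nlargest(2)

# Regroup on the State level and sum the two values kept for each state.
top2_sum = top2.groupby(level=0).sum()
print(top2_sum)
```

This makes the second groupby explicit, which is the step the answer suspects is responsible for the extra cost.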

root answered Oct 04 '22 06:10



Using agg, the grouping logic looks like:

df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

This results in another DataFrame, which you could query to find the most populous states, etc.

             Population
    State
    Alabama         150
    Wyoming         330
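A short sketch of the agg form, followed by one possible way to query the result for the most populous states. The sample frame is an assumption built from the rows visible in the question, and sorting is just one illustrative kind of query.

```python
import pandas as pd

# Sample data: only the rows visible in the question; the elided states are omitted.
df = pd.DataFrame({
    "State": ["Alabama"] * 5 + ["Wyoming"] * 4,
    "County": ["a", "b", "c", "d", "e", "a.51", "b.51", "c.51", "d.51"],
    "Population": [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# Map the column name to a function that sums the two largest values per group.
result = df.groupby("State").agg({"Population": lambda x: x.nlargest(2).sum()})

# The result is a DataFrame indexed by State, so it can be sorted or filtered further.
ranked = result.sort_values("Population", ascending=False)
print(ranked)
```

Because agg returns a DataFrame here, rather than the Series produced by the apply form, the same call can be extended with more columns in the dictionary if other per-state aggregates are needed.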
aquaraga answered Oct 04 '22 07:10
