I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I'm having trouble making them work.
State    County  Population
Alabama  a       100
Alabama  b       50
Alabama  c       40
Alabama  d       5
Alabama  e       1
...
Wyoming  a.51    180
Wyoming  b.51    150
Wyoming  c.51    56
Wyoming  d.51    5
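For reproducibility, here is one way to build a toy version of this data (a sketch covering only the rows shown; the rows elided by "..." are omitted):

import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})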
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list that has each state and the population of its top 2 counties.
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge. The line I have right now is simply:

df.groupby('State')['Population'].nlargest(2)
Python's Pandas module provides easy ways to do aggregation and calculate grouped metrics. Finding the top N maximum values for each group can be done as part of the group by; the function that helps here is nlargest().
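As a standalone illustration of nlargest() (a toy Series, not the question's data):

import pandas as pd

s = pd.Series([5, 1, 9, 3])
s.nlargest(2)   # the two largest values, 9 and 5, keeping their original index labels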
The agg() method allows you to apply a function or a list of function names to be executed along one of the axes of the DataFrame; the default is 0, the index (row) axis. Note: agg() is an alias of the aggregate() method.
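A minimal sketch of what agg() accepts (toy data; the column names x and y are hypothetical):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df.agg(['sum', 'min'])              # a list of function names, applied down each column
df.agg({'x': 'sum', 'y': 'min'})    # a dict mapping each column to its function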
You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) will return a MultiIndexed Series, so you can no longer do group-level operations. In general, if you want to perform multiple operations in a group, you'll need to use apply/agg.
The resulting output:
State
Alabama    150
Wyoming    330
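For reference, the intermediate result of nlargest(2) before the sum is a Series indexed by (State, original row label). With the toy frame built above it looks roughly like this (the inner labels depend on your DataFrame's index):

df.groupby('State')['Population'].nlargest(2)
# State
# Alabama  0    100
#          1     50
# Wyoming  5    180
#          6    150
# Name: Population, dtype: int64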
EDIT
A slightly cleaner approach, as suggested by @cᴏʟᴅsᴘᴇᴇᴅ:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
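Written out explicitly, that second groupby looks like the line below. This spelling is also the one to use on current pandas, since the level keyword of sum was deprecated in pandas 1.3 and removed in 2.0:

df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()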
Using agg, the grouping logic looks like:

df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})
This results in another DataFrame object, which you could query to find the most populous states, etc.
         Population
State
Alabama         150
Wyoming         330
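On newer pandas (0.25+), named aggregation gives the same result with an explicit output column name (top2_pop is a made-up name for illustration):

df.groupby('State').agg(top2_pop=('Population', lambda x: x.nlargest(2).sum()))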