Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to sum the nlargest() integers in groupby [duplicate]

I have a dataframe like this:

    Index STNAME COUNTY COUNTY_POP
      0     AL     0       100
      1     AL     1       150
      2     AL     3       200
      3     AL     5       50
    ...
     15     CA     0       300
     16     CA     1       200
     17     CA     3       250
     18     CA     4       350

I want to sum the three largest integers from COUNTY_POP for each state. So far, I have:

    In[]: df.groupby(['STNAME'])['COUNTY_POP'].nlargest(3)
    Out[]:
    Index STNAME COUNTY COUNTY_POP
      0     AL     0       100
      1     AL     1       150
      2     AL     3       200
    ...
     15     CA     0       300
     17     CA     3       250
     18     CA     4       350

However when I add the .sum() operation to the above code, I receive the following output.

    In[]: df.groupby(['STNAME'])['COUNTY_POP'].nlargest(3).sum()
    Out[]:
    1350

I'm relatively new to Python and Pandas. If anyone could explain what causes this and how to correct it, I'd really appreciate it!

like image 903
R7L208 Avatar asked Nov 09 '16 22:11

R7L208


2 Answers

Is that what you want?

In [25]: df.groupby('STNAME')['COUNTY_POP'].agg(lambda x: x.nlargest(3).sum())
Out[25]:
STNAME
AL    450
CA    900
Name: COUNTY_POP, dtype: int64
like image 64
MaxU - stop WAR against UA Avatar answered Nov 07 '22 22:11

MaxU - stop WAR against UA


presort and slice... a tad faster

df.sort_values('COUNTY_POP').groupby('STNAME').COUNTY_POP \
    .apply(lambda x: x.values[-3:].sum())

STNAME
AL    450
CA    900
Name: COUNTY_POP, dtype: int64
like image 30
piRSquared Avatar answered Nov 07 '22 22:11

piRSquared