Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group "remaining" results beyond Top N into "Others" with pandas

Tags:

python

pandas

When group a pandas dataframe by one column say "version" and which has 10 distinct versions. How can one plot the Top 3 (which cover over 90%) and put the small remainders into one "Other"-Bucket.

data = array([
              ('Top1', 14),
              ('Top1', 3),
              ('Top1', 2),
              ('Top2', 6),
              ('Top2', 7),
              ('Other1', 1),
              ('Other2', 2),
         ], 
      dtype=[('Version', 'S10'),('Value', '<i4')])
df = DataFrame.from_records(data)
df.groupby('Version').sum()

This returns:

Value
Version 
Other1   1
Other2   2
Top1     19
Top2     13

Im Looking for

Value
Version 
Others   
Top1     19
Top2     13

The version names Other* and Top* are just chosen for the example.

Of course this is possible by manually setting the category to "Other" after grouping and comparing to a threshold. I was hoping for a shortcut.

like image 933
Cilvic Avatar asked Dec 19 '22 21:12

Cilvic


2 Answers

I assume you also want the Other group to be summed, for your example to a total of 3?

If i was aiming to win the Pandas one-liner competition this would be my entry:

df.replace(df.groupby('Version').sum().sort('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()

         Value
Version       
Other        3
Top1        19
Top2        13

But that's completely unreadable, so lets break it down:

You already showed how to sum each group, sorting this result and selecting anything outside of the top 2 can be done with:

not_top2 = df.groupby('Version').sum().sort('Value', ascending=False).index[2:]

In this example not_top2 contains Other1 and Other2.

We can replace those Versions to a common name with:

dfnew  = df.replace(not_top2, 'Other')
print dfnew

  Version  Value
0    Top1     14
1    Top1      3
2    Top1      2
3    Top2      6
4    Top2      7
5   Other      1
6   Other      2

The above replaces the contents of not_top2 in any column. A little substep is needed if you expect this value to occur in any other column than Version.

Whats left is to do your original grouping again:

dfnew.groupby('Version').sum()

Which gives:

         Value
Version       
Other        3
Top1        19
Top2        13
like image 79
Rutger Kassies Avatar answered May 21 '23 09:05

Rutger Kassies


# number of top-n you want
n = 2

# group by & sort descending
df_sorted = (df
                .groupby('Version').sum()
                .sort_values('Value', ascending=False)
                .reset_index()
            )

# rename rows other than top-n to 'Others'
df_sorted.loc[df_sorted.index >= n, 'Version'] = 'Others'

# re-group by again
df_sorted.groupby('Version').sum()
like image 27
cakraww Avatar answered May 21 '23 11:05

cakraww