When group a pandas dataframe by one column say "version" and which has 10 distinct versions. How can one plot the Top 3 (which cover over 90%) and put the small remainders into one "Other"-Bucket.
data = array([
('Top1', 14),
('Top1', 3),
('Top1', 2),
('Top2', 6),
('Top2', 7),
('Other1', 1),
('Other2', 2),
],
dtype=[('Version', 'S10'),('Value', '<i4')])
df = DataFrame.from_records(data)
df.groupby('Version').sum()
This returns:
Value
Version
Other1 1
Other2 2
Top1 19
Top2 13
Im Looking for
Value
Version
Others
Top1 19
Top2 13
The version names Other* and Top* are just chosen for the example.
Of course this is possible by manually setting the category to "Other" after grouping and comparing to a threshold. I was hoping for a shortcut.
I assume you also want the Other
group to be summed, for your example to a total of 3?
If i was aiming to win the Pandas one-liner competition this would be my entry:
df.replace(df.groupby('Version').sum().sort('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()
Value
Version
Other 3
Top1 19
Top2 13
But that's completely unreadable, so lets break it down:
You already showed how to sum each group, sorting this result and selecting anything outside of the top 2 can be done with:
not_top2 = df.groupby('Version').sum().sort('Value', ascending=False).index[2:]
In this example not_top2
contains Other1
and Other2
.
We can replace those Versions
to a common name with:
dfnew = df.replace(not_top2, 'Other')
print dfnew
Version Value
0 Top1 14
1 Top1 3
2 Top1 2
3 Top2 6
4 Top2 7
5 Other 1
6 Other 2
The above replaces the contents of not_top2
in any column. A little substep is needed if you expect this value to occur in any other column than Version
.
Whats left is to do your original grouping again:
dfnew.groupby('Version').sum()
Which gives:
Value
Version
Other 3
Top1 19
Top2 13
# number of top-n you want
n = 2
# group by & sort descending
df_sorted = (df
.groupby('Version').sum()
.sort_values('Value', ascending=False)
.reset_index()
)
# rename rows other than top-n to 'Others'
df_sorted.loc[df_sorted.index >= n, 'Version'] = 'Others'
# re-group by again
df_sorted.groupby('Version').sum()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With