Let's say I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this only shows up at the far left of the graph, followed by a flat line for all the other categories. In the real data the x axis will be categorical with about 18000 categories; about 4% of them will have counts of around 10000, and the rest will drop off to around 50.
The audience is "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see @unutbu's answer.
I updated the code, and I'm now getting an error from qcut when trying to use tuples:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
You could keep the normalized value counts above a certain threshold, then sum the values below the threshold and lump them into a single category called, say, "other". By choosing threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
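If you need to do this for more than one column, the same steps can be wrapped in a small helper. This is only a sketch: the name lump_tail and the other_label parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def lump_tail(values, threshold=0.02, other_label="other"):
    """Keep categories whose share exceeds `threshold`; sum the rest into one bucket.

    `lump_tail` and `other_label` are hypothetical names used for this sketch.
    """
    prob = pd.Series(values).value_counts(normalize=True)
    keep = prob > threshold
    head = prob[keep]                                      # the important contributors
    tail = pd.Series({other_label: prob[~keep].sum()})     # everything else, combined
    return pd.concat([head, tail])


# Small toy input so the result is easy to check by eye
example = lump_tail(list("aaaaaabbbcd"), threshold=0.1)
print(example)
```

The returned Series can be plotted exactly as above with example.plot(kind='bar'). Since the shares still sum to 1, the "other" bar shows how much of the total sits in the tail.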
There is a limit to the number of category labels you can sensibly display on a bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is probably not reasonable to expect an audience to glean any meaning out of reading 3000 labels.
The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
np.random.randint(0, 100, size=N-M), ]), index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
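For the tuple-valued case in the question's update, here is one possible sketch. It assumes the goal is to bin each distinct row pattern by its probability, so qcut only ever sees numbers and the patterns themselves are discarded before binning. The toy rows are made up for illustration. Because small data tends to have tied probabilities, which can make qcut complain about duplicate bin edges, the sketch bins on ranks with ties broken by order rather than on the raw probabilities; that is a swapped-in workaround, not something the original code does.

```python
import pandas as pd

# Made-up toy data: each row is one observed pattern of the four columns
rows = ([(1, 1, 1, 0)] * 10 + [(0, 0, 0, 0)] * 5 + [(1, 0, 1, 1)] * 2
        + [(0, 1, 1, 1)] * 2 + [(0, 0, 1, 0)] * 1)
df = pd.DataFrame(rows, columns=['s1', 's2', 's3', 's4'])

# Share of each distinct row pattern; drop the pattern index so only numbers remain
prob = df.value_counts(normalize=True).reset_index(drop=True)

# Rank the probabilities, breaking ties by position, so the quantile edges are unique
ranks = prob.rank(method='first')

labels = ['bottom 25%', 'mid 70%', 'top 5%']
category_classes = pd.qcut(ranks, q=[0, .25, 0.95, 1.], labels=labels)

# Total probability carried by each class of patterns
prob_groups = prob.groupby(category_classes, observed=False).sum()
print(prob_groups)
```

With only a handful of distinct patterns the class names are approximate (the "top 5%" class here is simply the single most common pattern), but on something like 18000 categories the split should be close to the nominal percentages. prob_groups can then be plotted with prob_groups.plot(kind='bar') as before.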