I have the following code to draw some histograms about subjects in a database:
import matplotlib.pyplot as plt
attr_info = {
'Gender': ['m', 'f', 'm', 'm', 'f', 'm', 'm', 'f', 'm', 'f'],
'Age': [9, 43, 234, 23, 2, 95, 32, 63, 58, 42],
'Smoker': ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
}
bin_info = {key: None for key in attr_info}
bin_info['Age'] = 10
for name, a_info in attr_info.items():
    plt.figure(num=name)
    counts, bins, _ = plt.hist(a_info, bins=bin_info[name], color='blue', edgecolor='black')
    plt.margins(0)
    plt.title(name)
    plt.xlabel(name)
    plt.ylabel("# Subjects")
    plt.yticks(range(0, 11, 2))
    plt.grid(axis='y')
    plt.tight_layout(pad=0)
plt.show()
This code works but it draws each attribute's distribution in a separate histogram. What I'd like to achieve is something like this:

I'm aware plt.hist has a stacked parameter, but that seems to be intended for a slightly different use: stacking the same attribute across different subject types. You could, for example, draw a histogram where each whole bar represents an age range and the bar itself is a stack of smokers in one colour and non-smokers in another.
I haven't been able to figure out how to use it to stack different attributes on top of each other in each bar (and label them properly, as in the image).
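For reference, the use of stacked described above might look like the following minimal sketch. The bin edges and the variable names (ages_smokers, ages_non_smokers, edges) are assumptions made for illustration only; they are not part of the original code.

```python
import matplotlib.pyplot as plt

ages = [9, 43, 234, 23, 2, 95, 32, 63, 58, 42]
smoker = ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']

# Split ONE attribute (age) by subject type (smoker / non-smoker)
ages_smokers = [a for a, s in zip(ages, smoker) if s == 'y']
ages_non_smokers = [a for a, s in zip(ages, smoker) if s == 'n']

# Bin edges are an arbitrary choice for this sketch
edges = [0, 10, 40, 100, 250]

fig, ax = plt.subplots()
# Passing a list of datasets with stacked=True stacks them within each age bin
counts, bins, _ = ax.hist(
    [ages_smokers, ages_non_smokers],
    bins=edges,
    stacked=True,
    label=['smoker', 'non-smoker'],
    edgecolor='black',
)
ax.set_xlabel('Age')
ax.set_ylabel('# Subjects')
ax.legend()
```

Call plt.show() afterwards to display the figure. Every bar here is still an age range, which is why this does not give one bar per attribute.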
You need to play around with your data a bit, but this can be done without pandas. Also, what you want are stacked bar plots, not histograms:
import matplotlib.pyplot as plt
attr_info = {
'Gender': ['m', 'f', 'm', 'm', 'f', 'm', 'm', 'f', 'm', 'f'],
'Age': [9, 43, 234, 23, 2, 95, 32, 63, 58, 42],
'Smoker': ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
}
# Filter your data for each bar section that you want
ages_0_10 = [x for x in attr_info['Age'] if x < 10]
ages_10_40 = [x for x in attr_info['Age'] if 10 <= x < 40]
# Use >= here so that an age of exactly 40 is not silently dropped
ages_40p = [x for x in attr_info['Age'] if x >= 40]
gender_m = [x for x in attr_info['Gender'] if x == 'm']
gender_f = [x for x in attr_info['Gender'] if x == 'f']
smoker_y = [x for x in attr_info['Smoker'] if x == 'y']
smoker_n = [x for x in attr_info['Smoker'] if x == 'n']
# Locations for each bar (you can move them around)
locs = [0, 1, 2]
# I'm going to plot the Age bar separately from the Smoker and Gender ones,
# since Age has 3 stacked segments and the others have just 2 each
plt.bar(locs[0], len(ages_0_10), width=0.5) # This is the bottom bar
# Second stacked bar, note the bottom variable assigned to the previous bar
plt.bar(locs[0], len(ages_10_40), bottom=len(ages_0_10), width=0.5)
# Same as before but now bottom is the 2 previous bars
plt.bar(locs[0], len(ages_40p), bottom=len(ages_0_10) + len(ages_10_40), width=0.5)
# Add labels, each one placed at the vertical middle of its segment
#plt.text(x, y, text)
plt.text(locs[0], len(ages_0_10) / 2, r'$<10$')
plt.text(locs[0], len(ages_0_10) + len(ages_10_40) / 2, r'$[10, 40)$')
plt.text(locs[0], len(ages_0_10) + len(ages_10_40) + len(ages_40p) / 2, r'$\geq 40$')
# Define the top bars and bottom bars for the Gender and Smoker stacks
# In both cases there are just 2 stacked bars,
# so we can use lists instead of plotting each one separately as for Age
tops = [len(gender_m), len(smoker_y)]
bottoms = [len(gender_f), len(smoker_n)]
plt.bar(locs[1:], bottoms, width=0.5)
plt.bar(locs[1:], tops, bottom=bottoms, width=0.5)
# Labels again: 'f' and 'n' are the bottom segments, 'm' and 'y' sit on top
# Gender
plt.text(locs[1], len(gender_f) / 2, 'f')
plt.text(locs[1], len(gender_f) + len(gender_m) / 2, 'm')
# Smokers
plt.text(locs[2], len(smoker_n) / 2, 'n')
plt.text(locs[2], len(smoker_n) + len(smoker_y) / 2, 'y')
# Set tick labels
plt.xticks(locs, ('Age', 'Gender', 'Smoker'))
plt.show()
Result:

Check the documentation for pyplot.bar and this example.
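If you have more attributes, the same bottom-accumulation idea can be written as a loop. The following is only a sketch under some assumptions: the age_group helper and its cut-offs are hypothetical, and collections.Counter does the counting instead of separate filtered lists.

```python
from collections import Counter

import matplotlib.pyplot as plt

attr_info = {
    'Gender': ['m', 'f', 'm', 'm', 'f', 'm', 'm', 'f', 'm', 'f'],
    'Age': [9, 43, 234, 23, 2, 95, 32, 63, 58, 42],
    'Smoker': ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
}


def age_group(age):
    """Hypothetical helper: map an age to a group label (cut-offs are arbitrary)."""
    if age < 10:
        return '<10'
    if age < 40:
        return '[10, 40)'
    return '>=40'


# One Counter per attribute: {category: number of subjects}
counts = {
    'Age': Counter(age_group(a) for a in attr_info['Age']),
    'Gender': Counter(attr_info['Gender']),
    'Smoker': Counter(attr_info['Smoker']),
}

fig, ax = plt.subplots()
for loc, (name, counter) in enumerate(counts.items()):
    bottom = 0  # running height of the stack for this attribute
    for category, n in sorted(counter.items()):
        ax.bar(loc, n, bottom=bottom, width=0.5, edgecolor='black')
        # Centre the label inside the segment it belongs to
        ax.text(loc, bottom + n / 2, category, ha='center', va='center')
        bottom += n

ax.set_xticks(list(range(len(counts))))
ax.set_xticklabels(list(counts))
ax.set_ylabel('# Subjects')
```

Call plt.show() to display it. Because each label is placed at bottom + n / 2, it always lands in the segment it describes, so the label positions do not need to be tuned by hand.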
How about trying out pandas:
import matplotlib.pyplot as plt
import pandas as pd
attr_info = {
'Gender': ['m', 'f', 'm', 'm', 'f', 'm', 'm', 'f', 'm', 'f'],
'Age': [9, 43, 234, 23, 2, 95, 32, 63, 58, 42],
'Smoker': ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
}
df = pd.DataFrame(attr_info)
bins = [0,32,45,300] #bins can be adjusted to your liking
#drop "Age" and count the values in all remaining columns
#(the regex "[^Age]" is a character class, so it only excludes "Age" by accident)
counts = df.drop(columns="Age").apply(pd.Series.value_counts)
#bin age data and count
age_data = df.groupby(pd.cut(df['Age'], bins=bins))["Age"].count()
fig, ax = plt.subplots()
pd.concat([counts,age_data]).rename(columns={0:"Age"}).T.plot(kind="bar", stacked=True, ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
Output:

The advantage of this approach is its generality, no matter how many columns you want to plot.
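To illustrate that generality, here is a rough sketch of the same idea wrapped in a function. The name stacked_counts and its bins argument (a dict mapping a numeric column to its bin edges) are hypothetical and only serve this example.

```python
import pandas as pd


def stacked_counts(df, bins):
    """Hypothetical helper: one row per column of df, one column per category.

    `bins` maps the name of a numeric column to the bin edges passed to pd.cut;
    columns that are not listed are counted as plain categories.
    """
    pieces = {}
    for col in df.columns:
        values = pd.cut(df[col], bins=bins[col]) if col in bins else df[col]
        vc = values.value_counts(sort=False)
        # Use plain strings so interval labels and category labels can be mixed
        vc.index = vc.index.astype(str)
        pieces[col] = vc
    return pd.DataFrame(pieces).T


attr_info = {
    'Gender': ['m', 'f', 'm', 'm', 'f', 'm', 'm', 'f', 'm', 'f'],
    'Age': [9, 43, 234, 23, 2, 95, 32, 63, 58, 42],
    'Smoker': ['y', 'n', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
}
table = stacked_counts(pd.DataFrame(attr_info), bins={'Age': [0, 32, 45, 300]})
```

table.plot(kind="bar", stacked=True) then draws one stacked bar per row. Adding another column to attr_info (and, if it is numeric, an entry in bins) should be all that is needed to get another bar.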