I have a dataset that maps continuous values to discrete categories. I want to display a histogram with the continuous values as x and categories as y, where bars are stacked and normalized. Example:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
},
columns=['score', 'category'])
print(df.head(10))
Output:
score category
0 0.649371 B
1 0.042309 B
2 0.689487 A
3 0.433064 B
4 0.978859 A
5 0.789140 C
6 0.215758 D
7 0.922389 B
8 0.105364 D
9 0.010274 C
If I try to plot this as a histogram using df.hist(by='category'), I get 4 graphs:
I managed to get the graph I wanted but I had to do a lot of manipulation.
# One column per category, 1 if maps to category, 0 otherwise
df2 = pd.DataFrame({
'score' : df.score,
'A' : (df.category == 'A').astype(float),
'B' : (df.category == 'B').astype(float),
'C' : (df.category == 'C').astype(float),
'D' : (df.category == 'D').astype(float)
},
columns=['score', 'A', 'B', 'C', 'D'])
# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])
# Sum over series for weights
df4 = df3.sum(1)
bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))
bars.plot.bar(stacked=True)
I expect there is a more straightforward way to do this: easier to read and understand, and with fewer intermediate steps. Any solutions?
I don't know if this is really that much more compact or readable than what you already have, but it is a suggestion (a late one as such :)).
import numpy as np
import pandas as pd
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)
# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size()
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0,level=0)
# Plot
df_a.unstack().plot.bar(stacked=True)
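One detail worth checking in the pd.cut step: by default the intervals are open on the left and closed on the right, so a score of exactly 0 would fall outside the first bin and come back as NaN. With np.random.rand that is practically never hit, but real data may contain it. The sketch below uses made-up sample values (not from the question) to show the default behaviour and the include_lowest=True option.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen to sit on or near bin edges
scores = pd.Series([0.0, 0.05, 0.1, 0.55, 1.0])
edges = np.linspace(0, 1, 11)  # 11 edges give 10 bins of width 0.1

# Default: intervals look like (0.0, 0.1], so 0.0 gets no bin (NaN)
default_bins = pd.cut(scores, edges)

# include_lowest=True makes the first interval include its left edge
closed_bins = pd.cut(scores, edges, include_lowest=True)

print(default_bins.isna().tolist())
print(closed_bins.isna().tolist())
print(len(default_bins.cat.categories))
```

If unbinned rows matter for your data, passing include_lowest=True to the pd.cut call above should be enough.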
Consider assigning bins with cut, calculating group percentages with a couple of groupby().transform calls, and then aggregating and reshaping with pivot_table:
# CREATE BIN INDICATORS
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0,1.1,0.1),
labels=np.arange(0,1,0.1).round(1))
# CALCULATE PCT OF CATEGORY OUT OF BINs
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
.div(df.groupby(['plot_bins'])['score'].transform('count')))
# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
.reset_index(drop=True))
# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
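A shorter route, offered as an alternative sketch rather than either answerer's method, is pd.crosstab with normalize='index', which counts bin/category pairs and divides each row by its total in one call. The fixed seed and the names score_bins and shares are my own choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed only so the run is reproducible
df = pd.DataFrame({
    'score': np.random.rand(1000),
    'category': np.random.choice(list('ABCD'), 1000),
})

# Bin the scores into ten equal-width intervals
score_bins = pd.cut(df['score'], bins=np.linspace(0, 1, 11), include_lowest=True)

# Cross-tabulate bin vs. category; normalize='index' divides each row by its total
shares = pd.crosstab(score_bins, df['category'], normalize='index')

# Every bin's shares should add up to 1
print(shares.sum(axis=1).round(6).tolist())

# With matplotlib installed, the stacked, normalized bars can be drawn with:
# shares.plot.bar(stacked=True)
```

Note that crosstab drops empty bins by default, so with sparse data the plot may show fewer than ten bars.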