Specifically I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram which shows ages that survived and died upon the titanic. Code below.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
I would like to alter the chart to show a single chart per bin of the percentage in that age group that survived. E.g. if a bin contained the ages between 10-20 years of age and 60% of people aboard the titanic in that age group survived, then the height would line up 60% along the y-axis.
Edit: I may have given a poor explanation to what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.
The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.
The graph would end up actually looking something like this:
In Matplotlib, we use the hist() function to create histograms. The hist() function will use an array of numbers to create a histogram, the array is sent into the function as an argument.
The plotting library Seaborn has built-in function to make histogram. The Seaborn function to make histogram is “distplot” for distribution plot. As usual, Seaborn's distplot can take the column from Pandas dataframe as argument to make histogram.
We can achieve this by increasing the number of bins, which is essentially the number of classes the histogram divides the data into. More bins will make the histogram smoother.
countplot. Show the counts of observations in each categorical bin using bars. A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
Perhaps the following will help ...
Split the dataframe based on 'Survived'
df_survived=df[df['Survived']==1]
df_not_survive=df[df['Survived']==0]
Create Bins
age_bins=np.linspace(0,80,21)
Use np.histogram to generate histogram data
survived_hist=np.histogram(df_survived['Age'],bins=age_bins,range=(0,80))
not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins,range=(0,80))
Calculate survival rate in each bin
surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
Plot
plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0])
plt.xlabel('Age')
plt.ylabel('Survival Rate')
For Seaborn, use the parameter stat
. According to the documentation, currently supported values for the stat
parameter are:
count
shows the number of observationsfrequency
shows the number of observations divided by the bin widthdensity
normalizes counts so that the area of the histogram is 1probability
normalizes counts so that the sum of the bar heights is 1The result with stat
being count
:
seaborn.histplot(
data=data,
x='variable',
discrete=True,
stat='count'
)
The result after stat
is changed to probability
:
seaborn.histplot(
data=data,
x='variable',
discrete=True,
stat='probability'
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With