
Creating a matplotlib or seaborn histogram which uses percent rather than count?

Specifically, I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram showing the ages of the passengers who survived and of those who died aboard the Titanic. Code below.

    import matplotlib.pyplot as plt

    # data is the Kaggle Titanic DataFrame
    figure = plt.figure(figsize=(15,8))
    plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
    plt.xlabel('Age')
    plt.ylabel('Number of passengers')
    plt.legend()

I would like to alter the chart to show a single bar per bin whose height is the percentage of that age group that survived. E.g. if a bin covered ages 10-20 and 60% of the people aboard the Titanic in that age group survived, then the bar's height would line up with 60% on the y-axis.

Edit: I may have given a poor explanation of what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.

The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.

The graph would end up actually looking something like this:

[image]

asked Oct 17 '16 at 17:10 by WillacyMe



2 Answers

Perhaps the following will help ...

  1. Split the dataframe based on 'Survived'

    df_survived=df[df['Survived']==1]
    df_not_survive=df[df['Survived']==0]
    
  2. Create Bins

    age_bins=np.linspace(0,80,21)
    
  3. Use np.histogram to generate histogram data

    # age_bins already gives explicit edges, so no range argument is needed; dropna() skips missing ages
    survived_hist=np.histogram(df_survived['Age'].dropna(),bins=age_bins)
    not_survive_hist=np.histogram(df_not_survive['Age'].dropna(),bins=age_bins)
    
  4. Calculate survival rate in each bin

    surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
    
  5. Plot

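    # align='edge' makes each bar span its bin instead of being centred on the bin's left edge
    plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0],align='edge')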
    plt.xlabel('Age')
    plt.ylabel('Survival Rate')
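
Steps 1 to 5 can also be put together in one self-contained sketch. This is only an illustration under stated assumptions: the helper name survival_rate_by_bin, the toy ages and the bin edges are made up here (this is not the Titanic data), and passengers with a missing age are simply dropped. Empty bins are left as NaN and not drawn, rather than shown as 0%.

```python
import numpy as np
import matplotlib.pyplot as plt


def survival_rate_by_bin(ages, survived, bin_edges):
    """Return the fraction of survivors in each age bin (NaN for empty bins).

    Hypothetical helper written for this example.
    """
    ages = np.asarray(ages, dtype=float)
    survived = np.asarray(survived).astype(bool)

    # Drop passengers whose age is not recorded.
    known = ~np.isnan(ages)
    ages, survived = ages[known], survived[known]

    # Count survivors and all passengers using the same bin edges.
    n_survived, _ = np.histogram(ages[survived], bins=bin_edges)
    n_total, _ = np.histogram(ages, bins=bin_edges)

    # Divide only where a bin has passengers; other bins stay NaN.
    rates = np.full(len(n_total), np.nan)
    np.divide(n_survived, n_total, out=rates, where=n_total > 0)
    return rates


# Made-up toy data, NOT the Titanic dataset.
ages = [5, 8, 15, 18, 25, 27, 29, 45, np.nan]
survived = [1, 1, 1, 0, 0, 1, 0, 0, 1]
bin_edges = np.array([0, 10, 20, 30, 40, 50])

rates = survival_rate_by_bin(ages, survived, bin_edges)

# Draw one bar per non-empty bin, spanning the bin from its left edge.
has_data = ~np.isnan(rates)
fig, ax = plt.subplots()
ax.bar(
    bin_edges[:-1][has_data],
    100 * rates[has_data],
    width=np.diff(bin_edges)[has_data],
    align="edge",
    edgecolor="black",
)
ax.set_xlabel("Age")
ax.set_ylabel("Survived (%)")
ax.set_ylim(0, 100)
```

With the real data, something like survival_rate_by_bin(df['Age'], df['Survived'], age_bins) should slot in, assuming df and age_bins are defined as in the steps above.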
    


answered Oct 16 '22 at 19:10 by bahaugen


For Seaborn, use the stat parameter of histplot. According to the documentation at the time of writing, the supported values for stat are:

  • count shows the number of observations
  • frequency shows the number of observations divided by the bin width
  • density normalizes counts so that the area of the histogram is 1
  • probability normalizes counts so that the sum of the bar heights is 1
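
As a rough illustration of what these four normalizations compute, here is the same arithmetic done by hand with NumPy. The sample values and bin edges are made up for the example, and the variable names simply mirror the stat names. Note that all four are normalized over the whole sample, not within each bin.

```python
import numpy as np

# Made-up sample and bin edges, for illustration only.
values = np.array([1, 2, 2, 3, 3, 3, 7, 8])
counts, edges = np.histogram(values, bins=[0, 2, 4, 8])
widths = np.diff(edges)

count = counts                                # number of observations per bin
frequency = counts / widths                   # observations divided by bin width
density = counts / (counts.sum() * widths)    # bar areas sum to 1
probability = counts / counts.sum()           # bar heights sum to 1

print(count, frequency, density, probability, sep="\n")
```

Multiplying probability by 100 gives a percentage of the whole sample, which is still different from the share of survivors inside each bin.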

The result with stat being count:

    import seaborn

    seaborn.histplot(
        data=data,
        x='variable',
        discrete=True,
        stat='count'
    )

Histogram result for stat=count

The result after stat is changed to probability:

    seaborn.histplot(
        data=data,
        x='variable',
        discrete=True,
        stat='probability'
    )

Histogram result for stat=probability

answered Oct 16 '22 at 19:10 by miro