
Pandas histogram df.hist() group by

How can I plot histograms with pandas DataFrame.hist() split by group? I have a data frame with 5 columns: "A", "B", "C", "D" and "group".

The "group" column has two classes: "yes" and "no".

Using:

df.hist() 

I get a histogram for each of the 4 numeric columns.


Now I would like to get the same 4 graphs but with blue bars (group="yes") and red bars (group = "no").

I tried this without success:

df.hist(by = "group")


Hangon asked Aug 25 '17 14:08



2 Answers

Using Seaborn

If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made with seaborn.FacetGrid.

import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example data: four numeric columns plus a "group" label column.
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)

# Reshape to long format: one row per (group, variable, value).
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')

# Use the same bin edges in every subplot so the histograms are comparable.
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
# One subplot per original column, one color per group.
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()


ImportanceOfBeingErnest answered Oct 19 '22 11:10


This is not the most flexible workaround but will work for your question specifically.

import matplotlib.pyplot as plt

def sephist(col):
    # Split one column of df (the frame from the question) by group.
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no

# Subplot indices start at 1, and the column names are upper case.
for num, alpha in enumerate('ABCD', start=1):
    yes, no = sephist(alpha)
    plt.subplot(2, 2, num)
    plt.hist(yes, bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(no, bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)


You could make this more generic by:

  • adding a df and by parameter to sephist: def sephist(df, by, col)
  • making the subplots loop more flexible: for num, alpha in enumerate(df.columns)
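
The two suggestions above could be combined roughly as follows. This is only a sketch: sephist(df, by, col) follows the signature proposed in the list, while plot_grouped_hists, its ncols parameter, and the random sample data are hypothetical names and values introduced for illustration. Colors come from matplotlib's default cycle here rather than the fixed blue/red of the snippet above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def sephist(df, by, col):
    """Return {group label: values of `col`} for each distinct value in df[by]."""
    return {label: part[col] for label, part in df.groupby(by)}


def plot_grouped_hists(df, by, bins=25, ncols=2):
    """Hypothetical helper: one subplot per non-grouping column,
    with one overlaid histogram per group."""
    cols = [c for c in df.columns if c != by]
    nrows = -(-len(cols) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, squeeze=False)
    for ax, col in zip(axes.flat, cols):
        for label, values in sephist(df, by, col).items():
            ax.hist(values, bins=bins, alpha=0.5, label=str(label))
        ax.set_title(col)
        ax.legend(loc="upper right")
    # Hide any axes left over when the grid has more cells than columns.
    for ax in axes.flat[len(cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Illustrative data only, shaped like the frame in the question.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((300, 4)), columns=list("ABCD"))
df["group"] = rng.choice(["yes", "no"], p=[0.32, 0.68], size=300)

fig = plot_grouped_hists(df, by="group")
```

Call plt.show() afterwards to display the figure. Because the groups are taken from df.groupby(by), the same function should also handle more than two classes.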

Because the first argument to matplotlib.pyplot.hist can take

either a single array or a sequence of arrays which are not required to be of the same length

...an alternative would be:

# Subplot indices start at 1, and the column names are upper case.
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    # Passing both arrays at once draws the two groups as side-by-side bars.
    plt.hist(sephist(alpha), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)


Brad Solomon answered Oct 19 '22 11:10