Python 2.7 and Pandas Boxplot connecting median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.

I'm using this syntax to draw the boxplot, so that pandas automatically generates the box plot for Y vs. X device without any external manipulation of the data frame:

df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)

One approach I thought of is to draw a line plot from the mean values shown in the boxplot, but I'm not sure how to extract that information from the plot.

Patratacus asked May 18 '17 05:05


1 Answer

You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.

First let's generate some sample data:

import pandas as pd
import numpy as np
import seaborn as sns

N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})

print(df.head())
  group     value
0     A  0.816847
1     A  0.468465
2     C  0.871975
3     B  0.933708
4     A  0.480170
              ...

Next, make the boxplot and save the axis object:

# Place the boxes at x = 0..n-1 (matplotlib's default is 1..n).
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df['group'].nunique()))

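Note: The positions argument matters here. Matplotlib's boxplot() (which pandas calls) places the boxes at x = 1, 2, ..., N by default, while Seaborn's categorical plots start at 0, so overlaying the two without adjustment shifts the line by one category. Passing positions that start at 0 is the workaround I've employed here. See more in this discussion.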
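To make the off-by-one issue concrete, here is a minimal sketch that compares the x tick locations of a plain matplotlib boxplot with and without an explicit positions argument. The variable names (data, ax_default, ax_zero) and the Agg backend are my own choices for illustration, not part of the original question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this in an interactive session
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
data = [rng.random_sample(20) for _ in range(3)]  # three made-up groups

fig, (ax_default, ax_zero) = plt.subplots(1, 2)

# Default: boxes are drawn at x = 1, 2, 3.
ax_default.boxplot(data)

# Explicit positions: boxes are drawn at x = 0, 1, 2, matching Seaborn's
# categorical axis.
ax_zero.boxplot(data, positions=list(range(3)))

default_ticks = list(ax_default.get_xticks())
zero_ticks = list(ax_zero.get_xticks())
print(default_ticks, zero_ticks)
```

On a fresh axis the first list should hold the locations 1 to 3 and the second 0 to 2, which is why the overlay needs the positions workaround.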

Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:

sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)

Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
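If you would rather avoid the Seaborn dependency, the same idea can be sketched with pandas and matplotlib alone, this time connecting medians. This is only a sketch under assumptions: the random seed, the marker styling, and the Agg backend are mine, and it relies on groupby and boxplot both ordering the groups by sorted label.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this in an interactive session
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
N = 150
df = pd.DataFrame({
    'value': rng.random_sample(size=N),
    'group': rng.choice(['A', 'B', 'C'], size=N),
})

# Per-group medians; groupby sorts the group labels, as boxplot(by=...) does.
medians = df.groupby('group')['value'].median()

# Draw the boxes at x = 0..n-1 so the line below can use the same x values.
positions = list(range(len(medians)))
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=positions)

# Connect the medians with a line on the same axis.
(median_line,) = ax.plot(positions, medians.values, marker='o', color='k')
```

Swap median() for mean() to connect the means instead. In an interactive session, call matplotlib.pyplot.show() or ax.figure.savefig(...) afterwards to see the result.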

andrew_reece answered Sep 30 '22 14:09