I could get the size info using groupby and add text to the corresponding location. But I can't help thinking there's a better way as this really seems mundane, something many people would like to see...
To illustrate, the following code would generate a grouped boxplot
import pandas as pd
df = pd.DataFrame(rand(100, 1), columns=['value'])
df.ix[:23, 'class']='A'
df.ix[24:, 'class']='B'
df.boxplot(column='value', by='class')
What I'd like is to show the sample size of each class A and B, namely 24 and 76 respectively. It could appear as legend or somewhere near the boxes, either is ok with me.
Thanks!
Sample size differences can be assessed by scaling the box plot width in proportion to √n (Fig. 1b), the factor by which the pre- cision of the sample's estimate of population statistics improves as sample size is increased. To assist in judging differences between sample medians, a notch (Fig.
Steps. Set the figure size and adjust the padding between and around the subplots. Make a Pandas dataframe, i.e., two-dimensional, size-mutable, potentially heterogeneous tabular data. Make a box and whisker plot, using boxplot() method with width tuple to adjust the box in boxplot.
MatPlotLib with Python PyLab is a procedural interface to the Matplotlib object-oriented plotting library. Matplotlib is the whole package; matplotlib. pyplot is a module in Matplotlib; and PyLab is a module that gets installed alongside Matplotlib. PyLab is a convenience module that bulk imports matplotlib.
n in the class ticklabels. I tried it as a legend but I didn't think it was as clear. R has a lot more boxplot options, including making the width of the boxes proportional to sample size; not a default in matplotlib but easy and seems really readable:
import pandas as pd
from numpy.random import rand, randint
df = pd.DataFrame(rand(100, 1), columns=['value'])
cut1 = randint(2,47)
cut2 = randint(52, 97)
df.ix[:cut1, 'class']='A'
df.ix[cut1+1:cut2, 'class']='B'
df.ix[cut2+1:, 'class'] = 'C'
dfg = df.groupby('class')
counts = [len(v) for k, v in dfg]
total = float(sum(counts))
cases = len(counts)
widths = [c/total for c in counts]
cax = df.boxplot(column='value', by='class', widths=widths)
cax.set_xticklabels(['%s\n$n$=%d'%(k, len(v)) for k, v in dfg])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With