
How to apply custom column order (on Categorical) to pandas boxplot?

EDIT: this question arose back in 2013 with pandas ~0.13 and was made obsolete by direct support for boxplot somewhere between versions 0.15 and 0.18 (as per @Cireo's late answer; pandas has also greatly improved its support for categoricals since this was asked).


I can get a boxplot of a salary column in a pandas DataFrame...

train.boxplot(column='Salary', by='Category', sym='')

...however I can't figure out how to define the index-order used on column 'Category' - I want to supply my own custom order, according to another criterion:

category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()
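For readers on a current pandas: Series.order() was later removed, and sort_values() is its replacement. A minimal sketch of the same computation, assuming a small made-up stand-in for `train` (the rows below are hypothetical, not the original data):

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame({
    "Category": ["Admin Jobs", "Travel Jobs", "Admin Jobs",
                 "Accounting & Finance Jobs", "Travel Jobs"],
    "Salary": [20000, 30000, 24000, 45000, 28000],
})

# Mean salary per category, sorted ascending; the index holds the category names.
mean_salary = train.groupby("Category")["Salary"].mean().sort_values()
category_order_by_mean_salary = list(mean_salary.index)
print(category_order_by_mean_salary)
```

The resulting list can then be used wherever an explicit category order is needed.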

How can I apply my custom column order to the boxplot columns? (Other than the ugly kludge of prefixing the column names to force an ordering.)

'Category' is a string (really, should be a categorical, but this was back in 0.13, where categorical was a third-class citizen) column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. So it can be easily factorized with pd.Categorical.from_array()
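As a sketch of how this looks on a current pandas (not something that worked in 0.13): pd.Categorical.from_array() has since been removed, but the pd.Categorical constructor accepts an explicit `categories` list, and grouping then follows that category order rather than alphabetical order. The `train` rows below are hypothetical stand-ins, and the "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame({
    "Category": ["Admin Jobs", "Travel Jobs", "Admin Jobs",
                 "Accounting & Finance Jobs", "Travel Jobs"],
    "Salary": [20000, 30000, 24000, 45000, 28000],
})

# Custom order: categories sorted by ascending mean salary.
order = list(train.groupby("Category")["Salary"].mean().sort_values().index)

# Categories are listed in the desired order; groupby iterates groups
# in category order, so the boxes should come out in that order too.
train["Category"] = pd.Categorical(train["Category"], categories=order, ordered=True)

fig, ax = plt.subplots()
train.boxplot(column="Salary", by="Category", sym="", ax=ax)
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```

Whether a given pandas release honours the category order here is worth checking against its documentation; the versions mentioned in the EDIT above are the asker's estimate.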

On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing ordering:

  • pandas.core.frame.py:boxplot() is a passthrough to
  • pandas.tools.plotting.py:boxplot(), which calls ...
  • matplotlib.pyplot.py:boxplot(), which calls ...
  • matplotlib.axes.py:boxplot()

I suppose I could either hack up a custom version of pandas boxplot(), or reach into the internals of the object, and also file an enhancement request.

smci asked Mar 21 '13 07:03



1 Answer

Hard to say how to do this without a working example. My first guess would be to just add an integer column with the orders that you want.
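A rough sketch of that integer-column idea, assuming a made-up `train` frame (the data and the `CategoryRank` column name are hypothetical): map each category to its rank in the desired order, group the boxplot by the rank, then relabel the ticks with the category names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame({
    "Category": ["Admin Jobs", "Travel Jobs", "Admin Jobs",
                 "Accounting & Finance Jobs", "Travel Jobs"],
    "Salary": [20000, 30000, 24000, 45000, 28000],
})

# Desired order: categories sorted by ascending mean salary.
order = list(train.groupby("Category")["Salary"].mean().sort_values().index)

# Integer rank of each row's category within that order.
rank = {name: i for i, name in enumerate(order)}
train["CategoryRank"] = train["Category"].map(rank)

fig, ax = plt.subplots()
train.boxplot(column="Salary", by="CategoryRank", ax=ax)

# Integers sort numerically, so box i belongs to order[i]; restore the names.
ax.set_xticklabels(order)
ax.set_xlabel("Category")
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

This avoids prefixing the category names themselves, at the cost of an extra helper column.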

A simple, brute-force way would be to add each boxplot one at a time.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random demo data: 37 rows, four columns named A-D
df = pd.DataFrame(np.random.rand(37, 4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
# Draw one box per column, placed at the x position given by its rank in the custom order
for position, column in enumerate(columns_my_order):
    ax.boxplot(df[column], positions=[position])

# Label each position with its column name
ax.set_xticks(range(len(columns_my_order)))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(left=-0.5)
plt.show()

[Figure: box plots of columns C, A, D and B, drawn in the custom order]
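The loop above can also be collapsed into a single call, since matplotlib's boxplot accepts a sequence of arrays and draws them left to right in the order given. A minimal sketch under the same assumptions as the answer's example (random demo data; the fixed seed and the "Agg" backend are additions so the example is reproducible and runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
df = pd.DataFrame(rng.random((37, 4)), columns=list("ABCD"))
columns_my_order = ["C", "A", "D", "B"]

fig, ax = plt.subplots()
# One array per column, already in the custom order; boxes land at x = 1, 2, 3, 4.
bp = ax.boxplot([df[column].to_numpy() for column in columns_my_order])
ax.set_xticklabels(columns_my_order)
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Note that here the boxes sit at positions 1 to 4 (matplotlib's default) rather than 0 to 3 as in the loop version.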

Paul H answered Oct 12 '22 05:10