I want to draw a boxplot of column Z in dataframe df by the categories X and Y. How can I sort the boxplot by the median, in descending order?
import pandas as pd
import random
n = 100
# this is probably a strange way to generate random data; please feel free to correct it
df = pd.DataFrame({"X": [random.choice(["A","B","C"]) for i in range(n)],
"Y": [random.choice(["a","b","c"]) for i in range(n)],
"Z": [random.gauss(0,1) for i in range(n)]})
df.boxplot(column="Z", by=["X", "Y"])
Note that this question is very similar, but they use a different data structure. I'm relatively new to pandas (and have only done some tutorials on python in general), so I couldn't figure out how to make my data work with the answer posted there. This may well be more of a reshaping than a plotting question. Maybe there is a solution using groupby?
You can sort a pandas DataFrame by the values in a column with the sort_values() method. The ascending parameter controls the order: True (the default) sorts in ascending order, False in descending order. By default, sort_values() returns a new, sorted DataFrame rather than modifying the original.
If you want to see the median, you can use df.describe(). The 50% value is the median.
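As a small sketch of how these two pieces relate to the question, the snippet below computes the median of Z for each (X, Y) group and sorts the result in descending order. The toy data and the variable names medians and summary are made up for illustration; they are not part of the original question.

```python
import pandas as pd

# Tiny hand-made frame (illustrative only) so the result is easy to check by eye.
df = pd.DataFrame({
    "X": ["A", "A", "A", "B", "B", "B"],
    "Y": ["a", "a", "a", "b", "b", "b"],
    "Z": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Median of Z for every (X, Y) combination, largest first.
medians = df.groupby(["X", "Y"])["Z"].median().sort_values(ascending=False)

# describe() reports the same number per group in its "50%" column.
summary = df.groupby(["X", "Y"])["Z"].describe()

print(medians)
print(summary["50%"])
```

The index of medians is then the order in which the groups would need to appear in the boxplot.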
You can use the answer in "How to sort a boxplot by the median values in pandas", but first you need to group your data and create a new data frame:
import pandas as pd
import random
import matplotlib.pyplot as plt
n = 100
# this is probably a strange way to generate random data; please feel free to correct it
df = pd.DataFrame({"X": [random.choice(["A","B","C"]) for i in range(n)],
"Y": [random.choice(["a","b","c"]) for i in range(n)],
"Z": [random.gauss(0,1) for i in range(n)]})
grouped = df.groupby(["X", "Y"])
df2 = pd.DataFrame({col:vals['Z'] for col,vals in grouped})
meds = df2.median()
meds.sort_values(ascending=False, inplace=True)
df2 = df2[meds.index]
df2.boxplot()
plt.show()
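Similar to Alvaro Fuentes' answer, but in function form for more portability:
import pandas as pd
def boxplot_sorted(df, by, column):
    # Build one column per group so that each group gets its own box.
    df2 = pd.DataFrame({col: vals[column] for col, vals in df.groupby(by)})
    # Sort the group medians (ascending here; use sort_values(ascending=False) for descending order).
    meds = df2.median().sort_values()
    # Reorder the columns by median and rotate the tick labels for readability.
    df2[meds.index].boxplot(rot=90)
boxplot_sorted(df, by=["X", "Y"], column="Z")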
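Since the question asks for descending order, one possible variation is to separate the reshaping from the plotting and expose the sort direction as a parameter. This is only a sketch: the helper name sorted_by_median, its ascending parameter, and the toy data are assumptions introduced here, not part of either answer.

```python
import pandas as pd

def sorted_by_median(df, by, column, ascending=False):
    """Return one column per group, ordered by each group's median."""
    # One column per group; rows stay aligned on the original index,
    # so cells belonging to other groups are NaN (median() skips them).
    wide = pd.DataFrame({key: vals[column] for key, vals in df.groupby(by)})
    order = wide.median().sort_values(ascending=ascending).index
    return wide[order]

# Hand-made data (illustrative only) with clearly separated medians.
toy = pd.DataFrame({
    "X": ["A", "A", "B", "B", "C", "C"],
    "Y": ["a", "a", "a", "a", "b", "b"],
    "Z": [1.0, 3.0, 10.0, 30.0, 5.0, 7.0],
})

wide = sorted_by_median(toy, by=["X", "Y"], column="Z")
print(list(wide.columns))
```

Calling .boxplot(rot=90) on the returned frame should then draw the boxes in that order, in the same way as the answers above.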