I am trying to group ~11,000 genes of a species of plant (Arabidopsis thaliana), by data I have of their change in expression levels, in response to light exposure.
The raw values, per gene, are continuous random variables, but I wish to discretise the values to have, say, 20 discrete classes.
So instead of:
change_in_expression = array([-2.2, -1.1, -1.2, ..., 0.6, -1. , -0.9])
I have class outputs:
change_in_expression = array(["3_to_4","2_to_3","1_to_2",...])
I plot the distribution with seaborn's distplot()
, which I believe uses a KDE:
import seaborn as sns
d = array([-2.2, -1.1, -1.2, ..., 0.6, -1. , -0.9]) # = change_in_expression
dis = sns.distplot(d, fit=stats.laplace, kde=False)
plt.title("Distribution of Differential Expression")
plt.xlabel("Log2FoldChange in expression")
plt.ylabel("Frequency")
plt.show()
And I am aware that matplotlib.pyplot's
hist()
allows you to extract bins, when the default settings "automatically" generate these groupings...
The question is, how can I group my genes? This is a broader question than just asking for the seaborn
version of matplotlib's
hist()
... since seaborn's
distplot
uses a KDE.
Doesn't really seem like I can get the bins from the ax
object created by seaborn, by looking at the methods available from:
dir(sns.distplot(d, fit=stats.laplace, kde=False)
I guess, one way would be to inspect the guts of seaborn's distplot
source code, to figure out how they bin the data prior to plotting... but that is way beyond my unicorn skillset...
Seaborn calls pyplot.hist
, which in turn calls numpy.histogram
. So one can check what seaborn uses as bins
argument if none is specified. I.e. with a
being the data,
bins = min(_freedman_diaconis_bins(a), 50)
where 1
def _freedman_diaconis_bins(a):
"""Calculate number of hist bins using Freedman-Diaconis rule."""
# From https://stats.stackexchange.com/questions/798/
a = np.asarray(a)
if len(a) < 2:
return 1
h = 2 * iqr(a) / (len(a) ** (1 / 3))
# fall back to sqrt(a) bins if iqr is 0
if h == 0:
return int(np.sqrt(a.size))
else:
return int(np.ceil((a.max() - a.min()) / h))
and iqr
2
def iqr(a):
"""Calculate the IQR for an array of numbers."""
a = np.asarray(a)
q1 = stats.scoreatpercentile(a, 25)
q3 = stats.scoreatpercentile(a, 75)
return q3 - q1
All of this should roughly be the same as
bins = min(len(numpy.histogram_bin_edges(a, bins="fd")), 50)
or
bins = 50 if len(numpy.histogram_bin_edges(a, bins="fd")) > 50 else "fd"
then passing bins
to pyplot.hist
,
plt.hist(a, bins=bins)
I know this question has been answered already but if really need to extract the bins from a histogram, or bar plot, (things that use patches) you can use this function. I use np.unique() because sometimes you may have multiple bar plots. if you have multiple, this assumes they are overlapping bar plots. I think if you 'dodge'
using seaborn this method will not work since it is literally grabbing the placements of the patches (rectangles).
import matplotlib.pyplot as plt
import numpy as np
def get_bins_from_ax(ax=None):
if ax is None:
ax = plt.gca()
bins_from_patches = np.unique([ax.patches[0].get_x()]+[bar.get_x()+bar.get_width() for bar in ax.patches])
return bins_from_patches
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With