Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I extract the bins from seaborn's KDE distplot object?

Background

I am trying to group ~11,000 genes of a species of plant (Arabidopsis thaliana), by data I have of their change in expression levels, in response to light exposure.

The raw values, per gene, are continuous random variables, but I wish to discretise the values to have, say, 20 discrete classes.

So instead of:

change_in_expression = array([-2.2, -1.1, -1.2, ...,  0.6, -1. , -0.9])

I have class outputs:

change_in_expression = array(["3_to_4","2_to_3","1_to_2",...])

What I have tried

I plot the distribution with seaborn's distplot(), which I believe uses a KDE:

import seaborn as sns

d = array([-2.2, -1.1, -1.2, ...,  0.6, -1. , -0.9]) # = change_in_expression

dis = sns.distplot(d, fit=stats.laplace, kde=False)

plt.title("Distribution of Differential Expression")
plt.xlabel("Log2FoldChange in expression")
plt.ylabel("Frequency")
plt.show()

enter image description here

And I am aware that matplotlib.pyplot's hist() allows you to extract bins, when the default settings "automatically" generate these groupings...


Summary Question

The question is, how can I group my genes? This is a broader question than just asking for the seaborn version of matplotlib's hist()... since seaborn's distplot uses a KDE.

Doesn't really seem like I can get the bins from the ax object created by seaborn, by looking at the methods available from:

dir(sns.distplot(d, fit=stats.laplace, kde=False)

I guess, one way would be to inspect the guts of seaborn's distplot source code, to figure out how they bin the data prior to plotting... but that is way beyond my unicorn skillset...

like image 224
Isaac Brockman Avatar asked Mar 03 '23 18:03

Isaac Brockman


2 Answers

Seaborn calls pyplot.hist, which in turn calls numpy.histogram. So one can check what seaborn uses as bins argument if none is specified. I.e. with a being the data,

bins = min(_freedman_diaconis_bins(a), 50)

where 1

def _freedman_diaconis_bins(a):
    """Calculate number of hist bins using Freedman-Diaconis rule."""
    # From https://stats.stackexchange.com/questions/798/
    a = np.asarray(a)
    if len(a) < 2:
        return 1
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
    # fall back to sqrt(a) bins if iqr is 0
    if h == 0:
        return int(np.sqrt(a.size))
    else:
        return int(np.ceil((a.max() - a.min()) / h))

and iqr 2

def iqr(a):
    """Calculate the IQR for an array of numbers."""
    a = np.asarray(a)
    q1 = stats.scoreatpercentile(a, 25)
    q3 = stats.scoreatpercentile(a, 75)
    return q3 - q1

All of this should roughly be the same as

bins = min(len(numpy.histogram_bin_edges(a, bins="fd")), 50)

or

bins = 50 if len(numpy.histogram_bin_edges(a, bins="fd")) > 50 else "fd"

then passing bins to pyplot.hist,

plt.hist(a, bins=bins)
like image 169
ImportanceOfBeingErnest Avatar answered Apr 01 '23 03:04

ImportanceOfBeingErnest


I know this question has been answered already but if really need to extract the bins from a histogram, or bar plot, (things that use patches) you can use this function. I use np.unique() because sometimes you may have multiple bar plots. if you have multiple, this assumes they are overlapping bar plots. I think if you 'dodge' using seaborn this method will not work since it is literally grabbing the placements of the patches (rectangles).

import matplotlib.pyplot as plt
import numpy as np 

def get_bins_from_ax(ax=None):
    if ax is None:
        ax = plt.gca()
    bins_from_patches = np.unique([ax.patches[0].get_x()]+[bar.get_x()+bar.get_width() for bar in ax.patches])
    return bins_from_patches

like image 21
Phillip Maire Avatar answered Apr 01 '23 03:04

Phillip Maire