Distribution-type graphs (histogram/kde) with weighted data

In a nutshell, what is my best option for distribution-type graphs (histogram or kde) when my data is weighted?

import pandas as pd
import seaborn as sns

df = pd.DataFrame({ 'x':[1,2,3,4], 'wt':[7,5,3,1] })

df.x.plot(kind='hist',weights=df.wt.values)

That works fine but seaborn won't accept a weights kwarg, i.e.

sns.distplot( df.x, bins=4,              # doesn't work like this
              weights=df.wt.values )     # or with kde=False added

It would also be nice if kde would accept weights but neither pandas nor seaborn seems to allow it.

I realize, btw, that the data could be expanded to fake the weighting. That's easy here, but not of much use with my real data, where the weights are in the hundreds or thousands, so I'm not looking for a workaround like that.

Anyway, that's all. I'm just trying to find out what (if anything) I can do with weighted data besides the basic pandas histogram. I haven't fooled around with bokeh yet, but bokeh suggestions are also welcome.

JohnE asked Apr 27 '15 02:04




2 Answers

Seaborn uses the same matplotlib plotting functions that pandas uses.

As the documentation states, sns.distplot does not accept a weights argument; however, it does take a hist_kws argument, which is passed to the underlying call to plt.hist. Thus, this should do what you want:

sns.distplot(df.x, bins=4, hist_kws={'weights':df.wt.values}) 
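
To see what the weights actually do to the bars, here is a minimal sketch using NumPy's histogram function directly on the question's data. As far as I know matplotlib's hist relies on the same routine, but treat that as an assumption; the point is only that each observation contributes its weight to a bin instead of a count of 1.

```python
import numpy as np

# Data from the question: values and their weights.
x = np.array([1, 2, 3, 4])
wt = np.array([7, 5, 3, 1])

# Each observation adds its weight (rather than 1) to the bin it falls in.
counts, edges = np.histogram(x, bins=4, weights=wt)

# density=True rescales the weighted counts so the bar areas sum to 1.
density, _ = np.histogram(x, bins=4, weights=wt, density=True)

print(counts)
print(edges)
print(density)
```

With density=True the heights are the weighted counts divided by the total weight and the bin width, which is what you want if the histogram is drawn together with a density curve.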
hitzg answered Sep 30 '22 17:09


I solved this problem by resampling the data points based on their weight.

You can do it like this:

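from bisect import bisect
from random import random

import numpy as np
import seaborn as sns

def weighted_choice(choices):
    """Pick one value from (value, weight) pairs, with probability proportional to weight."""
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random() * total          # uniform draw in [0, total)
    i = bisect(cum_weights, x)    # index of the first cumulative weight greater than x
    return values[i]

# (point, weight) pairs; each point is [x, y]
samples = [([5, 0.5], 0.1), ([0, 10], 0.3), ([0, -4], 0.3)]
choices = np.array([weighted_choice(samples) for _ in range(1000)])
# Recent seaborn takes x=/y= keywords and fill=; older releases used positional arrays and shade=True.
sns.kdeplot(x=choices[:, 0], y=choices[:, 1], fill=True)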
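
If you would rather avoid the resampling step, a weighted KDE can also be computed directly. The following is only a rough sketch using the 1-D data from the question: weighted_kde is a hypothetical helper written for this example (not a pandas or seaborn function), and the bandwidth of 0.5 and the grid range are arbitrary assumptions. The idea is that every data point gets a Gaussian bump scaled by its normalized weight.

```python
import numpy as np

def weighted_kde(x, weights, grid, bandwidth):
    """Gaussian KDE on `grid` where each point counts in proportion to its weight."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize weights to sum to 1
    z = (grid[:, None] - x[None, :]) / bandwidth      # shape (n_grid, n_points)
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels @ w                                # weighted sum of the bumps

# Data from the question.
x = [1, 2, 3, 4]
wt = [7, 5, 3, 1]

grid = np.linspace(-4, 9, 1301)
density = weighted_kde(x, wt, grid, bandwidth=0.5)

# Riemann-sum check that the curve integrates to roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

The resulting density can be drawn with any line plot of grid against density. I believe reasonably recent versions of scipy.stats.gaussian_kde and seaborn's kdeplot also accept a weights argument, so check your installed versions before hand-rolling this.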

Andres Romero answered Sep 30 '22 18:09