In a nutshell, what is my best option for a distribution-type graph (histogram or KDE) when my data is weighted?
import pandas as pd
df = pd.DataFrame({ 'x':[1,2,3,4], 'wt':[7,5,3,1] })
df.x.plot(kind='hist',weights=df.wt.values)
That works fine but seaborn won't accept a weights kwarg, i.e.
sns.distplot( df.x, bins=4, # doesn't work like this
weights=df.wt.values ) # or with kde=False added
It would also be nice if kde would accept weights but neither pandas nor seaborn seems to allow it.
I realize, btw, that the data could be expanded to fake the weighting. That's easy here, but it's not much use with my real data, where the weights are in the hundreds or thousands, so I'm not looking for a workaround like that.
Anyway, that's all. I'm just trying to find out what (if anything) I can do with weighted data besides the basic pandas histogram. I haven't fooled around with bokeh yet, but bokeh suggestions are also welcome.
A weighted histogram shows the weighted distribution of the data. If the histogram displays proportions (rather than raw counts), then the heights of the bars are the sum of the standardized weights of the observations within each bin.
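As a rough illustration of the statement above, here is a minimal sketch using NumPy's np.histogram with the toy data from the question. The variable names (std_wt, heights, edges) are made up for this example, and "standardized" is assumed to mean weights divided by their sum.

```python
import numpy as np

# Toy data from the question: values and their weights.
x = np.array([1, 2, 3, 4])
wt = np.array([7, 5, 3, 1])

# Assumption: "standardized" weights are the weights rescaled to sum to 1.
std_wt = wt / wt.sum()

# Each bar height is the sum of the standardized weights falling in that bin.
heights, edges = np.histogram(x, bins=4, weights=std_wt)

print(heights)        # one proportion per bin
print(heights.sum())  # proportions add up to 1
```

With four bins each value lands in its own bin, so the bar heights are simply the individual weights divided by 16.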
A distplot, or distribution plot, depicts the variation in the data distribution. Seaborn's distplot represents the overall distribution of continuous data variables. The Seaborn module is used together with the Matplotlib module to draw the distplot and its variations.
The displot function of Seaborn can create three types of distribution plots: a histogram, a KDE (kernel density estimate) plot, and an ECDF plot.
You have to understand that seaborn uses the very same matplotlib plotting functions that pandas also uses.
As the documentation states, sns.distplot does not accept a weights argument; however, it does take a hist_kws argument, which will be sent to the underlying call to plt.hist. Thus, this should do what you want:
sns.distplot(df.x, bins=4, hist_kws={'weights':df.wt.values})
I solved this problem by resampling the data points based on their weight.
You can do it like this:
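from bisect import bisect
from random import random

import numpy as np
import seaborn as sns

def weighted_choice(choices):
    # choices is a sequence of (value, weight) pairs
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    # draw a point in [0, total) and find the interval it falls in
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]

# each sample is ([x, y], weight)
samples = [([5, 0.5], 0.1), ([0, 10], 0.3), ([0, -4], 0.3)]
choices = np.array([weighted_choice(samples) for c in range(1000)])
# seaborn 0.11+ spelling; older releases took positional x, y and shade=True
sns.kdeplot(x=choices[:, 0], y=choices[:, 1], fill=True)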
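Resampling only approximates the weighted density. A possible alternative is to weight each kernel directly. The following is a minimal one-dimensional sketch in plain NumPy, not something pandas or seaborn does for you here: the function name weighted_gaussian_kde, the fixed bandwidth of 0.5, and the evaluation grid are all assumptions chosen for illustration.

```python
import numpy as np

def weighted_gaussian_kde(x, weights, grid, bandwidth):
    """Evaluate a weighted Gaussian KDE on `grid` (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the density integrates to 1
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to data point j
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # weighted sum of the per-point Gaussian bumps
    return kernels @ w

# Toy data from the question.
x = np.array([1, 2, 3, 4])
wt = np.array([7, 5, 3, 1])

grid = np.linspace(-4, 9, 1301)  # assumed evaluation range
density = weighted_gaussian_kde(x, wt, grid, bandwidth=0.5)
```

The resulting density array can be drawn with any line plot of grid against density. The bandwidth is fixed by hand here; a data-driven rule would be needed for real data.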