Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Weighted Gaussian kernel density estimation in `python`

Update: Weighted samples are now supported by scipy.stats.gaussian_kde. See here and here for details.

It is currently not possible to use scipy.stats.gaussian_kde to estimate the density of a random variable based on weighted samples. What methods are available to estimate densities of continuous random variables based on weighted samples?

like image 987
Till Hoffmann Avatar asked Dec 23 '14 16:12

Till Hoffmann


People also ask

What is Gaussian kernel density estimation?

Representation of a kernel-density estimate using Gaussian kernels. Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination.

What is kernel density estimation in Python?

Kernel density estimation (KDE) is in some senses an algorithm which takes the mixture-of-Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density.

How are the kernel density weights calculated?

The KDE is calculated by weighting the distances of all the data points we've seen for each location on the blue line. If we've seen more points nearby, the estimate is higher, indicating that probability of seeing a point at that location.

How do you plot KDE in Python?

kdeplot() function is used to plot the data against a single/univariate variable. It represents the probability distribution of the data values as the area under the plotted curve. In the above example, we have generated some random data values using the numpy. random.


2 Answers

Neither sklearn.neighbors.KernelDensity nor statsmodels.nonparametric seem to support weighted samples. I modified scipy.stats.gaussian_kde to allow for heterogeneous sampling weights and thought the results might be useful for others. An example is shown below.

example

An ipython notebook can be found here: http://nbviewer.ipython.org/gist/tillahoffmann/f844bce2ec264c1c8cb5

Implementation details

The weighted arithmetic mean is

weighted arithmetic mean

The unbiased data covariance matrix is then given by unbiased covariance matrix

The bandwidth can be chosen by scott or silverman rules as in scipy. However, the number of samples used to calculate the bandwidth is Kish's approximation for the effective sample size.

like image 77
Till Hoffmann Avatar answered Oct 28 '22 13:10

Till Hoffmann


For univariate distributions you can use KDEUnivariate from statsmodels. It is not well documented, but the fit methods accepts a weights argument. Then you cannot use FFT. Here is an example:

import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate

kde1= KDEUnivariate(np.array([10.,10.,10.,5.]))
kde1.fit(bw=0.5)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support],'x-')

kde1= KDEUnivariate(np.array([10.,5.]))
kde1.fit(weights=np.array([3.,1.]), 
         bw=0.5,
         fft=False)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support], 'o-')

which produces this figure: enter image description here

like image 26
Ramon Crehuet Avatar answered Oct 28 '22 14:10

Ramon Crehuet