Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Resampling a KDE (Kernel density estimation) in statsmodels

I am interested in using a sample of points to construct a KDE, and then using that KDE to resample points. scipy.stats.gaussian_kde offers a very simple way to do this. For example, sampling from a gaussian distribution:

import numpy as np
from scipy.stats import gaussian_kde, norm

sampled = np.random.normal(loc = 0, scale = 1, size = 1000)
kde = gaussian_kde(sampled, bw_method = 'silverman')
resampled = kde.resample(1000)

One flaw with scipy.stats.gaussian_kde is that it offers limited choices for bandwidth selection. Reading through this, I was pointed to statsmodels.nonparametric.kernel_density.KDEMultivariate (more information here). This lets me use cross validation to estimate the optimal bandwidth, which is more sophisticated if the underlying pdf you are trying to approximate is not unimodal. For example, using a sum of two gaussians, I can construct a KDE using KDEMultivariate as follows:

from statsmodels.nonparametric.kernel_density import KDEMultivariate
sampled = np.concatenate((np.random.normal(loc = -3, scale = 1, size = 1000), \
                          np.random.normal(loc = 3, scale = 1, size = 1000)))
kde = KDEMultivariate(sampled, 'c', bw = 'cv_ml')

Exploring higher dimensional data with an arbitrary underlying pdf, it is clear that KDEMultivariate is able to produce a PDF that is much more representative of the original PDF. But there is a big issue I am running into -- KDEMultivariate has no kde.resample() method, and I am therefore unable to resample points from my new KDE. Is there a simple and efficient way to resample from a KDE constructed using statsmodels.nonparametric.kernel_density.KDEMultivariate?

like image 447
Enrique Welsh Avatar asked Oct 17 '22 18:10

Enrique Welsh


1 Answers

Using motivation from gaussian_kde in scipy, I wrote a simple resampling procedure. In statsmodels, the bandwidth corresponds to the SD of the gaussian kernels in each dimension. In scipy, the bandwidth^2 is multiplied by the data covariance to construct the covariance matrix.

def resample(kde, size):
    n, d = kde.data.shape
    indices = np.random.randint(0, n, size)
    cov = np.diag(kde.bw)**2
    means = kde.data[indices, :]
    norm = np.random.multivariate_normal(np.zeros(d), cov, size)
    return np.transpose(means + norm)

This takes an instance of KDEMultivariate, and resamples it by selecting random kernels, then sampling from those kernels using the fact that each kernel follows a multivariate normal distribution.

like image 154
Peter Avatar answered Oct 21 '22 09:10

Peter