Let's say I have an array of shape (100000, 1), representing samples of a variable X uniformly distributed between 0 and 1. I want to approximate the probability density of this variable, and I use scikit-learn's KernelDensity to do that.
The problem is that the result I get is not normalized: the integral of the probability density does not sum to 1. What should I do to normalize it automatically? Am I doing something wrong?
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, **kwargs):
    """
    Kernel Density Estimation with Scikit-learn

    Parameters
    ----------
    data : numpy.array
        Data points used to compute a density estimator. It
        has `n x p` dimensions, representing n points and p
        variables.
    grid : numpy.array
        Data points at which the density will be estimated. It
        has `m x p` dimensions, representing m points and p
        variables.

    Returns
    -------
    out : numpy.array
        Density estimate. Has `m x 1` dimensions.
    """
    kde_skl = KernelDensity(**kwargs)
    kde_skl.fit(data)
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(grid)
    return np.exp(log_pdf)
X = np.random.uniform(0,1,1000).reshape(-1,1)
X1 = np.linspace(0,1,100)[:,np.newaxis]
kde_sklearn(X,X1,kernel='tophat')
Out[43]:
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
I expected to get a vector of 1s, since the integral of the density should sum to 1.
The problem isn't with normalization, as I can show from an example. Suppose that I run the following code that fits a KDE to samples from a standard normal distribution:
import numpy as np
import sklearn.neighbors as sn
# Sample from a standard normal distribution
XX = np.random.randn(1000).reshape(-1, 1)
# Fit a KDE
kde_sklg = sn.KernelDensity()
kde_sklg.fit(XX)
# Get estimated densities
XX1 = np.linspace(-4.0, 4.0, 100)[:, np.newaxis]
gdens = np.exp(kde_sklg.score_samples(XX1))
I can then estimate the area under the PDF with the trapezoid rule as follows:
my_area = 0.0
for i in range(1, gdens.shape[0]):
    my_area += 0.5 * (gdens[i] + gdens[i-1]) * (XX1[i,0] - XX1[i-1,0])
The estimated area (my_area) that I get is about 0.996, pretty close to 1.
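For reference, NumPy's built-in trapezoid integration gives the same estimate in a single call; a minimal check, reusing the XX1 and gdens arrays from the snippet above:
# np.trapz applies the same trapezoid rule as the loop above
print(np.trapz(gdens, XX1.ravel()))  # about 0.996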
The problem is that your KDE isn't handling the jumps in your uniform PDF that occur at 0 and 1, so it smears them out too much. About half of the area under the KDE's estimate of your PDF then ends up beneath those smeared regions. If you replace your X1 with, say, X2 = np.linspace(-1, 2, 200)[:, np.newaxis], you can see that there is significant density in the parts of the KDE's estimate of the PDF over the intervals [-1, 0] and [1, 2].
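To check this numerically, you can integrate the KDE from the question over a wider grid that covers the smeared tails. A minimal sketch, assuming the same tophat kernel with scikit-learn's default bandwidth of 1.0:
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.uniform(0, 1, 1000).reshape(-1, 1)
kde = KernelDensity(kernel='tophat').fit(X)

# With the default bandwidth of 1.0, the tophat kernel spreads mass
# over roughly [-1, 2]; integrating over that full range recovers ~1
X2 = np.linspace(-1, 2, 200)[:, np.newaxis]
dens = np.exp(kde.score_samples(X2))
print(np.trapz(dens, X2.ravel()))  # close to 1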
The posted answers are not clear in my opinion, so I am providing another answer.
Briefly, it is the integral of the density that sums to 1, not the probability values themselves. Below I show two ways to compute the integral, which indeed comes out (approximately) equal to 1.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

np.random.seed(1)

# some uniform data
X = np.random.uniform(-5, 5, 100).reshape(-1, 1)

# grid to be used later
grid = np.linspace(-5, 5, 1000)[:, np.newaxis]

# fit using the data
kde = KernelDensity(kernel='tophat', bandwidth=0.5).fit(X)

# get log probabilities on the grid
log_dens = kde.score_samples(grid)

# transform log prob to prob
probs = np.exp(log_dens)

# Integrate with the trapezoid rule
print(np.trapz(probs.ravel(), grid.ravel()))
0.9732232232232225
plt.hist(X, density=True, bins=30)
plt.plot(grid.ravel(),probs.ravel())
plt.show()
Note that another way to get the integral is the following, which works because the defined grid has a constant step:
np.sum(probs*np.diff(grid.ravel())[0])
0.9732232232232225
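Finally, if you need the evaluated densities to integrate to exactly 1 on your finite grid (for example, before comparing them against an analytical PDF), one simple workaround is to renormalize them by the computed integral. A minimal sketch reusing probs and grid from the snippet above; this is an illustrative post-processing step, not a scikit-learn option:
# Divide by the numerical integral so the grid values integrate to 1
area = np.trapz(probs.ravel(), grid.ravel())
probs_normalized = probs / area
print(np.trapz(probs_normalized.ravel(), grid.ravel()))  # 1.0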