 

How to normalize the KDE of scikit-learn?


Let's say I have an array of shape (100000, 1), representing samples of a variable X uniformly distributed between 0 and 1. I want to approximate the probability density of this variable, and I use scikit-learn's KernelDensity to do that.

The problem is that the result I get is not normalized: the integral of the probability density does not equal 1. What should I do to normalize it automatically? Am I doing something wrong?

import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, **kwargs):
    """
    Kernel Density Estimation with Scikit-learn

    Parameters
    ----------
    data : numpy.array
        Data points used to compute a density estimator. It
        has `n x p` dimensions, representing n points and p
        variables.
    grid : numpy.array
        Data points at which the density will be estimated. It
        has `m x p` dimensions, representing m points and p
        variables.

    Returns
    -------
    out : numpy.array
        Density estimate. Has `m x 1` dimensions
    """
    kde_skl = KernelDensity(**kwargs)
    kde_skl.fit(data)
    # score_samples() returns the log of the probability density
    log_pdf = kde_skl.score_samples(grid)
    return np.exp(log_pdf) 

X = np.random.uniform(0,1,1000).reshape(-1,1)
X1 = np.linspace(0,1,100)[:,np.newaxis]

kde_sklearn(X,X1,kernel='tophat')

Out[43]: 
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

I expected a vector of ones, since the density of a uniform distribution on [0, 1] is 1 everywhere and its integral over [0, 1] equals 1.

asked Aug 09 '19 by Raphael Benezra

2 Answers

The problem isn't normalization, as I can show with an example. Suppose I run the following code, which fits a KDE to samples from a standard normal distribution:

import numpy as np
import sklearn.neighbors as sn

# Sample from a standard normal distribution
XX = np.random.randn(1000).reshape(-1, 1)

# Fit a KDE
kde_sklg = sn.KernelDensity()
kde_sklg.fit(XX)

# Get estimated densities
XX1 = np.linspace(-4.0, 4.0, 100)[:, np.newaxis]
gdens = np.exp(kde_sklg.score_samples(XX1))

I can then estimate the area under the PDF with the trapezoid rule as follows:

# Trapezoid rule: accumulate the area of each slice between grid points
my_area = 0.0
for i in range(1, gdens.shape[0]):
    my_area += 0.5*(gdens[i] + gdens[i-1])*(XX1[i,0] - XX1[i-1,0])

The estimated area (my_area) that I get is about 0.996, pretty close to 1.
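Equivalently, NumPy's trapezoid integrator gives the same number in one call. A quick check, reusing gdens and XX1 from above:

# One-call trapezoid rule over the same grid; matches the loop above (~0.996)
print(np.trapz(gdens, XX1.ravel()))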

The problem is that your KDE isn't handling the jumps in your uniform PDF that occur at 0 and 1, so it smears them out too much. About half the area under the KDE's estimate of your PDF then ends up beneath those smeared regions. If you replace the value of your X1 with, say, X2 = np.linspace(-1,2,200)[:,np.newaxis], you can see that there is significant density in the parts of the KDE's estimate of the PDF over the intervals [-1,0] and [1,2].
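To confirm that the mass is displaced rather than lost, you can integrate the estimate over a grid that also covers the smeared regions. A minimal sketch of that check, assuming the question's setup (tophat kernel with scikit-learn's default bandwidth of 1.0, so the estimate extends about one bandwidth past [0, 1]):

import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.uniform(0, 1, 1000).reshape(-1, 1)

# Grid extended one bandwidth past [0, 1] to cover the smeared tails
X2 = np.linspace(-1, 2, 200)[:, np.newaxis]

kde = KernelDensity(kernel='tophat').fit(X)  # default bandwidth is 1.0
dens = np.exp(kde.score_samples(X2))

# Integral over the full (extended) support comes out close to 1
print(np.trapz(dens, X2.ravel()))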

answered Sep 30 '22 by jjramsey


The posted answers are not clear in my opinion, so I am providing another one.

Briefly, it is the integral of the density that equals 1, not the density values themselves. Below I show two ways to compute the integral, which is indeed (approximately) equal to 1.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

np.random.seed(1)

# some uniform data
X = np.random.uniform(-5,5,100).reshape(-1,1)

# grid to be used later
grid = np.linspace(-5,5,1000)[:,np.newaxis]

# fit using the data
kde = KernelDensity(kernel = 'tophat', bandwidth= 0.5).fit(X)

# get the log density at each grid point
log_dens = kde.score_samples(grid)

# transform log density to density
probs = np.exp(log_dens)

# Integrate
print(np.trapz(probs.ravel(), grid.ravel()))
0.9732232232232225

plt.hist(X, density=True, bins=30)
plt.plot(grid.ravel(),probs.ravel())
plt.show()

Note that, since the grid has a uniform step, another way to get the integral is the following Riemann sum:

np.sum(probs*np.diff(grid.ravel())[0])
0.9732232232232225
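As for why the result is 0.973 rather than exactly 1: with the tophat kernel and bandwidth 0.5, the estimate extends half a unit beyond the data range, so part of the mass lies outside [-5, 5]. A small sketch, reusing kde from above with a grid widened by the bandwidth:

# Widen the grid by the bandwidth (0.5) so the smeared tails are included
wide_grid = np.linspace(-5.5, 5.5, 1100)[:, np.newaxis]
wide_probs = np.exp(kde.score_samples(wide_grid))
print(np.trapz(wide_probs.ravel(), wide_grid.ravel()))  # closer to 1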

[Plot: histogram of X (density=True) with the KDE estimate overlaid]

answered Sep 30 '22 by seralouk