
PDF estimation in Scikit-Learn KDE

I am trying to compute a PDF estimate from a KDE fitted with the scikit-learn module. I have seen two variants of scoring and I am trying both: statements A and B below.

Statement A results in the following error:

AttributeError: 'KernelDensity' object has no attribute 'tree_'

Statement B results in the following error:

ValueError: query data dimension must match training data dimension

It seems like a silly error, but I cannot figure it out. Please help. The code is below:

from sklearn.neighbors import KernelDensity
import numpy

# d is my 1-D array data
xgrid = numpy.linspace(d.min(), d.max(), 1000)

density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d)

# statement A
density_score = KernelDensity(kernel='gaussian', bandwidth=0.08804).score_samples(xgrid)

# statement B
density_score = density.score_samples(xgrid)

density_score = numpy.exp(density_score)

If it helps, I am using scikit-learn version 0.15.2. I have done this successfully with scipy.stats.gaussian_kde, so there is no problem with the data.

mlworker asked Dec 17 '14 06:12




2 Answers

With statement B, I had the same issue with this error:

 ValueError: query data dimension must match training data dimension

The issue here is that your data is a 1-D array, but when you pass it to fit(), it is interpreted as a single data point with many dimensions. For example, if your training data has 100000 points, d should have shape 100000x1 (100000 samples, 1 feature), but fit() treats it as 1x100000.

So you should reshape d before fitting, d.reshape(-1,1), and do the same for the grid, xgrid.reshape(-1,1):

density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d.reshape(-1,1))
density_score = density.score_samples(xgrid.reshape(-1,1))

Note: The issue with statement A is that you are calling score_samples on an object that has not been fitted yet.
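
For reference, here is a minimal end-to-end sketch of the reshape fix. The data is synthetic (a stand-in for the asker's d, which is not shown in the question), and the bandwidth of 0.3 is chosen only for this illustration. The final area check is a rough sanity test that the exponentiated scores behave like a PDF on the grid.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic 1-D data standing in for the asker's `d` (assumption for illustration)
rng = np.random.default_rng(0)
d = rng.normal(loc=0.0, scale=1.0, size=500)

xgrid = np.linspace(d.min(), d.max(), 1000)

# scikit-learn expects shape (n_samples, n_features), so turn both arrays into columns
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(d.reshape(-1, 1))
log_pdf = kde.score_samples(xgrid.reshape(-1, 1))  # log-density at each grid point
pdf = np.exp(log_pdf)                              # back to a density

# Trapezoidal area under the curve on the grid; it should be close to 1
# (slightly below, because the tails outside [d.min(), d.max()] are cut off)
dx = xgrid[1] - xgrid[0]
area = float(np.sum((pdf[:-1] + pdf[1:]) / 2.0) * dx)
print(pdf.shape, area)
```

If the area is far from 1, the usual suspects are a grid that does not cover the data or arrays that were not reshaped consistently.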

Vahid Mirjalili answered Oct 17 '22 02:10



You need to call fit() before you can evaluate the density with score_samples(). In statement A, a new KernelDensity object is created and scored without ever being fitted.
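
A small sketch of the difference, using made-up toy data. In the 0.15.2 release from the question the failure shows up as an AttributeError on tree_. I believe newer releases raise NotFittedError instead, which also subclasses AttributeError, but treat that as an assumption about your installed version.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Toy query grid and training data, both shaped (n_samples, 1); values are illustrative
xgrid = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
train = np.array([[0.0], [1.0], [2.0]])

# Scoring an estimator that was never fitted fails
unfitted = KernelDensity(kernel="gaussian", bandwidth=0.5)
try:
    unfitted.score_samples(xgrid)
    raised = False
except AttributeError as exc:
    raised = True
    print("unfitted estimator:", type(exc).__name__)

# Fitting first makes the same call succeed
fitted = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train)
scores = fitted.score_samples(xgrid)  # log-densities, one per grid point
print(scores.shape)
```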

user1793558 answered Oct 17 '22 00:10
