How to find Local maxima in Kernel Density Estimation?

I'm trying to build a filter (to remove outliers and noise) using kernel density estimation (KDE). I applied KDE to my 3D (d = 3) data points, and that gives me the probability density function (PDF) f(x). As we know, the local maxima of the density estimate f(x) define the centers of the clusters of data points, so my idea is to use those local maxima of f(x) to determine the clusters.

My question is: what method is best suited for finding the local maxima of f(x)? If anyone can provide some example code or ideas, I would really appreciate it.

Here is the code that computes the KDE, which gives f(x) for the 3D data.

import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
data = data.T  # gaussian_kde expects shape (n_dims, n_points)
kde = stats.gaussian_kde(data)

# Build a regular grid spanning the data range in each dimension
minima = data.min(axis=1)
maxima = data.max(axis=1)
space = [np.linspace(mini, maxi, 20) for mini, maxi in zip(minima, maxima)]
grid = np.meshgrid(*space)
coords = np.vstack([g.ravel() for g in grid])  # shape (n_dims, n_grid_points)

# Evaluate the KDE-estimated pdf at each grid coordinate
density = kde(coords)
— asked by jquery404 on Jul 03 '15

People also ask

How is kernel density estimation calculated?

Kernel Density Estimation (KDE) is computed by adding up the kernel values (K) contributed by all data points Xj at the point of interest. The sum is then normalized by dividing by the number of data points.
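For instance, here is a minimal Python sketch of that calculation (an illustration added here, not part of the original answer) for 1-D data with a Gaussian kernel and bandwidth h:

import numpy as np

def kde_at(x, data, h):
    # f(x) = 1/(n*h) * sum_j K((x - X_j)/h), with K the standard normal pdf
    u = (x - data) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.sum() / (len(data) * h)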

What is H in kernel density estimation?

Its kernel density estimator is fh(x) = 1/n Σi Kh(x − xi) = 1/(nh) Σi K((x − xi)/h), where K is the kernel — a non-negative function — and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and defined as Kh(x) = 1/h K(x/h).
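As a quick illustration (an addition, not part of the original answer), scipy's gaussian_kde accepts a scalar bw_method that controls the bandwidth; smaller values give a less smoothed, more peaked estimate:

import numpy as np
from scipy import stats

sample = np.concatenate([np.random.normal(0, 1, 200),
                         np.random.normal(5, 1, 200)])
xs = np.linspace(-4, 9, 400)
for bw in (0.1, 0.5, 1.0):
    kde = stats.gaussian_kde(sample, bw_method=bw)
    print(bw, kde(xs).max())   # peak height typically drops as the bandwidth grows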

What is the Y-axis of kernel density?

The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis.
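A small sketch (added here for illustration) makes the distinction concrete: for a narrow distribution the density values can exceed 1, yet the density still integrates to roughly 1:

import numpy as np
from scipy import stats

narrow = np.random.normal(loc=0.0, scale=0.05, size=1000)
kde = stats.gaussian_kde(narrow)
xs = np.linspace(-0.5, 0.5, 2001)
print(kde(xs).max())                      # typically well above 1
print(kde.integrate_box_1d(-0.5, 0.5))    # close to 1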


3 Answers

You will want to use an algorithm called Mean Shift. It's a clustering algorithm that works by finding the modes (i.e., the maxima of f(x)) of the KDE. Note that the bandwidth set for your KDE will affect the number of modes and their locations. Since you are using Python, there is an implementation in scikit-learn, sketched below.
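As a rough sketch (using the 3-D points from the question and an arbitrary bandwidth of 1.0, which is not from the answer), the scikit-learn version could be used like this:

import numpy as np
from sklearn.cluster import MeanShift

points = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                   [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                   [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])  # shape (n_points, n_dims)

ms = MeanShift(bandwidth=1.0)   # bandwidth plays the role of the KDE bandwidth
ms.fit(points)
print(ms.cluster_centers_)      # estimated modes of the density
print(ms.labels_)               # cluster assignment of each point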

— answered by Raff.Edward on Oct 19 '22


Here is a short function that demonstrates how you could estimate the maximum. Note: the higher no_samples is, the more accurate the estimate.

from scipy.stats import gaussian_kde
import numpy as np

def estimate_maxima(data):
    # Fit a KDE to (1-D) data, evaluate it on a coarse grid over the data range,
    # and return the grid point with the highest estimated density
    kde = gaussian_kde(data)
    no_samples = 10
    samples = np.linspace(min(data), max(data), no_samples)
    probs = kde.evaluate(samples)
    maxima_index = probs.argmax()
    maxima = samples[maxima_index]
    return maxima
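For example (an illustrative addition, not part of the original answer), on synthetic 1-D normal data centred at 5 the function returns a value near 5; raising no_samples inside the function gives a finer grid and a sharper estimate:

data_1d = np.random.normal(loc=5.0, scale=1.0, size=500)
print(estimate_maxima(data_1d))   # should land roughly near 5.0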
— answered by brandon musa on Oct 19 '22


You could use scipy.optimize.

Example on 1D-data:

import numpy as np
from scipy import optimize
from scipy import stats


# Generate some random data
shape, loc, scale = .5, 3, 10
n = 1000
data = np.sort(stats.lognorm.rvs(shape, loc, scale, size=n))

kernel = stats.gaussian_kde(data)
# Minimize the negative instead of maximizing
# Depending on the shape of your data, you might want to set some bounds
opt = optimize.minimize_scalar(lambda x: -kernel(x))
opt

     fun: array([-0.08363781])
    nfev: 21
     nit: 14
 success: True
       x: array([10.77361776])

The actual mode of this distribution is at

mode = scale/np.exp(shape**2) + loc
mode
10.788007830714049

Plotting the results:

import matplotlib.pyplot as plt

data_es = np.linspace(0, data.max(), 201)  # x-axis points
ecdf = (np.arange(n) + 1)/n  # empirical CDF

fig, axes = plt.subplots(2, 1, sharex=True, dpi=300, figsize=(6,7))
axes[0].hist(data, bins=30, density=True, alpha=.5, rwidth=.9)  # histogram of the data
axes[0].plot(data_es, kernel.pdf(data_es), 'C0')  # estimated PDF
axes[0].plot(data_es, stats.lognorm.pdf(data_es, shape, loc, scale), 'k--', alpha=.5)  # true PDF
axes[0].plot(opt.x, kernel.pdf(opt.x), 'C0.')  # estimated mode
axes[0].plot(mode, stats.lognorm.pdf(mode, shape, loc, scale), 'k.', alpha=.5)  # true mode

axes[1].plot(np.sort(data), ecdf)  # estimated CDF
axes[1].plot(opt.x, np.interp(opt.x, np.sort(data), ecdf), 'C0.')  #estimated mode
axes[1].plot(data_es, stats.lognorm.cdf(data_es, shape, loc, scale), 'k--', alpha=.5)  # true CDF
axes[1].plot(mode, stats.lognorm.cdf(mode, shape, loc, scale), 'k.', alpha=.5)  # true mode

fig.tight_layout()

(Plot: histogram with the estimated and true PDF above, empirical and true CDF below, with the estimated and true modes marked.)

As you can see, the estimated mode matches the true one quite well. I assume the approach can be extended to multivariate data with other methods from scipy.optimize (see the sketch below).
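For instance, one way this might look for the 3-D KDE from the question (this sketch is an assumption added here, not part of the answer): minimize the negative density with scipy.optimize.minimize starting from each data point, then deduplicate the resulting local maxima:

import numpy as np
from scipy import optimize, stats

points = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                   [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                   [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
kde = stats.gaussian_kde(points.T)

neg_density = lambda x: -kde(x)[0]   # scalar objective for the optimizer

modes = []
for p in points:
    res = optimize.minimize(neg_density, p, method='Nelder-Mead')
    # keep the result only if it is not (numerically) a duplicate of a known mode
    if res.success and not any(np.allclose(res.x, m, atol=1e-2) for m in modes):
        modes.append(res.x)

print(np.array(modes))   # estimated local maxima of f(x)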

— answered by Aubergine on Oct 19 '22