Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Relation between sigma and bandwidth in gaussian_filter and gaussian_kde

Applying the functions scipy.ndimage.filters.gaussian_filter and scipy.stats.gaussian_kde over a given set of data can give very similar results if the sigma and bw_method parameters in each function respectively are chosen adequately.

For example, I can obtain for a random 2D distribution of points the following plots by setting sigma=2. in the gaussian_filter (left plot) and bw_method=sigma/30. in the gaussian_kde (right plot):

enter image description here

(The MWE is at the bottom of the question)

There's obviously a relation between these parameters since one applies a Gaussian filter and the other one a Gaussian Kernel Density Estimator on the data.

The definition of each parameter is:

  • scipy.ndimage.filters.gaussian_filter, sigma:

sigma : scalar or sequence of scalars Standard deviation for Gaussian kernel. The standard deviations of the Gaussian filter are given for each axis as a sequence, or as a single number, in which case it is equal for all axes.

This one I can understand given the definition of the Gaussian operator:

enter image description here

  • scipy.stats.gaussian_kde, bw_method:

bw_method : str, scalar or callable, optional The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If None (default), ‘scott’ is used. See Notes for more details.

In this case let's assume the input for bw_method is a scalar (float) so as to be comparable with sigma. Here's where I get lost since I can find no information about this kde.factor parameter anywhere.

What I'd like to know is the precise mathematical equation that connects both these parameters (ie: sigma and bw_method when a float is used) if possible.


MWE:

import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage.filters import gaussian_filter
import matplotlib.pyplot as plt

def rand_data():
    return np.random.uniform(low=1., high=200., size=(1000,))

# Generate 2D data.
x_data, y_data = rand_data(), rand_data()
xmin, xmax = min(x_data), max(x_data)
ymin, ymax = min(y_data), max(y_data)

# Define grid density.
gd = 100
# Define bandwidth
bw = 2.

# Using gaussian_filter
# Obtain 2D histogram.
rang = [[xmin, xmax], [ymin, ymax]]
binsxy = [gd, gd]
hist1, xedges, yedges = np.histogram2d(x_data, y_data, range=rang, bins=binsxy)
# Gaussian filtered histogram.
h_g = gaussian_filter(hist1, bw)

# Using gaussian_kde
values = np.vstack([x_data, y_data])
# Data 2D kernel density estimate.
kernel = gaussian_kde(values, bw_method=bw / 30.)
# Define x,y grid.
gd_c = complex(0, gd)
x, y = np.mgrid[xmin:xmax:gd_c, ymin:ymax:gd_c]
positions = np.vstack([x.ravel(), y.ravel()])
# Evaluate KDE.
z = kernel(positions)
# Re-shape for plotting
z = z.reshape(gd, gd)

# Make plots.
fig, (ax1, ax2) = plt.subplots(1, 2)
# Gaussian filtered 2D histograms.
ax1.imshow(h_g.transpose(), origin='lower')
ax2.imshow(z.transpose(), origin='lower')

plt.show()
like image 579
Gabriel Avatar asked Sep 09 '14 17:09

Gabriel


1 Answers

There is no relationship because you are doing two different things.

With scipy.ndimage.filters.gaussian_filter, you are filtering a 2D variable (an image) with a kernel, and that kernel happens to be a gaussian. It is essentially smoothing the image.

With scipy.stats.gaussian_kde you try to estimate the probability density function of your 2D-variable. The bandwidth (or smoothing parameter) is your integration step, and should be as small as the data allows.

The two images look the same because your uniform distribution, from which you drew the samples, is not that different from a normal distribution. Obviously you'd get a better estimate with a normal kernel function.

You can read about Kernel density estimation.

Edit: In Kernel Density Estimation (KDE), the kernels are scaled such that the bandwidth is the standard deviation of the smoothing kernel. Which bandwidth to use is not obvious as it depends on the data. There exists an optimal choice for univariate data, called the Silverman's rule of thumb.

To summarize, there is no relationship between the standard deviation of a gaussian filter, and the bandwidth of a KDE, because we're talking oranges and apples. However, talking about KDE only, there is a relationship between the KDE bandwidth and the standard deviation of the same KDE kernel. They are equal! Well in truth implementation details differ, and there may be a scaling that depends on the size of the kernel. You could read your specific package gaussian_kde.py

like image 172
Hugues Fontenelle Avatar answered Oct 16 '22 23:10

Hugues Fontenelle