I'm attempting to compare the performance of sklearn.neighbors.KernelDensity versus scipy.stats.gaussian_kde for a two-dimensional array.
From this article I see that the bandwidths (bw) are treated differently in each function. The article gives a recipe for setting the correct bw in scipy so that it is equivalent to the one used in sklearn. Basically, it divides the bw by the sample standard deviation. The result is this:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
where x is the sample array I'm using to obtain the KDE. This works just fine in 1D, but I can't make it work in 2D.
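For reference, here's a minimal 1D sketch of the equivalence (my own example, not from the article):
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
x = np.random.normal(0., 1., 1000)
bw = 0.15
# scipy's bw_method acts as a factor on the sample std, so divide it out.
kde_scipy = stats.gaussian_kde(x, bw_method=bw / x.std(ddof=1))
# sklearn's bandwidth is absolute.
kde_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x[:, None])
pt = 0.5
print(kde_scipy(pt)[0], np.exp(kde_sk.score_samples([[pt]]))[0])  # should match up to float error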
Here's a MWE of what I got:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 =', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(values.T)  # fit() expects shape (n_samples, n_features)
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print('iso2 =', np.exp(iso2[0]))
(iso2 is exponentiated because sklearn returns the log of the density values.)
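As a quick sanity check of that convention (a toy example of my own, reusing the imports above), exponentiating score_samples should reproduce a hand-computed kernel average:
# Two 1D points with a unit-bandwidth Gaussian kernel (toy data).
pts = np.array([[0.0], [1.0]])
kd = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(pts)
log_d = kd.score_samples([[0.0]])[0]
# Hand-computed density at 0: mean of two unit Gaussians centered at 0 and 1.
by_hand = np.mean(stats.norm.pdf(0.0, loc=[0.0, 1.0], scale=1.0))
print(np.exp(log_d), by_hand)  # both ~0.3205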
The results I get for iso1 and iso2 are different, and I'm lost as to how I should adjust the bandwidth (in either function) to make them equal (as they should be).
Add:
I was advised over at the sklearn chat (by ep) that I should scale the values in (x, y) before calculating the kernel with scipy in order to obtain comparable results with sklearn. So this is what I did:
# Scale values.
x_val_sca = np.asarray(values[0]) / np.asarray(values).std(axis=1)[0]
y_val_sca = np.asarray(values[1]) / np.asarray(values).std(axis=1)[1]
values = [x_val_sca, y_val_sca]
# bw_value is the bandwidth defined above, i.e. 0.15 scaled by the sample std.
kernel = stats.gaussian_kde(values, bw_method=bw_value)
i.e., I scaled both dimensions before getting the kernel with scipy, while leaving the line that obtains the kernel in sklearn untouched.
This gave better results, and the kernels are now much closer (in the output plot, the red dot is the (x1, y1) point in the code), but there are still differences in the shapes of the density estimates, albeit very small ones. Perhaps this is the best that can be achieved?
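One likely source of the residual mismatch (my own reasoning, not from the chat advice): after scaling the data, the KDE estimates the density of the scaled variable, so the evaluation point has to be scaled too, and the result corrected by the change-of-variables Jacobian. A sketch, where orig_values stands for the unscaled data (the name is mine):
# Per-axis stds used for the scaling above.
sx, sy = np.asarray(orig_values).std(axis=1)
kernel_sca = stats.gaussian_kde(
    [orig_values[0] / sx, orig_values[1] / sy], bw_method=bw_value)
# If u = x/sx and v = y/sy, then p(x, y) = p_scaled(u, v) / (sx * sy),
# so evaluate at the scaled point and divide by the Jacobian.
iso_scipy = kernel_sca((x1 / sx, y1 / sy)) / (sx * sy)
Even with this correction, exact agreement shouldn't be expected in 2D: scipy's gaussian_kde scales the full sample covariance matrix by the bandwidth factor (an anisotropic kernel), while sklearn's KernelDensity uses a single isotropic bandwidth, so the two can only coincide when the data covariance is close to a multiple of the identity.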
A couple of years later I tried this and think I got it to work with no re-scaling needed for the data. Bandwidth values do need some scaling though:
# For sklearn
bw = 0.15
# For scipy
bw = 0.15/x.std(ddof=1)
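For what it's worth, my understanding of why this scaling is needed: scipy treats bw_method as a multiplicative factor on the sample standard deviation, while sklearn's bandwidth is an absolute kernel width. A quick 1D check using attributes gaussian_kde exposes:
x = np.random.normal(0., 1., 1000)
bw = 0.15
kde = stats.gaussian_kde(x, bw_method=bw / x.std(ddof=1))
# gaussian_kde stores covariance = data_covariance * factor**2, so the
# effective absolute bandwidth is its square root:
print(np.sqrt(kde.covariance[0, 0]))  # 0.15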
The evaluation of both KDEs for the same point is not exactly equal. For example, here's an evaluation for the (x1, y1) point:
iso1 = 0.00984751705005 # Scipy
iso2 = 0.00989788224787 # Sklearn
but I guess it's close enough.
Here's the MWE for the 2D case, along with the output plots which, as far as I can see, look almost exactly the same:
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15
# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(
    values, bw_method=bw / np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print('iso1 =', iso1[0])
# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(values.T)
# Get KDE value for the point. Use exponential since sklearn returns the
# log values
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print('iso2 =', iso2[0])
# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)
# Scipy
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)
# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(positions.T))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)
fig.tight_layout()
plt.savefig('KDEs', dpi=300, bbox_inches='tight')