I would like to compute an RBF or "Gaussian" kernel for a data matrix X with n rows and d columns. The resulting n × n kernel matrix is given by:

K[i,j] = var * exp(-gamma * ||X[i] - X[j]||^2)

where var and gamma are scalars.

What is the fastest way to do this in Python?
RBF is the default kernel used within sklearn's SVM classification algorithm and can be described with the following formula:

K(x, x') = exp(-gamma * ||x - x'||^2)

where gamma can be set manually and has to be greater than 0.
The RBF kernel is popular because of its similarity to the k-Nearest Neighbors algorithm. It retains the advantages of k-NN while avoiding its space-complexity problem: an RBF-kernel support vector machine only needs to store the support vectors, not the entire training dataset.
Compared to a purely linear model, the only difference is the kernel matrix K appearing in the regularisation term. The key theoretical advantage of the kernel approach is that it allows you to interpret a non-linear model as a linear model operating on a fixed non-linear transformation of the inputs, one that does not depend on the particular sample of data.
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.
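To make the formula concrete, here is a minimal sketch of the kernel value for a single pair of vectors, using the var and gamma scalars from the question (the helper name rbf is just for illustration):

import numpy as np

def rbf(x, y, var, gamma):
    # var * exp(-gamma * ||x - y||^2) for a single pair of vectors x, y
    return var * np.exp(-gamma * np.sum((x - y) ** 2))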
I am going to present four different methods for computing such a kernel, followed by a comparison of their run-time.
numpy

Here, I use the fact that ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y.
import numpy as np

# Row-wise squared norms: ||X[i]||^2 for every i
X_norm = np.sum(X ** 2, axis=-1)
# ||X[i] - X[j]||^2 = ||X[i]||^2 + ||X[j]||^2 - 2 * X[i]^T X[j], via broadcasting
K = var * np.exp(-gamma * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T)))
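One caveat worth noting: floating-point cancellation can make the broadcasted expression slightly negative for identical or near-identical rows, which pushes the corresponding kernel entries above var. If exact diagonal values matter, a hedged variant clamps the squared distances at zero first:

import numpy as np

X_norm = np.sum(X ** 2, axis=-1)
# Clamp tiny negative values caused by cancellation before exponentiating
sq_dists = np.maximum(X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T), 0)
K = var * np.exp(-gamma * sq_dists)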
numexpr

numexpr is a Python package that allows for efficient and parallelized array operations on numpy arrays. We can use it as follows to perform the same computation as above:
import numpy as np
import numexpr as ne

# Row-wise squared norms, as in the numpy version
X_norm = np.sum(X ** 2, axis=-1)
# numexpr compiles the expression and evaluates it in one parallel pass
K = ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
    'A': X_norm[:, None],
    'B': X_norm[None, :],
    'C': np.dot(X, X.T),
    'g': gamma,
    'v': var,
})
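By default, numexpr parallelizes across all cores it detects, so the speed-up over plain numpy depends on the machine. The thread count can also be capped explicitly, which is worth knowing when benchmarking; a small sketch, assuming the standard numexpr API:

import numexpr as ne

print(ne.detect_number_of_cores())  # number of cores numexpr detected
ne.set_num_threads(4)               # cap the worker threads; returns the old setting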
scipy.spatial.distance.pdist

We could also use scipy.spatial.distance.pdist to compute a non-redundant array of pairwise squared euclidean distances, compute the kernel on that array, and then transform it to a square matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist returns the condensed (upper-triangular) squared distances;
# squareform expands the kernel values into a symmetric matrix
K = squareform(var * np.exp(-gamma * pdist(X, 'sqeuclidean')))
# squareform leaves zeros on the diagonal, but K[i,i] should be var * exp(0) = var
K[np.arange(K.shape[0]), np.arange(K.shape[1])] = var
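Since the diagonal fix is easy to get wrong, a quick sanity check against a brute-force double loop on a small input is cheap insurance (a hedged sketch, not part of the benchmark):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X_small = np.random.randn(50, 8)
K_fast = squareform(5.0 * np.exp(-0.01 * pdist(X_small, 'sqeuclidean')))
np.fill_diagonal(K_fast, 5.0)
# Brute-force reference: evaluate the kernel formula pair by pair
K_ref = np.array([[5.0 * np.exp(-0.01 * np.sum((a - b) ** 2)) for b in X_small]
                  for a in X_small])
assert np.allclose(K_fast, K_ref)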
sklearn.metrics.pairwise.rbf_kernel

sklearn provides a built-in method for direct computation of an RBF kernel:
from sklearn.metrics.pairwise import rbf_kernel

# rbf_kernel computes exp(-gamma * ||x - y||^2); it has no variance
# parameter, so the var factor is applied on top
K = var * rbf_kernel(X, gamma=gamma)
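Note that if gamma is omitted, rbf_kernel falls back to 1 / n_features, so passing it explicitly (as above) is the safer choice. A quick equivalence check against the numpy method, as a hedged sketch:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X_small = np.random.randn(50, 8)
X_norm = np.sum(X_small ** 2, axis=-1)
K_np = np.exp(-0.01 * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X_small, X_small.T)))
# Should match rbf_kernel with the same gamma (up to floating-point noise)
assert np.allclose(K_np, rbf_kernel(X_small, gamma=0.01))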
Run-time comparison

I use 25,000 random samples with 512 dimensions for testing and perform the experiments on an Intel Core i7-7700HQ (4 cores @ 2.8 GHz). More precisely:

X = np.random.randn(25000, 512)
gamma = 0.01
var = 5.0
Each method is run 7 times, and the mean and standard deviation of the time per execution are reported.
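The timings below can be reproduced with something like the following sketch, which uses the standard-library timeit module; the 7-run, one-execution-per-run protocol is an assumption matching the description above:

import timeit
import numpy as np

X = np.random.randn(25000, 512)
gamma, var = 0.01, 5.0

def run_numpy():
    # the numpy variant from above, wrapped in a zero-argument callable for timeit
    X_norm = np.sum(X ** 2, axis=-1)
    return var * np.exp(-gamma * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T)))

# 7 runs of one execution each, mirroring the protocol described above
times = timeit.repeat(run_numpy, repeat=7, number=1)
print(f'{np.mean(times):.2f} s ± {np.std(times):.2f} s per run')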
| Method                              | Time (mean ± std. dev.) |
|-------------------------------------|-------------------------|
| numpy | 24.2 s ± 1.06 s |
| numexpr | 8.89 s ± 314 ms |
| scipy.spatial.distance.pdist | 2min 59s ± 312 ms |
| sklearn.metrics.pairwise.rbf_kernel | 13.9 s ± 757 ms |
First of all, scipy.spatial.distance.pdist is surprisingly slow. numexpr is almost three times faster than the pure numpy method, but this speed-up factor will vary with the number of available CPUs. sklearn.metrics.pairwise.rbf_kernel is not the fastest method, but it is only slightly slower than numexpr.