I would like to compute an RBF or "Gaussian" kernel for a data matrix X with n rows and d columns. The resulting n × n kernel matrix is given by:

K[i,j] = var * exp(-gamma * ||X[i] - X[j]||^2)

where var and gamma are scalars.

What is the fastest way to do this in Python?
RBF is the default kernel used within sklearn's SVM classification algorithm and can be described with the following formula:

K(x, x') = exp(-gamma * ||x - x'||^2)

where gamma can be set manually and has to be greater than 0.
The RBF kernel is popular because of its similarity to the k-Nearest Neighbors algorithm. It retains the advantages of k-NN while avoiding its space-complexity problem: an RBF-kernel support vector machine only needs to store the support vectors, not the entire training dataset.
Compared to a purely linear model, the only difference is the kernel matrix K appearing in the regularisation term. The key theoretical advantage of the kernel approach is that it allows you to interpret a non-linear model as a linear model operating on a fixed non-linear transformation of the inputs, one that does not depend on the particular sample of data.
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.
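To make the formula concrete, here is a minimal sketch of the kernel value for a single pair of vectors, using the var and gamma scalars from the question (the helper name rbf is just for illustration):

import numpy as np

def rbf(x, y, var, gamma):
    # var * exp(-gamma * ||x - y||^2) for a single pair of vectors x, y
    return var * np.exp(-gamma * np.sum((x - y) ** 2))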
I am going to present four different methods for computing such a kernel, followed by a comparison of their run-time.
numpy

Here, I use the fact that ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y.
import numpy as np

# Row-wise squared norms: ||X[i]||^2 for every i
X_norm = np.sum(X ** 2, axis=-1)
# ||X[i] - X[j]||^2 = ||X[i]||^2 + ||X[j]||^2 - 2 * X[i]^T X[j], via broadcasting
K = var * np.exp(-gamma * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T)))
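One caveat worth noting: floating-point cancellation can make the broadcasted expression slightly negative for identical or near-identical rows, which pushes the corresponding kernel entries above var. If exact diagonal values matter, a hedged variant clamps the squared distances at zero first:

import numpy as np

X_norm = np.sum(X ** 2, axis=-1)
# Clamp tiny negative values caused by cancellation before exponentiating
sq_dists = np.maximum(X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T), 0)
K = var * np.exp(-gamma * sq_dists)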
numexpr

numexpr is a Python package that allows for efficient and parallelized array operations on numpy arrays. We can use it as follows to perform the same computation as above:
import numpy as np
import numexpr as ne

# Row-wise squared norms, as in the numpy version
X_norm = np.sum(X ** 2, axis=-1)
# numexpr compiles the expression and evaluates it in one parallel pass
K = ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
    'A': X_norm[:, None],
    'B': X_norm[None, :],
    'C': np.dot(X, X.T),
    'g': gamma,
    'v': var,
})
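By default, numexpr parallelizes across all cores it detects, so the speed-up over plain numpy depends on the machine. The thread count can also be capped explicitly, which is worth knowing when benchmarking; a small sketch, assuming the standard numexpr API:

import numexpr as ne

print(ne.detect_number_of_cores())  # number of cores numexpr detected
ne.set_num_threads(4)               # cap the worker threads; returns the old setting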
scipy.spatial.distance.pdist

We could also use scipy.spatial.distance.pdist to compute a non-redundant array of pairwise squared euclidean distances, compute the kernel on that array, and then transform it to a square matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist returns the condensed (upper-triangular) squared distances;
# squareform expands the kernel values into a symmetric matrix
K = squareform(var * np.exp(-gamma * pdist(X, 'sqeuclidean')))
# squareform leaves zeros on the diagonal, but K[i,i] should be var * exp(0) = var
K[np.arange(K.shape[0]), np.arange(K.shape[1])] = var
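Since the diagonal fix is easy to get wrong, a quick sanity check against a brute-force double loop on a small input is cheap insurance (a hedged sketch, not part of the benchmark):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X_small = np.random.randn(50, 8)
K_fast = squareform(5.0 * np.exp(-0.01 * pdist(X_small, 'sqeuclidean')))
np.fill_diagonal(K_fast, 5.0)
# Brute-force reference: evaluate the kernel formula pair by pair
K_ref = np.array([[5.0 * np.exp(-0.01 * np.sum((a - b) ** 2)) for b in X_small]
                  for a in X_small])
assert np.allclose(K_fast, K_ref)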
sklearn.metrics.pairwise.rbf_kernel

sklearn provides a built-in method for direct computation of an RBF kernel:
from sklearn.metrics.pairwise import rbf_kernel

# rbf_kernel computes exp(-gamma * ||x - y||^2); it has no variance
# parameter, so the var factor is applied on top
K = var * rbf_kernel(X, gamma=gamma)
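Note that if gamma is omitted, rbf_kernel falls back to 1 / n_features, so passing it explicitly (as above) is the safer choice. A quick equivalence check against the numpy method, as a hedged sketch:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X_small = np.random.randn(50, 8)
X_norm = np.sum(X_small ** 2, axis=-1)
K_np = np.exp(-0.01 * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X_small, X_small.T)))
# Should match rbf_kernel with the same gamma (up to floating-point noise)
assert np.allclose(K_np, rbf_kernel(X_small, gamma=0.01))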
Run-time comparison

I use 25,000 random samples with 512 dimensions for testing and perform the experiments on an Intel Core i7-7700HQ (4 cores @ 2.8 GHz). More precisely:

X = np.random.randn(25000, 512)
gamma = 0.01
var = 5.0
Each method is run 7 times, and the mean and standard deviation of the time per execution are reported.
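The timings below can be reproduced with something like the following sketch, which uses the standard-library timeit module; the 7-run, one-execution-per-run protocol is an assumption matching the description above:

import timeit
import numpy as np

X = np.random.randn(25000, 512)
gamma, var = 0.01, 5.0

def run_numpy():
    # the numpy variant from above, wrapped in a zero-argument callable for timeit
    X_norm = np.sum(X ** 2, axis=-1)
    return var * np.exp(-gamma * (X_norm[:, None] + X_norm[None, :] - 2 * np.dot(X, X.T)))

# 7 runs of one execution each, mirroring the protocol described above
times = timeit.repeat(run_numpy, repeat=7, number=1)
print(f'{np.mean(times):.2f} s ± {np.std(times):.2f} s per run')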
| Method                              | Time (mean ± std. dev.) |
|-------------------------------------|-------------------------|
| numpy | 24.2 s ± 1.06 s |
| numexpr | 8.89 s ± 314 ms |
| scipy.spatial.distance.pdist | 2min 59s ± 312 ms |
| sklearn.metrics.pairwise.rbf_kernel | 13.9 s ± 757 ms |
First of all, scipy.spatial.distance.pdist is surprisingly slow. numexpr is almost three times faster than the pure numpy method, but this speed-up factor will vary with the number of available CPUs. sklearn.metrics.pairwise.rbf_kernel is not the fastest method, but it is only slightly slower than numexpr.