What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?
The difference is in feature computation. PolynomialFeatures
explicitly computes polynomial combinations between the input features up to the desired degree while KernelRidge(kernel='poly')
only considers a polynomial kernel (a polynomial representation of feature dot products) which will be expressed in terms of the original features. This document provides a good overview in general.
Regarding the computation we can inspect the relevant parts from the source code:
The computation of the (training) kernel follows a similar procedure: compare Ridge
and KernelRidge
. The major difference is that Ridge
explicitly considers the dot product between whatever (polynomial) features it has received while for KernelRidge
these polynomial features are generated implicitly during the computation. For example consider a single feature x
; with gamma = coef0 = 1
the KernelRidge
computes (x**2 + 1)**2 == (x**4 + 2*x**2 + 1)
. If you consider now PolynomialFeatures
this will provide features x**2, x, 1
and the corresponding dot product is x**4 + x**2 + 1
. Hence the dot product differs by a term x**2
. Of course we could rescale the poly-features to have x**2, sqrt(2)*x, 1
while with KernelRidge(kernel='poly')
we don't have this kind of flexibility. On the other hand the difference probably doesn't matter (in most cases).
Note that also the computation of the dual coefficients is performed in a similar manner: Ridge
and KernelRidge
. Finally KernelRidge
keeps the dual coefficients while Ridge
directly computes the weights.
Let's see a small example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.extmath import safe_sparse_dot
np.random.seed(20181001)
a, b = 1, 4
x = np.linspace(0, 2, 100).reshape(-1, 1)
y = a*x**2 + b*x + np.random.normal(scale=0.2, size=(100,1))
poly = PolynomialFeatures(degree=2, include_bias=True)
xp = poly.fit_transform(x)
print('We can see that the new features are now [1, x, x**2]:')
print(f'xp.shape: {xp.shape}')
print(f'xp[-5:]:\n{xp[-5:]}', end='\n\n')
# Scale the `x` columns so we obtain similar results.
xp[:, 1] *= np.sqrt(2)
ridge = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge.fit(xp, y)
krr = KernelRidge(alpha=0, kernel='poly', degree=2, gamma=1, coef0=1)
krr.fit(x, y)
# Let's try to reproduce some of the involved steps for the different models.
ridge_K = safe_sparse_dot(xp, xp.T)
krr_K = krr._get_kernel(x)
print('The computed kernels are (alomst) similar:')
print(f'Max. kernel difference: {np.abs(ridge_K - krr_K).max()}', end='\n\n')
print('Predictions slightly differ though:')
print(f'Max. difference: {np.abs(krr.predict(x) - ridge.predict(xp)).max()}', end='\n\n')
# Let's see if the fit changes if we provide `x**2, x, 1` instead of `x**2, sqrt(2)*x, 1`.
xp_2 = xp.copy()
xp_2[:, 1] /= np.sqrt(2)
ridge_2 = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge_2.fit(xp_2, y)
print('Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:')
print(f'Max. difference: {np.abs(ridge_2.predict(xp_2) - ridge.predict(xp)).max()}', end='\n\n')
print('Interpretability of the coefficients changes though:')
print(f'ridge.coef_[1:]: {ridge.coef_[0, 1:]}, ridge_2.coef_[1:]: {ridge_2.coef_[0, 1:]}')
print(f'ridge.coef_[1]*sqrt(2): {ridge.coef_[0, 1]*np.sqrt(2)}')
print(f'Compare with: a, b = ({a}, {b})')
plt.plot(x.ravel(), y.ravel(), 'o', color='skyblue', label='Data')
plt.plot(x.ravel(), ridge.predict(xp).ravel(), '-', label='Ridge', lw=3)
plt.plot(x.ravel(), krr.predict(x).ravel(), '--', label='KRR', lw=3)
plt.grid()
plt.legend()
plt.show()
From which we obtain:
We can see that the new features are now [x, x**2]:
xp.shape: (100, 3)
xp[-5:]:
[[1. 1.91919192 3.68329762]
[1. 1.93939394 3.76124885]
[1. 1.95959596 3.84001632]
[1. 1.97979798 3.91960004]
[1. 2. 4. ]]
The computed kernels are (alomst) similar:
Max. kernel difference: 1.0658141036401503e-14
Predictions slightly differ though:
Max. difference: 0.04244651134471766
Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:
Max. difference: 7.15642822779472e-14
Interpretability of the coefficients changes though:
ridge.coef_[1:]: [2.73232239 1.08868872], ridge_2.coef_[1:]: [3.86408737 1.08868872]
ridge.coef_[1]*sqrt(2): 3.86408737392841
Compare with: a, b = (1, 4)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With