I've been looking into machine learning recently and am now taking my first steps with scikit and linear regression. Here is my first sample:
from sklearn import linear_model
import numpy as np
X = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
y = [2,4,6,8,10,12,14,16,18,20]
clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.predict([[11]]))
==> 22
The output is 22, as expected (apparently scikit comes up with 2x as the hypothesis function). But when I create a slightly more complicated example with y = [1,4,9,16,25,36,49,64,81,100], my code just produces crazy output. I assumed linear regression would come up with a quadratic function (x^2), but instead I don't know what is going on. The output for 11 is now 99, so I guess my code tries to find some kind of linear function to map all the examples.
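For reference, here is the modified sample (the same code as above with only y changed):
from sklearn import linear_model
X = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
y = [1,4,9,16,25,36,49,64,81,100]
clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.predict([[11]]))
==> 99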
In the tutorial on linear regression that I did, there were examples of polynomial terms, so I assumed scikit's implementation would come up with a correct solution. Am I wrong? If so, how do I teach scikit to consider quadratic, cubic, etc. functions?
LinearRegression fits a linear model to data. In the case of one-dimensional X values like you have above, the result is a straight line (i.e. y = a + b*x). In the case of two-dimensional values, the result is a plane (i.e. z = a + b*x + c*y). So you can't expect a linear regression model to perfectly fit a quadratic curve: it simply doesn't have enough model complexity to do that.
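You can see this directly with your quadratic data (a quick sketch reproducing the behavior you describe): the best-fit straight line comes out as roughly y = 11*x - 22, which is exactly why the prediction at 11 is 99:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.arange(1, 11)[:, None]   # 1..10 as a column vector
y = np.arange(1, 11) ** 2       # 1, 4, 9, ..., 100
line = LinearRegression().fit(X, y)
print(line.coef_, line.intercept_)   # approximately [11.] and -22.0
print(line.predict([[11]]))          # approximately [99.]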
That said, you can cleverly transform your input data in order to fit a quadratic curve with a linear regression model. Consider the 2D case above:
z = a + b*x + c*y
Now let's make the substitution y = x^2. That is, we add a second dimension to our data which contains the quadratic term. Now we have another linear model:
z = a + b*x + c*x^2
The result is a model that is quadratic in x, but still linear in the coefficients! That is, we can solve it easily via linear regression: this is an example of a basis function expansion of the input data. Here it is in code:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10)[:, None]                   # 0..9 as a column vector
y = np.ravel(x) ** 2                         # quadratic target
p = np.array([1, 2])
model = LinearRegression().fit(x ** p, y)    # feature columns are [x, x**2]
model.predict(np.array([[11]]) ** p)         # 2D input: [[11, 121]]
# [121]
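As a quick sanity check (my own addition, not part of the snippet above), you can inspect the fitted coefficients; they recover y = x^2 almost exactly:
print(model.intercept_, model.coef_)   # approximately 0.0 and [0., 1.]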
This is a bit awkward, though, because the model requires 2D input to predict(), so you have to transform the input manually. If you want this transformation to happen automatically, you can use e.g. PolynomialFeatures in a pipeline:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(x, y).predict([[11]])   # the pipeline adds the polynomial terms automatically
# [121]
This is one of the beautiful things about linear models: using basis function expansion like this, they can be very flexible while remaining very fast! You could think about adding columns with cubic, quartic, or other terms, and it's still a linear regression. Or for periodic models, you might think about adding columns of sines, cosines, etc. In the extreme limit of this, the so-called "kernel trick" allows you to effectively add an infinite number of new columns to your data and end up with a model that is very powerful – but still linear and thus still relatively fast! For an example of this type of estimator, take a look at scikit-learn's KernelRidge.
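For instance, here is a minimal sketch of KernelRidge on the same data (the alpha, degree, and coef0 values are just illustrative choices on my part):
from sklearn.kernel_ridge import KernelRidge
# A degree-2 polynomial kernel with light regularization (illustrative settings).
kr = KernelRidge(alpha=1e-3, kernel='polynomial', degree=2, coef0=1)
kr.fit(x, y)
kr.predict([[11]])
# close to [121] (the regularization introduces a small bias)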