 

How does having smaller values for parameters help in preventing over-fitting?

To reduce the problem of over-fitting in linear regression in machine learning, it is suggested to modify the cost function by adding the squares of the parameters to it. This results in smaller values of the parameters.

This is not at all intuitive to me. How can having smaller values for the parameters result in a simpler hypothesis and help prevent over-fitting?
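(For concreteness, the regularized cost being referred to is usually written as the squared error plus a penalty on the parameters. The sketch below is my own illustration, using generic names such as theta and lam that are not from any particular course or library.)

import numpy as np

def ridge_cost(theta, X, y, lam):
    # Squared-error cost plus an L2 penalty on the parameters.
    # theta: parameter vector (theta[0] assumed to be the intercept),
    # X: feature matrix with a leading column of ones, y: targets,
    # lam: regularization strength.
    residuals = X @ theta - y
    mse_term = np.sum(residuals ** 2) / (2 * len(y))
    penalty = lam * np.sum(theta[1:] ** 2)  # the intercept is usually not penalized
    return mse_term + penalty

A larger lam pushes the minimizer towards smaller values of theta[1:], which is the effect the question is asking about.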

Asked Jan 02 '16 by Anant Simran Singh

People also ask

How does the number of observations influence overfitting?

1. When the hypothesis space is richer, overfitting is more likely. 2. When the feature space is larger, overfitting is more likely.

How do you reduce overfitting in regression?

To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data.


2 Answers

I put together a rather contrived example, but hopefully it helps.

import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older scikit-learn versions
from sklearn.preprocessing import PolynomialFeatures

First, build a small linear dataset and split it into training and test sets, 5 points in each:

X,y, c = datasets.make_regression(10,1, noise=5, coef=True, shuffle=True, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=5)

(Plot: the original data points.)

Fit the data with a fifth order polynomial with no regularization.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
        ('poly',  PolynomialFeatures(5)),
        ('model', Ridge(alpha=0.))  # alpha=0 indicates 0 regularization.
    ])

pipeline.fit(X_train,y_train)

Looking at the coefficients

pipeline.named_steps['model'].coef_
pipeline.named_steps['model'].intercept_

# y_pred = -12.82 + 33.59 x + 292.32 x^2 - 193.29 x^3 - 119.64 x^4 + 78.87 x^5

(Plot: fifth-order fit with no regularization.)

Here the model passes through all the training points, but it has large coefficients and misses the test points.
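One way to see this in numbers (my own check, not part of the original figure) is to compare the R^2 score on the train and test splits:

# Near-perfect fit on the 5 training points, but typically a very poor
# (often negative) R^2 on the held-out test points.
print("train R^2:", pipeline.score(X_train, y_train))
print("test  R^2:", pipeline.score(X_test, y_test))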

Let's try again, but this time add some L2 regularization:

pipeline.set_params(model__alpha=1)
pipeline.fit(X_train, y_train)  # refit with the regularization term active

(Plot: fifth-order fit with L2 regularization.)

# y_pred = 6.88 + 26.13 x + 16.58 x^2 + 12.47 x^3 + 5.86 x^4 - 5.20 x^5

Here we see a much smoother curve with less wiggling around. It no longer passes through every training point, and the coefficients are smaller because of the added regularization.
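If you want to see how the strength of the penalty affects this, a small sweep over alpha (an addition of mine, not from the original answer) shows the coefficients shrinking as alpha grows:

# Refit for a few regularization strengths and watch the coefficient
# magnitudes (and usually the test score) change as alpha increases.
for alpha in [0.0, 0.1, 1.0, 10.0]:
    pipeline.set_params(model__alpha=alpha)
    pipeline.fit(X_train, y_train)
    coefs = pipeline.named_steps['model'].coef_
    print(f"alpha={alpha:>4}: max |coef| = {np.abs(coefs).max():.2f}, "
          f"test R^2 = {pipeline.score(X_test, y_test):.2f}")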

Answered Oct 19 '22 by David Maust


This is a bit more complicated. It depends very much on the algorithm you are using.

To make an easy but slightly stupid example: instead of optimising the parameters of the function

  y = a*x1 + b*x2 

you could also optimise the parameters of

  y = 1/a * x1 + 1/b * x2 

Obviously, if you minimise the parameters in the former case, you need to maximise them in the latter case.

The justification for minimising the square of the parameters, which holds for most algorithms, comes from computational learning theory.

Let's assume for the following that you want to learn a function

 f(x) = a + bx + c * x^2 + d * x^3 +....

One can argue that a function where only a is different from zero is more likely than a function where a and b are different from zero, and so on. Following Occam's razor (if you have two hypotheses explaining your data, the simpler one is more likely to be the right one), you should prefer a hypothesis where more of your parameters are zero.

To give an example, let's say your data points are (x,y) = {(-1,0),(1,0)}. Which function would you prefer:

f(x) = 0 

or

f(x) = -1 +  1*x^2

Both functions fit the two points exactly, but the first is simpler. Extending this a bit, you can go from preferring parameters that are zero to preferring parameters that are small.

If you want to try it out, you can sample some data points from a linear function and add a bit of Gaussian noise. If you want a perfect polynomial fit, you need a pretty complicated function, typically with pretty large weights. However, if you apply regularisation you will come close to your data-generating function.
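A minimal sketch of that experiment (my code, not the answerer's; the names are arbitrary): sample noisy points from a straight line, then fit a high-degree polynomial with and without an L2 penalty and compare the learned weights.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.3, size=20)  # a line plus Gaussian noise

for alpha in (0.0, 1.0):  # no regularisation vs. some L2 regularisation
    model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=alpha))
    model.fit(x, y)
    weights = model.named_steps['ridge'].coef_
    print(f"alpha={alpha}: largest |weight| = {np.abs(weights).max():.1f}")

With alpha=0 the degree-9 fit will typically chase the noise and use much larger weights; with the penalty switched on, the learned function stays close to the underlying line.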

But if you want to put your reasoning on rock-solid theoretical foundations, I would recommend applying Bayesian statistics. The idea there is that you define a probability distribution over regression functions. That way you can define for yourself what a "probable" regression function is.
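If you want to play with that idea without deriving anything by hand, one option (my suggestion, not something the answer prescribes) is scikit-learn's BayesianRidge, which places Gaussian priors on the weights and infers the regularisation strength from the data:

import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = np.linspace(-1, 1, 30).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(scale=0.2, size=30)

X_poly = PolynomialFeatures(5, include_bias=False).fit_transform(x)
model = BayesianRidge().fit(X_poly, y)

# The prior pulls the unnecessary higher-order weights towards zero,
# so the posterior mean stays close to the underlying straight line.
print(model.coef_)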

(Actually Machine Learning by Tom Mitchell contains a pretty good and more detailed explanation)

Answered Oct 19 '22 by CAFEBABE