 

How does having smaller values for parameters help in preventing over-fitting?

To reduce the problem of over-fitting in linear regression in machine learning, it is suggested to modify the cost function by adding the squares of the parameters to it. This results in smaller values of the parameters.

This is not at all intuitive to me. How can having smaller values for the parameters result in a simpler hypothesis and help prevent over-fitting?
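(For concreteness, the regularized cost being referred to is usually written as the squared error plus a penalty on the parameters. The sketch below is my own illustration, using generic names such as theta and lam that are not from any particular course or library.)

import numpy as np

def ridge_cost(theta, X, y, lam):
    # Squared-error cost plus an L2 penalty on the parameters.
    # theta: parameter vector (theta[0] assumed to be the intercept),
    # X: feature matrix with a leading column of ones, y: targets,
    # lam: regularization strength.
    residuals = X @ theta - y
    mse_term = np.sum(residuals ** 2) / (2 * len(y))
    penalty = lam * np.sum(theta[1:] ** 2)  # the intercept is usually not penalized
    return mse_term + penalty

A larger lam pushes the minimizer towards smaller values of theta[1:], which is the effect the question is asking about.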

Asked Jan 02 '16 by Anant Simran Singh

People also ask

How does the number of observations influence overfitting?

1. When the hypothesis space is richer, overfitting is more likely. 2. When the feature space is larger, overfitting is more likely.

How do you reduce overfitting in regression?

To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data.


2 Answers

I put together a rather contrived example, but hopefully it helps.

import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older scikit-learn versions
from sklearn.preprocessing import PolynomialFeatures

First, build a small linear dataset and split it into training and test sets, 5 points in each:

X,y, c = datasets.make_regression(10,1, noise=5, coef=True, shuffle=True, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=5)

(Plot: the original data points.)

Fit the data with a fifth order polynomial with no regularization.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
        ('poly',  PolynomialFeatures(5)),
        ('model', Ridge(alpha=0.))  # alpha=0 indicates 0 regularization.
    ])

pipeline.fit(X_train,y_train)

Looking at the coefficients

pipeline.named_steps['model'].coef_
pipeline.named_steps['model'].intercept_

# y_pred = -12.82 + 33.59 x + 292.32 x^2 - 193.29 x^3 - 119.64 x^4 + 78.87 x^5

(Plot: fifth-order fit with no regularization.)

Here the model passes through all the training points, but it has large coefficients and misses the test points.
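One way to see this in numbers (my own check, not part of the original figure) is to compare the R^2 score on the train and test splits:

# Near-perfect fit on the 5 training points, but typically a very poor
# (often negative) R^2 on the held-out test points.
print("train R^2:", pipeline.score(X_train, y_train))
print("test  R^2:", pipeline.score(X_test, y_test))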

Let's try again, but this time add some L2 regularization:

pipeline.set_params(model__alpha=1)
pipeline.fit(X_train, y_train)  # refit with the regularization term active

(Plot: fifth-order fit with L2 regularization.)

# y_pred = 6.88 + 26.13 x + 16.58 x^2 + 12.47 x^3 + 5.86 x^4 - 5.20 x^5

Here we see a much smoother curve with less wiggling around. It no longer passes through every training point, and the coefficients are smaller because of the added regularization.
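If you want to see how the strength of the penalty affects this, a small sweep over alpha (an addition of mine, not from the original answer) shows the coefficients shrinking as alpha grows:

# Refit for a few regularization strengths and watch the coefficient
# magnitudes (and usually the test score) change as alpha increases.
for alpha in [0.0, 0.1, 1.0, 10.0]:
    pipeline.set_params(model__alpha=alpha)
    pipeline.fit(X_train, y_train)
    coefs = pipeline.named_steps['model'].coef_
    print(f"alpha={alpha:>4}: max |coef| = {np.abs(coefs).max():.2f}, "
          f"test R^2 = {pipeline.score(X_test, y_test):.2f}")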

Answered Oct 19 '22 by David Maust


This is a bit more complicated. It depends very much on the algorithm you are using.

To make an easy but slightly stupid example: instead of optimising the parameters of the function

  y = a*x1 + b*x2 

you could also optimise the parameters of

  y = 1/a * x1 + 1/b * x2 

Obviously, if you minimise the parameters in the former case, you need to maximise them in the latter case.

The justification for minimising the square of the parameters, which holds for most algorithms, comes from computational learning theory.

Let's assume for the following that you want to learn a function

 f(x) = a + bx + c * x^2 + d * x^3 +....

One can argue that a function where only a is different from zero is more likely than a function where a and b are different from zero, and so on. Following Occam's razor (if you have two hypotheses explaining your data, the simpler one is more likely to be the right one), you should prefer a hypothesis where more of your parameters are zero.

To give an example, let's say your data points are (x,y) = {(-1,0),(1,0)}. Which function would you prefer:

f(x) = 0 

or

f(x) = -1 +  1*x^2

Both functions fit the two points exactly, but the first is simpler. Extending this a bit, you can go from preferring parameters that are zero to preferring parameters that are small.

If you want to try it out, you can sample some data points from a linear function and add a bit of Gaussian noise. If you want a perfect polynomial fit, you need a pretty complicated function, typically with pretty large weights. However, if you apply regularisation you will come close to your data-generating function.
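A minimal sketch of that experiment (my code, not the answerer's; the names are arbitrary): sample noisy points from a straight line, then fit a high-degree polynomial with and without an L2 penalty and compare the learned weights.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.3, size=20)  # a line plus Gaussian noise

for alpha in (0.0, 1.0):  # no regularisation vs. some L2 regularisation
    model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=alpha))
    model.fit(x, y)
    weights = model.named_steps['ridge'].coef_
    print(f"alpha={alpha}: largest |weight| = {np.abs(weights).max():.1f}")

With alpha=0 the degree-9 fit will typically chase the noise and use much larger weights; with the penalty switched on, the learned function stays close to the underlying line.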

But if you want to put your reasoning on rock-solid theoretical foundations, I would recommend applying Bayesian statistics. The idea there is that you define a probability distribution over regression functions. That way you can define for yourself what a "probable" regression function is.
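If you want to play with that idea without deriving anything by hand, one option (my suggestion, not something the answer prescribes) is scikit-learn's BayesianRidge, which places Gaussian priors on the weights and infers the regularisation strength from the data:

import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = np.linspace(-1, 1, 30).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(scale=0.2, size=30)

X_poly = PolynomialFeatures(5, include_bias=False).fit_transform(x)
model = BayesianRidge().fit(X_poly, y)

# The prior pulls the unnecessary higher-order weights towards zero,
# so the posterior mean stays close to the underlying straight line.
print(model.coef_)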

(Actually Machine Learning by Tom Mitchell contains a pretty good and more detailed explanation)

Answered Oct 19 '22 by CAFEBABE