
How does regularization parameter work in regularization?

In a machine learning cost function, if we want to minimize the influence of two parameters, say theta3 and theta4, it seems we have to give the regularization parameter a large value, as in the equation below.

[Image: cost function with a regularization penalty on theta3 and theta4]

I am not quite sure why a bigger regularization parameter reduces the influence instead of increasing it. How does this function work?

Dukakus17 asked Jan 03 '23 19:01


2 Answers

This is because the optimal values of the thetas are found by minimizing the cost function.

As you increase the regularization parameter, the optimizer has to choose smaller thetas in order to keep the total cost low.
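To make this concrete, here is a minimal sketch with a single parameter. It assumes a cost of the form J(theta) = (1/(2m)) * sum((theta*x - y)^2) + lam * theta^2, which may differ from the equation in the question by a constant factor; the function name `ridge_theta` and the toy data are made up for illustration.

```python
def ridge_theta(xs, ys, lam):
    """Return the theta minimizing
    J(theta) = (1/(2m)) * sum((theta*x - y)^2) + lam * theta^2
    (assumed cost form, one parameter, no intercept)."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting dJ/dtheta = 0 gives theta = sxy / (sxx + 2*m*lam).
    return sxy / (sxx + 2 * m * lam)


# Toy data lying exactly on y = 2x, so the unregularized optimum is theta = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

lams = [0.0, 1.0, 10.0, 1000.0]
thetas = [ridge_theta(xs, ys, lam) for lam in lams]

for lam, theta in zip(lams, thetas):
    print(f"lambda = {lam:7.1f}  ->  optimal theta = {theta:.4f}")
```

The regularization parameter only appears in the denominator, so a larger value makes any nonzero theta more expensive and pulls the minimizer toward zero. With a coefficient like 1000, the optimal theta ends up close to 0.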

Siva-Sg answered Jan 13 '23 15:01


Quoting from an answer to a similar question:

At a high level, you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of a model is often measured by the size of the model w viewed as a vector. The overall loss function, as in your example above, consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called a penalty).

It is useful to think about what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error term, so the model is adjusted primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model improves and the model vector grows, the regularization term becomes a more significant part of the loss. Regularization prevents the model vector from growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple versus reducing training error.

There are different types of regularization terms in common use. The one you have, and the one most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization, which has the form λ∑i|wi|, i.e. it penalizes the sum of the absolute values of the model parameters. It favors concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low-dimensional models, while lasso tends to work better for high-dimensional models like text classification, where it leads to sparse models, i.e. models with few non-zero parameters.

There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So you have 3 terms in your loss function: the error term and the 2 regularization terms, each with its own regularization parameter.
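The difference between the penalties can be sketched in a few lines. This assumes the L2 term has the form λ∑i wi² (some texts add a factor of 1/2); the function names and the example weight vectors are made up for illustration.

```python
def l1_penalty(w, lam):
    """L1 (lasso) term: lam * sum of absolute values."""
    return lam * sum(abs(wi) for wi in w)


def l2_penalty(w, lam):
    """L2 term, assumed here to be lam * sum of squares."""
    return lam * sum(wi * wi for wi in w)


def elastic_net_penalty(w, lam1, lam2):
    """Elastic net: an L1 term and an L2 term, each with its own parameter."""
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)


# Two weight vectors with the same L1 size: L2 charges the concentrated one more,
# so L2 prefers spreading weight across components.
concentrated = [2.0, 0.0]
spread = [1.0, 1.0]
print("L1:", l1_penalty(concentrated, 1.0), l1_penalty(spread, 1.0))
print("L2:", l2_penalty(concentrated, 1.0), l2_penalty(spread, 1.0))

# Two weight vectors with the same L2 size: L1 charges the spread one more,
# so L1 prefers concentrating weight in a few components.
concentrated4 = [2.0, 0.0, 0.0, 0.0]
spread4 = [1.0, 1.0, 1.0, 1.0]
print("L2:", l2_penalty(concentrated4, 1.0), l2_penalty(spread4, 1.0))
print("L1:", l1_penalty(concentrated4, 1.0), l1_penalty(spread4, 1.0))

# Elastic net adds both terms on top of the error term.
print("elastic net:", elastic_net_penalty(spread, 0.5, 0.5))
```

These are only the penalty terms. In a full loss they are added to the error term, and the optimizer trades the two off according to the regularization parameters.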

Failed Scientist answered Jan 13 '23 13:01