This might seem like a stupid question, but I just can't come up with a reasonable answer.
It is said that regularization can help us obtain simple models over complex ones to avoid over-fitting. But for a linear classification problem:
f(x) = Wx
The complexity of the model is somewhat specified: it's linear, not quadratic or something more complex. So why do we still need regularization on the parameters? Why do we prefer smaller weights in such cases?
The need to regularize a model tends to diminish as you increase the number of samples you train it with, or as you reduce the model's complexity. However, the number of examples needed to train a model without regularization (or with a very small regularization effect) increases [super]exponentially with the number of parameters and possibly other factors inherent in the model.
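As a rough numerical illustration of that point (the setup below is invented for the example), you can fit an unregularized linear model on increasingly many samples and watch the held-out error drop:

```python
import numpy as np

# Unregularized linear fit with 50 features, evaluated on held-out data,
# as the number of training samples grows. Sizes and noise are arbitrary.
rng = np.random.RandomState(0)
n_features = 50
true_W = rng.randn(n_features)
X_test = rng.randn(1000, n_features)
y_test = X_test @ true_W + rng.randn(1000)

for n_train in [60, 200, 1000]:
    X = rng.randn(n_train, n_features)
    y = X @ true_W + rng.randn(n_train)
    W, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares, no regularization
    test_mse = np.mean((y_test - X_test @ W) ** 2)
    print(f"n_train={n_train:5d}  test MSE={test_mse:.2f}")
```

With only slightly more samples than parameters the test error is large; with many samples it approaches the noise level, which is why regularization matters most in the small-sample regime.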
Since in most machine learning problems we do not have the required number of training samples, or the model complexity is large, we have to use regularization in order to avoid, or lessen the possibility of, over-fitting. Intuitively, the way regularization works is that it adds a penalty term to

argmin_W ∑ L(desired, predictionFunction(Wx))

where L is a loss function that measures how much the model's prediction deviates from the desired targets. So the new objective becomes

argmin_W ∑ L(desired, predictionFunction(Wx)) + lambda * reg(W)

where reg is a type of regularization (e.g. the squared L2 norm) and lambda is a coefficient that controls the strength of the regularization effect. Then, while minimizing the cost function, the weight vector is pushed to have a small squared length (e.g. a small squared L2 norm) and to shrink towards zero. This is because the larger the squared length of the weight vector, the higher the loss, so the optimization has to trade off lowering the model's data loss against keeping the weights small.
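A minimal sketch of that objective, assuming a squared-error loss and squared-L2 regularization (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    predictions = X @ W                          # linear model f(x) = Wx
    data_loss = np.sum((y - predictions) ** 2)   # sum of per-example losses L
    penalty = lam * np.sum(W ** 2)               # lambda * squared L2 norm of W
    return data_loss + penalty

# Toy usage: with lam > 0, larger weights raise the cost even if they fit
# the training points slightly better.
rng = np.random.RandomState(0)
X = rng.randn(20, 5)
true_W = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_W + 0.1 * rng.randn(20)
print(regularized_loss(np.zeros(5), X, y, lam=0.1))
```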
Now imagine you remove the regularization term (lambda = 0). The model parameters are then free to take any values, so the squared length of the weight vector can grow without bound, whether the model is linear or non-linear. This adds another dimension to the complexity of the model (in addition to the number of parameters), and the optimization procedure may find weight vectors that exactly match the training data points. However, when exposed to unseen (validation or test) data, the model will not generalize well because it has over-fitted to the training data.
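To see this concretely, here is a small sketch (the data and the lambda values are made up) comparing an essentially unregularized fit with a mildly regularized one when there are fewer training samples than parameters:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Linear regression with more features than training samples, so the
# unregularized fit can match the training points almost exactly.
rng = np.random.RandomState(0)
n_train, n_test, n_features = 20, 200, 50
true_W = rng.randn(n_features)

X_train = rng.randn(n_train, n_features)
y_train = X_train @ true_W + rng.randn(n_train)   # noisy targets
X_test = rng.randn(n_test, n_features)
y_test = X_test @ true_W + rng.randn(n_test)

# alpha plays the role of lambda; ~0 means effectively no regularization.
for alpha in [1e-8, 1.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"lambda={alpha:g}  train MSE={train_mse:.2f}  test MSE={test_mse:.2f}")
```

Typically the lambda ≈ 0 model achieves a near-zero training error but a much worse test error than the regularized one, which is exactly the over-fitting described above.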