 

Understanding Regularization in Keras

Tags:

python

keras

I am trying to understand why the regularization syntax in Keras looks the way it does.

Roughly speaking, regularization is a way to reduce overfitting by adding a penalty term to the loss function, proportional to some function of the model weights. Therefore, I would expect regularization to be defined as part of the specification of the model's loss function.

However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:

from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras.regularizers import l2  # the 0.01 factors below are illustrative

input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units = 250, activation='tanh', name='dense_1', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activity_regularizer=l2(0.01))(input)
x = Dense(units = 28, name='dense_2', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activity_regularizer=l2(0.01))(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel = Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

I would have expected that the regularization arguments in the Dense layers were not needed, and that I could instead write the last line more like:

mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')

This is obviously wrong syntax, but I was hoping someone could elaborate a bit on why the regularizers are defined this way and what is actually happening when I use layer-level regularization.

The other thing I don't understand is under what circumstances I would use each (or all) of the three regularization options: kernel_regularizer, activity_regularizer, and bias_regularizer?

asked Jun 01 '18 by Sledge


People also ask

What is regularization Keras?

Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are summed into the loss function that the network optimizes. Regularization penalties are applied on a per-layer basis.
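For instance, here is a minimal sketch (tf.keras API; the 0.01 factor is an illustrative hyperparameter) showing that each layer's penalty is registered as a tensor in model.losses, which Keras sums into the training loss:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A one-layer model with an L2 penalty on its kernel.
model = tf.keras.Sequential([
    layers.Dense(4, input_shape=(3,), kernel_regularizer=regularizers.l2(0.01)),
])

# The layer registered its penalty; Keras adds every entry of
# model.losses to the compiled loss at training time.
print(model.losses)  # one scalar tensor holding 0.01 * sum(W ** 2)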

What is L1 vs L2 regularization?

The differences between L1 and L2 regularization: L1 regularization penalizes the sum of the absolute values of the weights, whereas L2 regularization penalizes the sum of the squares of the weights. The L1 regularization solution is sparse; the L2 regularization solution is non-sparse.
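A small NumPy sketch (the weights and the 0.01 strength are arbitrary) makes the difference concrete:

import numpy as np

w = np.array([0.5, -1.0, 2.0])  # example weight vector
lam = 0.01                      # regularization strength

l1_penalty = lam * np.sum(np.abs(w))  # 0.01 * 3.5  = 0.035
l2_penalty = lam * np.sum(w ** 2)     # 0.01 * 5.25 = 0.0525

Because the L1 gradient has constant magnitude, it can drive small weights exactly to zero, which is where the sparsity comes from; the L2 gradient shrinks each weight in proportion to its size.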

How do you regularize in Keras?

To add a regularizer to a layer, you simply pass the preferred regularization technique to the layer's keyword argument 'kernel_regularizer'. The Keras regularizer implementations also accept a parameter that sets the regularization hyperparameter value.
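For example (the 0.01 factor is illustrative):

from tensorflow.keras import layers, regularizers

# Either the string shortcut with the default factor...
layers.Dense(64, kernel_regularizer='l2')
# ...or an instance with an explicit hyperparameter.
layers.Dense(64, kernel_regularizer=regularizers.l2(0.01))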

What is the concept of regularization?

Regularization is a technique for tuning the fitted function by adding a penalty term to the error function. The additional term constrains excessively fluctuating functions so that the coefficients don't take extreme values.
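A conceptual sketch of that penalized objective (the function name and the 0.01 strength are made up for illustration):

import numpy as np

def regularized_loss(data_loss, weights, lam=0.01):
    # The penalty grows with the magnitude of the coefficients,
    # discouraging extreme values (an L2 penalty here).
    penalty = sum(float(np.sum(w ** 2)) for w in weights)
    return data_loss + lam * penalty

# e.g. a data loss of 0.30 plus the penalty on two weight arrays:
loss = regularized_loss(0.30, [np.ones((2, 2)), np.ones(2)])  # 0.36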


1 Answer

Let's break down the components of your question:

  1. Your expectation of regularisation is probably in line with a feed-forward network, where, yes, the penalty term is applied to the weights of the overall network. But that is not necessarily the case when you mix RNNs with CNNs etc., so Keras opts to give fine-grained control. Perhaps, for easy setup, a model-level regularisation applied to all weights could be added to the API.

  2. When you use layer regularisation, the base Layer class actually adds the regularising term to the loss, which at training time penalises the corresponding layer's weights (or activations); see the sketch after this list.

  3. Now in Keras you can often apply regularisation to 3 different things, as in the Dense layer. Other layers have different kernels (recurrent kernels, etc.), so let's look at the ones you are interested in; roughly the same applies to all layers:

    1. kernel: this applies to the actual weights of the layer; in Dense it is the W of Wx + b.
    2. bias: the bias vector of the layer, so you can apply a different regulariser to it; the b in Wx + b.
    3. activity: applied to the output vector, the y in y = f(Wx + b).
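Here is a minimal sketch tying points 2 and 3 together (tf.keras API; the penalty factors are illustrative). Building and calling the layer registers one loss tensor per regulariser, and Keras adds each of them to the compiled loss:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# One Dense layer with independent penalties on W, b, and f(Wx + b).
layer = layers.Dense(
    10, activation='relu',
    kernel_regularizer=regularizers.l2(0.01),     # on W
    bias_regularizer=regularizers.l2(0.001),      # on b
    activity_regularizer=regularizers.l1(0.001),  # on the output f(Wx + b)
)

model = tf.keras.Sequential([layer])
y = model(tf.random.normal((4, 20)))  # the activity term needs a forward pass
print(len(model.losses))  # 3: kernel, bias, and activity terms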
answered Sep 22 '22 by nuric