I am trying to understand why regularization syntax in Keras looks the way that it does.
Roughly speaking, regularization is a way to reduce overfitting by adding a penalty term to the loss function, proportional to some function of the model weights. Therefore, I would expect regularization to be defined as part of the specification of the model's loss function.
However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:
from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras.regularizers import l2  # l2(0.01) below uses an example regularization factor

# input_shape is assumed to be defined elsewhere (the number of input features)
input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units=250, activation='tanh', name='dense_1', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activity_regularizer=l2(0.01))(input)
x = Dense(units=28, name='dense_2', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01), activity_regularizer=l2(0.01))(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel = Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
I would have expected that the regularization arguments in the Dense layers would not be needed, and that I could just write the last line more like:
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')
This is obviously invalid syntax, but I was hoping someone could elaborate a bit on why the regularizers are defined this way and what actually happens when I use layer-level regularization.
The other thing I don't understand is: under what circumstances would I use each, or all, of the three regularization options (kernel_regularizer, bias_regularizer, activity_regularizer)?
Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are summed into the loss function that the network optimizes. Regularization penalties are applied on a per-layer basis.
The difference between L1 and L2 regularization: L1 regularization penalizes the sum of the absolute values of the weights, whereas L2 regularization penalizes the sum of the squares of the weights. The L1 solution is sparse; the L2 solution is non-sparse.
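As a rough sketch of the two penalties (the weight values and the factor below are arbitrary, purely for illustration):

import numpy as np

weights = np.array([0.5, -1.2, 0.0, 3.0])   # example weight vector
factor = 0.01                               # example regularization factor

l1_penalty = factor * np.sum(np.abs(weights))     # L1: sum of absolute values -> encourages sparsity
l2_penalty = factor * np.sum(np.square(weights))  # L2: sum of squares -> shrinks weights smoothly

print(l1_penalty, l2_penalty)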
To add a regularizer to a layer, you simply pass the preferred regularization technique to the layer's keyword argument kernel_regularizer. The built-in Keras regularizers take an argument that sets the regularization factor (the hyperparameter value).
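For example (the 0.01 factor is just an illustrative value):

from tensorflow.keras import layers, regularizers

dense_a = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))  # explicit factor
dense_b = layers.Dense(64, activation='relu', kernel_regularizer='l2')                   # string shorthand, default factor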
Regularization is a technique for tuning the fitted function by adding a penalty term to the error function. The additional term discourages the function from fluctuating excessively, so that the coefficients do not take extreme values.
Let's break down the components of your question:
Your expectation of regularisation is probably in line with a plain feed-forward network, where, yes, the penalty term is applied to the weights of the overall network. But this is not necessarily the case when you mix RNNs with CNNs etc., so Keras opts to give fine-grained control. Perhaps, for easy setup, a model-level regularisation applied to all weights could be added to the API.
When you use layer regularisation, the base Layer class actually adds the regularising term to the loss, which at training time penalises the corresponding layer's weights etc.
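You can see this mechanism directly: once a regularised layer is built, the penalty tensor appears in the layer's (and the model's) losses collection, and Keras adds it to the total loss during training. A minimal sketch (layer sizes and the factor are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

dense = layers.Dense(4, kernel_regularizer=regularizers.l2(0.01))
_ = dense(tf.ones((1, 3)))   # calling the layer builds it and creates the penalty tensor

print(dense.losses)          # a list containing the L2 penalty on the kernel, which is summed into the training loss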
Now, in Keras you can often apply regularisation to three different things, as in the Dense layer. Other layers have different kernels (recurrent kernels etc.), so let's look at the ones you asked about, but roughly the same applies to all layers:

kernel_regularizer: applies the penalty to the layer's kernel (the weight matrix W).
bias_regularizer: applies the penalty to the layer's bias vector b.
activity_regularizer: applies the penalty to the layer's output (its activation).
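Put together in code (the factors are illustrative example values; in practice kernel_regularizer is the most commonly used of the three):

from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    units=64,
    kernel_regularizer=regularizers.l2(1e-4),    # penalty on the weight matrix W
    bias_regularizer=regularizers.l2(1e-4),      # penalty on the bias vector b
    activity_regularizer=regularizers.l2(1e-5),  # penalty on the layer's output f(Wx + b)
)

The kernel and bias regularizers constrain the parameters themselves, while the activity regularizer pushes the layer's outputs towards small values.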