Is the L1 regularization in Keras/Tensorflow *really* L1-regularization?

I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
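For reference, here is a minimal sketch of that kind of setup (the layer sizes, input shape and the use of tf.keras below are my own choices for illustration):

    from tensorflow import keras

    # Toy model with an L1 penalty on the hidden layer's kernel weights,
    # analogous to the setup described above (sizes are arbitrary).
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(100,),
                           kernel_regularizer=keras.regularizers.l1(0.01)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")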

Looking at the source code for the regularizer, it appears that Keras simply adds the L1 norm of the parameters to the loss function.

This would not be true L1 regularization, because the parameters would then almost certainly never become exactly zero (only zero to within floating-point error), which is the whole point of using it. The L1 norm is not differentiable at zero, so a subgradient or proximal method has to be used, in which parameters that get close enough to zero are set exactly to zero inside the optimization routine. See the soft-thresholding operator max(0, ..) used in proximal gradient methods.
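For concreteness, here is a minimal NumPy sketch of that soft-thresholding (proximal) operator; the function name and example values are mine:

    import numpy as np

    def soft_threshold(w, lam):
        # Proximal operator of lam * |w|: shrink each weight towards 0
        # and clamp everything with |w| <= lam to exactly 0.
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

    w = np.array([0.5, 0.009, -0.02, -0.001])
    print(soft_threshold(w, 0.01))  # entries with |w| <= 0.01 become exactly 0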

Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?

EDIT: Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.

asked Mar 31 '17 by Cokes

1 Answer

So despite @Joshua's answer, there are three other things worth mentioning:

  1. There is no problem with the gradient at 0: Keras automatically sets it to 1 there, similarly to the relu case.
  2. Remember that values smaller than 1e-6 are effectively equal to 0, as that is roughly float32 precision.
  3. The problem of most values not being set exactly to 0 can also arise for computational reasons, due to the nature of a gradient-descent-based algorithm (especially with a high l1 value): the gradient discontinuity at 0 causes oscillations. To see this, imagine that for a given weight w = 0.005 your learning rate is 0.01 and the gradient of the main loss w.r.t. w is 0. Then your weight would be updated in the following manner:

    w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1, as w > 0),
    

    and after the second update:

    w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1, as w < 0).
    

    As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this is due to the nature of the gradient-based algorithm. Of course this is a simplified situation, but you can run into such oscillating behaviour quite often when using the l1 norm regularizer; the short simulation below illustrates it.
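A tiny plain-Python simulation of that update rule (using the numbers from point 3 and assuming the main-loss gradient stays 0) makes the oscillation visible:

    # Plain SGD step with only the l1 term active: w <- w - lr * sign(w)
    w, lr = 0.005, 0.01

    def sign(x):
        return (x > 0) - (x < 0)  # subgradient of |w|; 0 only at w == 0

    for step in range(4):
        w = w - lr * sign(w)
        print(step, w)  # w keeps jumping between roughly +0.005 and -0.005

    # |w| never shrinks below 0.005, so the weight never becomes exactly 0;
    # a proximal/soft-thresholding update would set it to 0 instead.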

answered Sep 28 '22 by Marcin Możejko