 

How to define weight decay for individual layers in TensorFlow?

Tags:

tensorflow

In CUDA ConvNet, we can write something like this (source) for each layer:

[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0

where wc=0 refers to the L2 weight decay.

How can the same be achieved in TensorFlow?

Asked Apr 12 '16 by M.Y. Babt

1 Answer

Both current answers are wrong in that they do not give you "weight decay as in cuda-convnet" but instead L2-regularization, which is different.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (don't know how to TeX here, so excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2-regularization:

loss = actual_loss + lambda * 1/2 sum(||w||_2^2 for w in network_params)
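
For concreteness, that extra term could be built in TF 1.x graph code roughly like this (my sketch, not part of the original question or answer; `actual_loss` and the coefficient `lmbda` are assumed to exist, and tf.nn.l2_loss already includes the 1/2 factor):

# Sketch: add an L2 penalty over all trainable variables to the loss.
l2_term = lmbda * tf.add_n([tf.nn.l2_loss(w) for w in tf.trainable_variables()])
loss = actual_loss + l2_term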

Computing the gradient of the extra L2-regularization term gives lambda * w, so inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw

gives the same update as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)
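
A tiny numeric sketch (plain Python, my own illustration) of why momentum already breaks the equivalence: with L2-in-the-loss the penalty gradient lands in the momentum buffer, while decoupled weight decay shrinks the weight directly.

# One SGD-with-momentum step: L2-in-the-loss vs. decoupled weight decay.
lr, mu, lam = 0.1, 0.9, 0.01   # learning rate, momentum, decay coefficient
w, v = 1.0, 0.5                # current weight and momentum buffer
g = 0.2                        # gradient of the actual loss at w

# (a) L2-regularization: lam * w is added to the gradient, so it is also
#     accumulated (and later amplified) by the momentum buffer.
v_a = mu * v + (g + lam * w)
w_a = w - lr * v_a

# (b) Weight decay: momentum only sees the true gradient, the decay is
#     applied directly to the weight.
v_b = mu * v + g
w_b = w - lr * v_b - lam * w

print(w_a, w_b)   # ~0.934 vs. ~0.925 -- already different after one step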

That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is by writing an op that does the decay step manually after every optimizer step (a rough sketch of this is given further below). A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude work-arounds, though. My current code:

# Assumes TF 1.x: `layers` is tf.contrib.layers, `arg_scope` is tf.contrib.framework.arg_scope.
import tensorflow as tf
from tensorflow.contrib import layers
from tensorflow.contrib.framework import arg_scope

# In the network definition: this registers an L2 term for every layer's
# weights in the REGULARIZATION_LOSSES collection.
with arg_scope([layers.conv2d, layers.fully_connected],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Chain a plain SGD step on the summed regularization losses; with
    # learning_rate=1.0 this is exactly the decay step w -= weight_decay * w.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(
            tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then sum up and optimize with plain SGD, which, as shown above, corresponds to actual weight decay.
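
For completeness, the first workaround mentioned above (a manual decay op chained after every optimizer step) could look roughly like this. It's only a sketch of mine, and note it decays all trainable variables, including biases, unless you filter the list:

# Manual decay op: after the optimizer step, shrink every trainable variable
# by the factor (1 - weight_decay), i.e. w -= weight_decay * w.
train_op = optimizer.minimize(loss, global_step=global_step)
with tf.control_dependencies([train_op]):
    decay_ops = [tf.assign(w, w * (1.0 - weight_decay))
                 for w in tf.trainable_variables()]
    train_op = tf.group(*decay_ops)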

Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR which just got merged into TF.
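
If you're on a recent enough TF 1.x release, that merged code should be usable roughly like this (hedged sketch; I believe it lives in tf.contrib.opt, e.g. AdamWOptimizer and extend_with_decoupled_weight_decay, so check what your version actually exposes):

# Sketch: decoupled weight decay via the contrib optimizers.
optimizer = tf.contrib.opt.AdamWOptimizer(weight_decay=1e-4, learning_rate=1e-3)
train_op = optimizer.minimize(loss, global_step=global_step)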

Answered Dec 08 '22 by LucasB