
How to implement weight decay in tensorflow as in Caffe

In Caffe we have a decay_ratio, usually set to 0.0005. Then every trainable parameter, e.g. the W matrix in FC6, is decayed by W = W * (1 - 0.0005) after the gradient has been applied to it.
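For concreteness, a toy NumPy sketch of that update step (the learning rate, decay ratio and shapes are made up for illustration):

import numpy as np

lr = 0.01
decay_ratio = 0.0005

W = np.random.randn(256, 256).astype(np.float32)   # e.g. an FC weight matrix
dW = np.random.randn(256, 256).astype(np.float32)  # gradient of the loss w.r.t. W

W -= lr * dW               # apply the gradient step
W *= (1.0 - decay_ratio)   # then decay: W = W * (1 - 0.0005)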

I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).

In my experience, I often run into numerical problems after 100k iterations during training.

I have also gone through related questions on Stack Overflow, e.g., How to set weight cost strength in TensorFlow? However, the solution there seems a little different from how it is implemented in Caffe.

Does anyone have similar concerns? Thank you.

asked Aug 10 '16 by user2868512


People also ask

How do you use weight decay in TensorFlow?

One way to get weight decay in TensorFlow is by adding L2-regularization to the loss. This is equivalent to weight decay for standard SGD (but not for adaptive gradient optimizers), according to the Decoupled Weight Decay Regularization paper by Loshchilov & Hutter.
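For example, a minimal TF 1.x-style sketch of adding such an L2 penalty to the loss (the tiny linear model and the weight_decay value are only illustrative):

import tensorflow as tf  # written for TF 1.x graph mode

weight_decay = 5e-4  # illustrative, analogous to Caffe's decay_ratio

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable('w', [10, 1])
b = tf.get_variable('b', [1], initializer=tf.zeros_initializer())
pred = tf.matmul(x, w) + b

data_loss = tf.reduce_mean(tf.square(pred - y))   # task loss
l2_loss = weight_decay * tf.nn.l2_loss(w)         # L2 penalty on the weights
total_loss = data_loss + l2_loss

train_op = tf.train.GradientDescentOptimizer(0.1).minimize(total_loss)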

What is weight decay in Adam Optimizer?

Weight decay is a regularization technique that adds a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function: loss = loss + weight_decay_parameter * L2_norm_of_the_weights.

How do you set weight decay in Adam?

In Adam, weight decay is usually implemented by adding wd*w (where wd is the weight decay factor) to the gradients (the first case), rather than actually subtracting it from the weights (the second case).
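In plain NumPy terms the two cases look roughly like this (lr, wd, the weights and the gradient are illustrative values):

import numpy as np

lr, wd = 0.01, 5e-4
w = np.ones(3, dtype=np.float32)
grad = np.full(3, 0.1, dtype=np.float32)   # gradient of the task loss

# Case 1: decay folded into the gradient (the usual "L2-as-weight-decay" form)
w1 = w - lr * (grad + wd * w)

# Case 2: decoupled decay, subtracted from the weights directly
w2 = w - lr * grad - wd * w

# For plain SGD the two differ only by the factor lr on the decay term;
# for Adam they give genuinely different updates.
print(w1, w2)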

Does Adam use regularization?

Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam.


2 Answers

The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (don't know how to TeX here, so excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2-regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw

gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)
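A quick NumPy check of that equivalence for plain SGD (all values are illustrative):

import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
dw = np.array([0.5, 0.5, 0.5])    # gradient of the actual loss

# SGD on the L2-regularized loss: the gradient picks up an extra lam * w term
w_l2 = w - lr * (dw + lam * w)

# SGD with weight decay, using weight_decay = lr * lam
w_wd = w - lr * dw - (lr * lam) * w

print(np.allclose(w_l2, w_wd))    # True: identical for plain SGD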

That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:

# In the network definition; `layers` / `arg_scope` here come from my setup
# (tf.contrib-style layers that accept a weights_regularizer argument):
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network

loss = ...  # compute the actual loss of your problem
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Attach a second, decay-only SGD step that runs after the main update.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(
            tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

This makes some use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key, which I then sum up and minimize with plain SGD, which, as shown above, corresponds to actual weight decay.
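For completeness, the first work-around (doing the decay step manually after every optimizer step) might look roughly like the following sketch, assuming TF 1.x graph mode; the toy model and decay_rate are only illustrative:

import tensorflow as tf

decay_rate = 5e-4   # illustrative, analogous to Caffe's decay_ratio

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable('w', [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

grad_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# After each gradient step, shrink the weights multiplicatively,
# i.e. the W = W * (1 - decay) update described in the question.
with tf.control_dependencies([grad_step]):
    decay_ops = [v.assign(v * (1.0 - decay_rate))
                 for v in tf.trainable_variables()]
    train_op = tf.group(*decay_ops)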

Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR which just got merged into TF.

answered by LucasB


This is a duplicate question:

How to define weight decay for individual layers in TensorFlow?

import tensorflow as tf

WEIGHT_DECAY_FACTOR = 5e-4  # the lambda you choose (see note below)

# Create your variables and put them in a 'weights' collection
# (the shape here is illustrative; a shape is required on first creation).
weights = tf.get_variable('weights', shape=[10, 10],
                          collections=['weights', tf.GraphKeys.GLOBAL_VARIABLES])

with tf.variable_scope('weights_norm'):
    # Sum tf.nn.l2_loss (0.5 * ||w||^2) over everything in the 'weights'
    # collection, scaled by the decay factor.
    # (tf.pack in the original snippet was renamed tf.stack in TF 1.0.)
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called 'losses'
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the 'losses' collection
# ...

# To calculate your total loss
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')

You can set WEIGHT_DECAY_FACTOR to whatever lambda value you want for the weight decay; the above just adds the scaled L2 norm of the weights to the 'losses' collection.
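To actually train with it, you would then minimize the collected total loss; continuing the snippet above (the optimizer choice and learning rate are illustrative):

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)  # illustrative choice
train_op = optimizer.minimize(total_loss)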

answered by Steven