
How to implement weight decay in tensorflow as in Caffe

In Caffe we have a decay_ratio, usually set to 0.0005. Then every trainable parameter, e.g. the W matrix in FC6, is decayed by W = W * (1 - 0.0005) after the gradient has been applied to it.
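For concreteness, a toy NumPy sketch of that update step (the learning rate, decay ratio and shapes are made up for illustration):

import numpy as np

lr = 0.01
decay_ratio = 0.0005

W = np.random.randn(256, 256).astype(np.float32)   # e.g. an FC weight matrix
dW = np.random.randn(256, 256).astype(np.float32)  # gradient of the loss w.r.t. W

W -= lr * dW               # apply the gradient step
W *= (1.0 - decay_ratio)   # then decay: W = W * (1 - 0.0005)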

I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).

In my experience, I often run into numerical problems after 100k iterations during training.

I have also gone through related questions on Stack Overflow, e.g., How to set weight cost strength in TensorFlow? However, the solution there seems a little different from how it is implemented in Caffe.

Does anyone have similar concerns? Thank you.

asked Aug 10 '16 by user2868512


People also ask

How do you use weight decay in TensorFlow?

One way to get weight decay in TensorFlow is by adding L2-regularization to the loss. This is equivalent to weight decay for standard SGD (but not for adaptive gradient optimizers), according to the Decoupled Weight Decay Regularization paper by Loshchilov & Hutter.
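For example, a minimal TF 1.x-style sketch of adding such an L2 penalty to the loss (the tiny linear model and the weight_decay value are only illustrative):

import tensorflow as tf  # written for TF 1.x graph mode

weight_decay = 5e-4  # illustrative, analogous to Caffe's decay_ratio

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable('w', [10, 1])
b = tf.get_variable('b', [1], initializer=tf.zeros_initializer())
pred = tf.matmul(x, w) + b

data_loss = tf.reduce_mean(tf.square(pred - y))   # task loss
l2_loss = weight_decay * tf.nn.l2_loss(w)         # L2 penalty on the weights
total_loss = data_loss + l2_loss

train_op = tf.train.GradientDescentOptimizer(0.1).minimize(total_loss)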

What is weight decay in Adam Optimizer?

Weight decay is a regularization technique that adds a small penalty, usually the L2 norm of the weights (all the weights of the model), to the loss function: loss = loss + weight_decay_parameter * L2_norm_of_the_weights.

How do you set weight decay in Adam?

In Adam, weight decay is usually implemented by adding wd*w (where wd is the weight decay factor) to the gradients (the first case), rather than actually subtracting it from the weights (the second case).
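In plain NumPy terms the two cases look roughly like this (lr, wd, the weights and the gradient are illustrative values):

import numpy as np

lr, wd = 0.01, 5e-4
w = np.ones(3, dtype=np.float32)
grad = np.full(3, 0.1, dtype=np.float32)   # gradient of the task loss

# Case 1: decay folded into the gradient (the usual "L2-as-weight-decay" form)
w1 = w - lr * (grad + wd * w)

# Case 2: decoupled decay, subtracted from the weights directly
w2 = w - lr * grad - wd * w

# For plain SGD the two differ only by the factor lr on the decay term;
# for Adam they give genuinely different updates.
print(w1, w2)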

Does Adam use regularization?

Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam.


2 Answers

The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (don't know how to TeX here, so excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2-regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw

gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)
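A quick NumPy check of that equivalence for plain SGD (all values are illustrative):

import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
dw = np.array([0.5, 0.5, 0.5])    # gradient of the actual loss

# SGD on the L2-regularized loss: the gradient picks up an extra lam * w term
w_l2 = w - lr * (dw + lam * w)

# SGD with weight decay, using weight_decay = lr * lam
w_wd = w - lr * dw - (lr * lam) * w

print(np.allclose(w_l2, w_wd))    # True: identical for plain SGD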

That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:

# In the network definition; `layers` / `arg_scope` here come from my setup
# (tf.contrib-style layers that accept a weights_regularizer argument):
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network

loss = ...  # compute the actual loss of your problem
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    # Attach a second, decay-only SGD step that runs after the main update.
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(
            tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

This makes some use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key, which I then sum up and minimize with plain SGD, which, as shown above, corresponds to actual weight decay.
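For completeness, the first work-around (doing the decay step manually after every optimizer step) might look roughly like the following sketch, assuming TF 1.x graph mode; the toy model and decay_rate are only illustrative:

import tensorflow as tf

decay_rate = 5e-4   # illustrative, analogous to Caffe's decay_ratio

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable('w', [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

grad_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# After each gradient step, shrink the weights multiplicatively,
# i.e. the W = W * (1 - decay) update described in the question.
with tf.control_dependencies([grad_step]):
    decay_ops = [v.assign(v * (1.0 - decay_rate))
                 for v in tf.trainable_variables()]
    train_op = tf.group(*decay_ops)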

Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR which just got merged into TF.

answered by LucasB


This is a duplicate question:

How to define weight decay for individual layers in TensorFlow?

import tensorflow as tf

WEIGHT_DECAY_FACTOR = 5e-4  # the lambda you choose (see note below)

# Create your variables and put them in a 'weights' collection
# (the shape here is illustrative; a shape is required on first creation).
weights = tf.get_variable('weights', shape=[10, 10],
                          collections=['weights', tf.GraphKeys.GLOBAL_VARIABLES])

with tf.variable_scope('weights_norm'):
    # Sum tf.nn.l2_loss (0.5 * ||w||^2) over everything in the 'weights'
    # collection, scaled by the decay factor.
    # (tf.pack in the original snippet was renamed tf.stack in TF 1.0.)
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called 'losses'
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the 'losses' collection
# ...

# To calculate your total loss
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')

You can set WEIGHT_DECAY_FACTOR to whatever lambda value you want for the weight decay; the above just adds the scaled L2 norm of the weights to the 'losses' collection.
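To actually train with it, you would then minimize the collected total loss; continuing the snippet above (the optimizer choice and learning rate are illustrative):

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)  # illustrative choice
train_op = optimizer.minimize(total_loss)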

answered by Steven