How can I implement max-norm constraints on the weights in an MLP in TensorFlow? The kind that Hinton and Dean describe in their work on dark knowledge. That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as in
https://arxiv.org/pdf/1207.0580.pdf
"If these networks share the same weights for the hidden units that are present. We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division."
Keras appears to have it
http://keras.io/constraints/
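For example, in Keras the constraint seems to be attached per layer; a minimal sketch, assuming the Keras 2 API where the argument is called kernel_constraint and the constraint is keras.constraints.max_norm:
from keras.models import Sequential
from keras.layers import Dense
from keras.constraints import max_norm

# Each hidden unit's incoming weight vector is renormalized after every
# gradient update so that its L2 norm is at most 3.
model = Sequential([
    Dense(128, activation='relu', input_shape=(784,),
          kernel_constraint=max_norm(3.)),
    Dense(10, activation='softmax'),
])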
Max-norm regularization is a technique that constrains the weights of a neural network. The constraint it imposes is simple: the incoming weight vector of each neuron is forced to have an L2 norm of at most r, where r is a hyperparameter. A weight constraint is an update to the network that checks the size of the weights and, if the size exceeds a predefined limit, rescales the weights so that they fall below the limit or within a range. The keras.constraints module allows setting such constraints (e.g. non-negativity) on model parameters during training; they are per-variable projection functions applied to the target variable after each gradient update (when using fit()).
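In other words, the constraint is a projection: after an update, any incoming weight vector whose L2 norm exceeds r is scaled back down to norm r, and vectors already inside the limit are left alone. A minimal NumPy sketch of that projection step (the function name is illustrative, not from any library):
import numpy as np

def max_norm_project(W, r):
    # W has shape (in_units, out_units); column j holds the incoming
    # weights of hidden unit j.
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)        # shape (1, out_units)
    scale = np.minimum(1.0, r / np.maximum(col_norms, 1e-12))   # shrink only, never grow
    return W * scale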
tf.nn.dropout does not impose any norm constraint. I believe what you're looking for is to "process the gradients before applying them" using tf.clip_by_norm.
For example, instead of simply:
# Create an optimizer + implicitly call compute_gradients() and apply_gradients()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
You could:
# Create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])
# grads_and_vars is a list of tuples (gradient, variable).
# Do whatever you need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_norm(gv[0], clip_norm=123.0, axes=0), gv[1])
                         for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
train_op = optimizer.apply_gradients(capped_grads_and_vars)
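Note that capping the gradients is not literally the same thing as the paper's max-norm constraint, which renormalizes the weights themselves after each update. If you want the latter, one way is to run an explicit clipping op on the weight variables after every training step; a sketch, continuing the TF1-style snippet above and assuming weights1 and weights2 are 2-D matrices of shape (in_units, out_units):
# Project the weights back into the max-norm ball after each update:
# every column (the incoming weights of one unit) keeps an L2 norm <= max_norm_value.
max_norm_value = 3.0
clip_weight_ops = [tf.assign(w, tf.clip_by_norm(w, clip_norm=max_norm_value, axes=[0]))
                   for w in [weights1, weights2]]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):   # num_steps and batch feeding are up to you
        sess.run(train_op)          # apply the (possibly capped) gradients
        sess.run(clip_weight_ops)   # then enforce the max-norm constraint on the weights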
I hope this helps. Two final notes about tf.clip_by_norm's axes parameter:
1. If you compute tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, and the dimensions of x and weights are (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes == [0] (because in this usage each column holds all the incoming weights to a specific unit).
2. Pay attention to the shapes of your variables and to whether/how you want to clip_by_norm each of them! E.g. if some of [weights1, weights2, ...] are matrices and some aren't, and you call clip_by_norm() on the grads_and_vars with the same axes value as in the list comprehension above, it doesn't mean the same thing for all the variables. In fact, if you're lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions, but otherwise it's a very sneaky bug.
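To make the second note concrete, one option is to pick the axes per gradient based on its rank instead of using one value for everything; a sketch (the helper name is illustrative):
def cap_gradient(grad, clip_norm=123.0):
    # 2-D weight gradients: clip each column (the incoming weights of one unit).
    if grad.get_shape().ndims == 2:
        return tf.clip_by_norm(grad, clip_norm=clip_norm, axes=[0])
    # 1-D bias gradients: axes=[0] would mean something different here,
    # so just clip the whole vector.
    return tf.clip_by_norm(grad, clip_norm=clip_norm)

capped_grads_and_vars = [(cap_gradient(g), v) for g, v in grads_and_vars]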