
Custom TensorFlow Keras optimizer

Suppose I want to write a custom optimizer class that conforms to the tf.keras API (using TensorFlow version >= 2.0). I am confused about the documented way to do this versus what's done in implementations.

The documentation for tf.keras.optimizers.Optimizer states,

  ### Write a customized optimizer.
  If you intend to create your own optimization algorithm, simply inherit from
  this class and override the following methods:

    - resource_apply_dense (update variable given gradient tensor is dense)
    - resource_apply_sparse (update variable given gradient tensor is sparse)
    - create_slots (if your optimizer algorithm requires additional variables)

However, the current tf.keras.optimizers.Optimizer implementation does not define a resource_apply_dense method, but it does define a private-looking _resource_apply_dense method stub. Similarly, there are no resource_apply_sparse or create_slots methods, but there are a _resource_apply_sparse method stub and a _create_slots method call.

In official tf.keras.optimizers.Optimizer subclasses (using tf.keras.optimizers.Adam as an example), there are _resource_apply_dense, _resource_apply_sparse, and _create_slots methods, and there are no such methods without the leading underscore.

There are similar leading-underscore methods in slightly-less-official tf.keras.optimizers.Optimizer subclasses (e.g., tfa.optimizers.MovingAverage from TensorFlow Addons: _resource_apply_dense, _resource_apply_sparse, _create_slots).

Another confounding point for me is that some of the TensorFlow Addons optimizers also override the apply_gradients method (e.g., tfa.optimizers.MovingAverage), whereas the tf.keras.optimizers optimizers do not.

Moreover, I noticed that the apply_gradients method of tf.keras.optimizers.Optimizer calls _create_slots, but the base tf.keras.optimizers.Optimizer class does not have a _create_slots method. So, it seems that a _create_slots method must be defined in an optimizer subclass if that subclass does not override apply_gradients.


Questions

What is the correct way to subclass a tf.keras.optimizers.Optimizer? Specifically,

  1. Does the tf.keras.optimizers.Optimizer documentation listed at the top simply mean to override the leading-underscore versions of the methods they mention (e.g., _resource_apply_dense instead of resource_apply_dense)? If so, are there any API guarantees about these private-looking methods not changing their behavior in future versions of TensorFlow? What are the signatures of these methods?
  2. When would one override apply_gradients in addition to the _resource_apply_[dense|sparse] methods?

Edit. Opened issue on GitHub: #36449

asked Nov 08 '19 by Artem Mavrin




2 Answers

Update: TF2.2 forced me to clean up all implementations - so now they can be used as a reference for TF best practices. Also added a section below on _get_hyper vs. _set_hyper.


I've implemented Keras AdamW in all major TF & Keras versions - I invite you to examine optimizers_v2.py. Several points:

  • You should inherit OptimizerV2, which is actually what you linked; it's the latest and current base class for tf.keras optimizers
  • You are correct in (1) - this is a documentation mistake; the methods are private, as they aren't meant to be used by the user directly.
  • apply_gradients (or any other method) is only overridden if the default doesn't accomplish what's needed for a given optimizer; in your linked example, it's just a one-liner add-on to the original
  • "So, it seems that a _create_slots method must be defined in an optimizer subclass if that subclass does not override apply_gradients" - the two are unrelated; it's coincidental.

  • What is the difference between _resource_apply_dense and _resource_apply_sparse?

The latter deals with sparse gradients - e.g. from an Embedding layer, where only the looked-up rows receive gradients - and the former with everything else; see the sketch below.

  • When should I use _create_slots()?

When the optimizer needs its own per-variable state tf.Variables ("slots"); example: the first and second order moments of the weights (e.g. Adam). Slots are created with add_slot().
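
To make this concrete, here is a minimal sketch of such a subclass - a toy SGD-with-momentum optimizer, not taken from the keras-adamw code linked above; the class name SimpleMomentum and the slot name "velocity" are made up for illustration. It overrides only the private methods plus get_config, using _set_hyper/_get_hyper for hyperparameters and add_slot/get_slot for per-variable state:

  import tensorflow as tf

  class SimpleMomentum(tf.keras.optimizers.Optimizer):
    """Toy SGD-with-momentum optimizer, written only to illustrate the overrides."""

    def __init__(self, learning_rate=0.01, momentum=0.9, name="SimpleMomentum", **kwargs):
      super().__init__(name, **kwargs)
      # _set_hyper accepts floats, callables or tensors; `lr` kept for backward compat.
      self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
      self._set_hyper("momentum", momentum)

    def _create_slots(self, var_list):
      # One per-variable state ("slot") variable holding each weight's velocity.
      for var in var_list:
        self.add_slot(var, "velocity")

    def _resource_apply_dense(self, grad, var, apply_state=None):
      lr = tf.cast(self._get_hyper("learning_rate"), var.dtype)
      mu = tf.cast(self._get_hyper("momentum"), var.dtype)
      v = self.get_slot(var, "velocity")
      v_new = v.assign(mu * v - lr * grad)   # update the velocity first
      return var.assign_add(v_new)           # then apply it to the weight

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
      # `grad` only covers the rows selected by `indices` (e.g. an Embedding lookup).
      lr = tf.cast(self._get_hyper("learning_rate"), var.dtype)
      mu = tf.cast(self._get_hyper("momentum"), var.dtype)
      v = self.get_slot(var, "velocity")
      v_rows = mu * tf.gather(v, indices) - lr * grad
      v_update = v.scatter_update(tf.IndexedSlices(v_rows, indices))
      with tf.control_dependencies([v_update]):
        return var.scatter_add(tf.IndexedSlices(v_rows, indices))

    def get_config(self):
      config = super().get_config()
      config.update({
          "learning_rate": self._serialize_hyperparameter("learning_rate"),
          "momentum": self._serialize_hyperparameter("momentum"),
      })
      return config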


_get_hyper vs. _set_hyper: they enable setting and getting Python literals (int, str, etc), callables, and tensors. They exist largely for convenience: anything set via _set_hyper can be retrieved via _get_hyper, avoiding repeating boilerplate code. I dedicated a Q&A to it here.
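
A quick usage sketch (again hypothetical, building on the SimpleMomentum class above): anything stored with _set_hyper can be read back with _get_hyper, optionally cast to a dtype, and the subclass plugs straight into compile()/fit():

  opt = SimpleMomentum(learning_rate=0.05)
  opt._set_hyper("momentum", 0.8)                # overwrite a hyperparameter after construction
  print(opt._get_hyper("momentum"))              # reads back 0.8 (possibly wrapped as a tf.Variable)
  print(opt._get_hyper("momentum", tf.float32))  # same value, cast to the requested dtype

  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
  model.compile(optimizer=opt, loss="mse")
  model.fit(tf.random.normal((32, 4)), tf.random.normal((32, 1)), epochs=1, verbose=0)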

answered by OverLordGoldDragon


  1. Yes, this looks to be a documentation error. The leading-underscore names are the correct methods to override. Related is the non-Keras Optimizer, which has these all defined but not implemented in the base class (the tf.keras counterparts are sketched after this list): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py
  def _create_slots(self, var_list):
    """Create all slots needed by the variables.
    Args:
      var_list: A list of `Variable` objects.
    """
    # No slots needed by default
    pass

  def _resource_apply_dense(self, grad, handle):
    """Add ops to apply dense gradients to the variable `handle`.
    Args:
      grad: a `Tensor` representing the gradient.
      handle: a `Tensor` of dtype `resource` which points to the variable
       to be updated.
    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()

  def _resource_apply_sparse(self, grad, handle, indices):
    """Add ops to apply sparse gradients to the variable `handle`.
    Similar to `_apply_sparse`, the `indices` argument to this method has been
    de-duplicated. Optimizers which deal correctly with non-unique indices may
    instead override `_resource_apply_sparse_duplicate_indices` to avoid this
    overhead.
    Args:
      grad: a `Tensor` representing the gradient for the affected indices.
      handle: a `Tensor` of dtype `resource` which points to the variable
       to be updated.
      indices: a `Tensor` of integral type representing the indices for
       which the gradient is nonzero. Indices are unique.
    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()
  2. I don't know about apply_gradients. For one thing, if you do override it, the code mentions that a per-replica DistributionStrategy could be "dangerous"
    # TODO(isaprykin): When using a DistributionStrategy, and when an
    # optimizer is created in each replica, it might be dangerous to
    # rely on some Optimizer methods.  When such methods are called on a
    # per-replica optimizer, an exception needs to be thrown.  We do
    # allow creation per-replica optimizers however, because the
    # compute_gradients()->apply_gradients() sequence is safe.
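
Note that the signatures quoted in point 1 come from the non-Keras (tf.compat.v1) Optimizer. In the tf.keras OptimizerV2 base class the stubs are analogous but, in recent 2.x releases, also receive an apply_state dict of precomputed coefficients, which a subclass may accept as an optional argument. The following is a sketch of the shapes of the overrides as used by e.g. tf.keras.optimizers.Adam, not a copy of the source:

  # Shapes of the overrides in a tf.keras (OptimizerV2) subclass:
  def _create_slots(self, var_list):
    """Create per-variable state, e.g. self.add_slot(var, "m")."""

  def _resource_apply_dense(self, grad, var, apply_state=None):
    """Update `var` from the dense gradient `grad`; return the update op."""

  def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
    """Update only the rows of `var` selected by `indices`; return the update op."""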
answered by Tyler