Suppose I want to write a custom optimizer class that conforms to the `tf.keras` API (using TensorFlow version >= 2.0). I am confused about the documented way to do this versus what's done in implementations.

The documentation for `tf.keras.optimizers.Optimizer` states:
### Write a customized optimizer.
If you intend to create your own optimization algorithm, simply inherit from
this class and override the following methods:
- `resource_apply_dense` (update variable given gradient tensor is dense)
- `resource_apply_sparse` (update variable given gradient tensor is sparse)
- `create_slots` (if your optimizer algorithm requires additional variables)
However, the current `tf.keras.optimizers.Optimizer` implementation does not define a `resource_apply_dense` method, but it does define a private-looking `_resource_apply_dense` method stub. Similarly, there are no `resource_apply_sparse` or `create_slots` methods, but there are a `_resource_apply_sparse` method stub and a `_create_slots` method call.
In official `tf.keras.optimizers.Optimizer` subclasses (using `tf.keras.optimizers.Adam` as an example), there are `_resource_apply_dense`, `_resource_apply_sparse`, and `_create_slots` methods, and there are no such methods without the leading underscore. There are similar leading-underscore methods in slightly-less-official `tf.keras.optimizers.Optimizer` subclasses (e.g., `tfa.optimizers.MovingAverage` from TensorFlow Addons: `_resource_apply_dense`, `_resource_apply_sparse`, `_create_slots`).
Another confounding point for me is that some of the TensorFlow Addons optimizers also override the `apply_gradients` method (e.g., `tfa.optimizers.MovingAverage`), whereas the `tf.keras.optimizers` optimizers do not.

Moreover, I noticed that the `apply_gradients` method of `tf.keras.optimizers.Optimizer` calls `_create_slots`, but the base `tf.keras.optimizers.Optimizer` class does not have a `_create_slots` method. So, it seems that a `_create_slots` method must be defined in an optimizer subclass if that subclass does not override `apply_gradients`.
What is the correct way to subclass a `tf.keras.optimizers.Optimizer`? Specifically,

1. Does the `tf.keras.optimizers.Optimizer` documentation listed at the top simply mean to override the leading-underscore versions of the methods they mention (e.g., `_resource_apply_dense` instead of `resource_apply_dense`)? If so, are there any API guarantees about these private-looking methods not changing their behavior in future versions of TensorFlow? What are the signatures of these methods?
2. When would one override `apply_gradients` in addition to the `_apply_resource_[dense|sparse]` methods?

Edit: Opened issue on GitHub: #36449
For most (custom) optimizer implementations, the method apply_gradients() needs to be adapted. This method relies on the (new) Optimizer (class), which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().
Update: TF 2.2 forced me to clean up all the implementations - so now they can be used as a reference for TF best practices. Also added a section below on `_get_hyper` vs. `_set_hyper`.
I've implemented Keras AdamW in all major TF & Keras versions - I invite you to examine optimizers_v2.py. Several points:

- Inherit from `OptimizerV2`, which is actually what you linked; it's the latest and current base class for `tf.keras` optimizers.
- `apply_gradients` (or any other method) is only overridden if the default doesn't accomplish what's needed for a given optimizer; in your linked example, it's just a one-liner addon to the original.
- "So, it seems that a `_create_slots` method must be defined in an optimizer subclass if that subclass does not override `apply_gradients`" - the two are unrelated; it's coincidental.

**What is the difference between `_resource_apply_dense` and `_resource_apply_sparse`?** The latter deals with sparse layers - e.g. `Embedding` - and the former with everything else; example.
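To make the dense/sparse distinction concrete, here is a minimal sketch of a plain-SGD subclass (hypothetical class name `PlainSGD`, written against the TF 2.x `OptimizerV2` API); the two hooks differ only in how the update is written back to the variable:

```python
import tensorflow as tf

class PlainSGD(tf.keras.optimizers.Optimizer):
    """Hypothetical minimal optimizer: w <- w - lr * grad."""

    def __init__(self, learning_rate=0.01, name="PlainSGD", **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # Dense path: `grad` is an ordinary tensor covering the whole variable.
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        return var.assign_sub(lr * grad)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        # Sparse path (e.g. gradients from an Embedding lookup): only the rows
        # listed in `indices` carry gradient values.
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        return var.scatter_sub(tf.IndexedSlices(lr * grad, indices))

    def get_config(self):
        config = super().get_config()
        config["learning_rate"] = self._serialize_hyperparameter("learning_rate")
        return config
```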
**When should `_create_slots()` be used?** When defining trainable `tf.Variable`s; example: weights' first and second order moments (e.g. Adam). It uses `add_slot()`.
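For instance, a momentum-style update needs one extra variable per weight; a rough sketch (hypothetical names, dense path only) of creating that slot in `_create_slots` and reading it back with `get_slot()`:

```python
import tensorflow as tf

class MomentumSketch(tf.keras.optimizers.Optimizer):
    """Hypothetical: keeps one extra variable (a "momentum" slot) per weight."""

    def __init__(self, learning_rate=0.01, momentum=0.9, name="MomentumSketch", **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)
        self._set_hyper("momentum", momentum)

    def _create_slots(self, var_list):
        # One slot (an extra tf.Variable, zero-initialized by default) per
        # trainable variable; apply_gradients calls this lazily before updating.
        for var in var_list:
            self.add_slot(var, "momentum")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        beta = self._get_hyper("momentum", var.dtype.base_dtype)
        m = self.get_slot(var, "momentum")   # fetch the slot created above
        m.assign(beta * m + grad)            # accumulate the running momentum
        return var.assign_sub(lr * m)
```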
**`_get_hyper` vs. `_set_hyper`**: they enable setting and getting Python literals (`int`, `str`, etc.), callables, and tensors. They exist largely for convenience: anything set via `_set_hyper` can be retrieved via `_get_hyper`, avoiding repeating boilerplate code. I dedicated a Q&A to it here.
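A quick demonstration of the pair on a stock optimizer (these are private `OptimizerV2` helpers, so no API stability is implied):

```python
import tensorflow as tf

opt = tf.keras.optimizers.Adam(learning_rate=1e-3)

# _set_hyper stores a Python literal, a callable (e.g. a LearningRateSchedule),
# or a tensor under a name; _get_hyper reads it back, optionally cast to a dtype.
opt._set_hyper("my_coef", 0.5)                      # hypothetical extra hyperparameter
print(opt._get_hyper("my_coef"))                    # 0.5
print(opt._get_hyper("learning_rate", tf.float32))  # tf.Tensor(0.001, dtype=float32)
```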
For reference, the stubs' signatures and docstrings:

```python
def _create_slots(self, var_list):
    """Create all slots needed by the variables.

    Args:
      var_list: A list of `Variable` objects.
    """
    # No slots needed by default
    pass


def _resource_apply_dense(self, grad, handle):
    """Add ops to apply dense gradients to the variable `handle`.

    Args:
      grad: a `Tensor` representing the gradient.
      handle: a `Tensor` of dtype `resource` which points to the variable
        to be updated.

    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()


def _resource_apply_sparse(self, grad, handle, indices):
    """Add ops to apply sparse gradients to the variable `handle`.

    Similar to `_apply_sparse`, the `indices` argument to this method has been
    de-duplicated. Optimizers which deal correctly with non-unique indices may
    instead override `_resource_apply_sparse_duplicate_indices` to avoid this
    overhead.

    Args:
      grad: a `Tensor` representing the gradient for the affected indices.
      handle: a `Tensor` of dtype `resource` which points to the variable
        to be updated.
      indices: a `Tensor` of integral type representing the indices for
        which the gradient is nonzero. Indices are unique.

    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()
```
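For context, here is a small end-to-end snippet (using the built-in `SGD` as a stand-in for a custom subclass) showing how these hooks get exercised: `apply_gradients` calls `_create_slots` first, then routes each (gradient, variable) pair to `_resource_apply_sparse` when the gradient is an `IndexedSlices` (as produced by the `Embedding` lookup) and to `_resource_apply_dense` otherwise:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),  # sparse gradients
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),                                 # dense gradients
])
opt = tf.keras.optimizers.SGD(0.01)  # a custom subclass is driven the same way

x = tf.random.uniform((4, 10), maxval=100, dtype=tf.int32)
y = tf.random.uniform((4, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Dispatches to the _resource_apply_{dense,sparse} hooks of the optimizer.
opt.apply_gradients(zip(grads, model.trainable_variables))
```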
As for `apply_dense`: for one thing, if you do override it, the code mentions that a per-replica DistributionStrategy could be "dangerous":

```python
# TODO(isaprykin): When using a DistributionStrategy, and when an
# optimizer is created in each replica, it might be dangerous to
# rely on some Optimizer methods. When such methods are called on a
# per-replica optimizer, an exception needs to be thrown. We do
# allow creation per-replica optimizers however, because the
# compute_gradients()->apply_gradients() sequence is safe.
```
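To illustrate the "one-liner addon" style of override mentioned earlier (as opposed to reimplementing the update itself), a hypothetical sketch that only wraps the stock `apply_gradients`:

```python
import tensorflow as tf

class LoggingSGD(tf.keras.optimizers.SGD):
    """Hypothetical subclass: adds behavior around apply_gradients, then
    delegates the actual variable updates to the base class."""

    def apply_gradients(self, grads_and_vars, name=None, **kwargs):
        grads_and_vars = list(grads_and_vars)
        tf.print("applying gradients to", len(grads_and_vars), "variables")
        return super().apply_gradients(grads_and_vars, name=name, **kwargs)
```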