I am working on my own implementation of DeepMind's DQN paper in TensorFlow and am running into difficulty with clipping the loss function.
Here is an excerpt from the Nature paper describing the loss clipping:
We also found it helpful to clip the error term from the update to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.
(link to full paper: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)
What I have tried so far is using
clipped_loss_vec = tf.clip_by_value(loss, -1, 1)
to clip the loss I calculate between -1 and +1. The agent is not learning the proper policy in this case. I printed out the gradients of the network and realized that if the loss falls below -1, the gradients all suddenly turn to 0!
My reasoning for this is that the clipped loss is a constant function on (-inf, -1) U (1, inf), so it has zero gradient in those regions. This in turn means the gradients throughout the network are zero (think of it this way: whatever input image I feed the network, the loss stays at -1 in the local neighborhood because it has been clipped).
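A minimal TensorFlow 1.x-style sketch that reproduces what I am seeing (toy values; the names here are just for illustration, not my actual code):

import tensorflow as tf

q = tf.Variable(5.0)                    # predicted Q-value
target = tf.constant(0.0)               # target Q-value
loss = tf.square(q - target)            # squared error = 25, well outside [-1, 1]
clipped_loss = tf.clip_by_value(loss, -1.0, 1.0)

grad = tf.gradients(loss, [q])[0]                   # 2 * (q - target) = 10
clipped_grad = tf.gradients(clipped_loss, [q])[0]   # 0, the clipped loss is flat here

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad, clipped_grad]))           # [10.0, 0.0]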
So, my question is two parts:
What exactly did DeepMind mean in the excerpt? Did they mean that loss below -1 is clipped to -1 and loss above +1 is clipped to +1? If so, how did they deal with the gradients (i.e., what is all that part about absolute value functions)?
How should I implement loss clipping in TensorFlow such that the gradients do not go to zero outside the clipped range (but perhaps stay at +1 and -1)? Thanks!
I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
Equivalently: define a function f as follows:
f(x) = x^2          if x in [-0.5, 0.5]
f(x) = |x| - 0.25   if x < -0.5 or x > 0.5
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value |s|.
Notice that the derivative of f has the nice property of always lying in the range [-1, 1]:
f'(x) = 2x   if x in [-0.5, 0.5]
f'(x) = +1   if x > +0.5
f'(x) = -1   if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result is the same as computing the gradient of a squared loss and then clipping it. In other words, what they're doing is effectively replacing the squared loss with a Huber loss: the function f is just two times the Huber loss for delta = 0.5.
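To make this concrete, here is a small sketch (plain Python/NumPy, with made-up sample points) that evaluates f next to the standard Huber loss and shows that f is exactly twice the Huber loss with delta = 0.5:

import numpy as np

def f(x):
    # quadratic near zero, linear with slope 1 outside [-0.5, 0.5]
    return np.where(np.abs(x) <= 0.5, x ** 2, np.abs(x) - 0.25)

def huber(x, delta=0.5):
    # standard Huber loss
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * (np.abs(x) - 0.5 * delta))

xs = np.linspace(-3.0, 3.0, 7)
print(f(xs))          # [2.75 1.75 0.75 0.   0.75 1.75 2.75]
print(2 * huber(xs))  # identical values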
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1, 1] before doing the gradient descent update step.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it unchanged in the gradient descent update step.
The former is easy to implement. The latter has nice properties (it improves stability; it's better than an absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of squared loss with the stability and robustness of the Huber loss.
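A minimal sketch of the first alternative in TensorFlow 1.x (the name loss and the learning rate are placeholders, not taken from the question):

# compute gradients as usual, clip each component to [-1, 1], then apply them
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.00025)
grads_and_vars = optimizer.compute_gradients(loss)
clipped_grads_and_vars = [(tf.clip_by_value(g, -1.0, 1.0), v)
                          for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped_grads_and_vars)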
First of all, the code for the paper is available online, which constitutes an invaluable reference.
If you take a look at the code, you will see that in nql:getQUpdate (NeuralQLearner.lua, line 180) they clip the error term of the Q-learning function:
-- delta = r + (1-terminal) * gamma * max_a Q(s2, a) - Q(s, a)
if self.clip_delta then
    delta[delta:ge(self.clip_delta)] = self.clip_delta
    delta[delta:le(-self.clip_delta)] = -self.clip_delta
end
In TensorFlow, assuming the last layer of your neural network is called self.output, self.actions is a one-hot encoding of all actions, self.q_targets_ is a placeholder with the targets, and self.q is your computed Q:
# The loss function
one = tf.constant(1.0)
delta = self.q - self.q_targets_
absolute_delta = tf.abs(delta)
delta = tf.where(
    absolute_delta < one,
    tf.square(delta),     # squared error inside (-1, 1)
    tf.ones_like(delta)   # clipped squared error: (+/-1)^2 = 1
)
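To train on this you would still reduce it to a scalar, for example loss = tf.reduce_mean(delta) (a reasonable assumption, not spelled out in the original snippet).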
Or, using tf.clip_by_value (an implementation closer to the original):
delta = tf.clip_by_value(
    self.q - self.q_targets_,
    -1.0,
    +1.0
)
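If you specifically want the behaviour asked about in the question, i.e. gradients that saturate at +1/-1 outside the interval instead of vanishing, one common option (sketched here as a suggestion, not taken from the answer above) is a Huber-style loss:

delta = self.q - self.q_targets_
loss = tf.reduce_mean(tf.where(
    tf.abs(delta) < 1.0,
    0.5 * tf.square(delta),   # quadratic region: gradient is delta
    tf.abs(delta) - 0.5       # linear region: gradient is +1 or -1
))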