I'm currently trying to implement DDPG in Keras. I know how to update the critic network (the normal DQN update), but I'm stuck on updating the actor network, which uses this equation:

dJ/dtheta = dQ/da * da/dtheta

So, in order to reduce the loss of the actor network with respect to its weights, dJ/dtheta is computed with the chain rule: dQ/da (from the critic network) * da/dtheta (from the actor network).
This looks fine, but I'm having trouble understanding how to derive the gradients from those 2 networks. Could someone perhaps explain this part to me?
So the main intuition is that here, J is something you want to maximize instead of minimize. Therefore, we can call it an objective function instead of a loss function. The equation simplifies down to:
dJ/dtheta = dQ/da * da/dtheta = dQ/dtheta
Meaning you want to change the parameters theta in order to change Q. Since in RL we want to maximize Q, for this part we do gradient ascent instead. To do this, you just perform gradient descent, but feed in the gradients negated.
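In TF2 / Keras terms, one common way to express this is to make the actor "loss" the negative mean Q-value, so a regular optimizer step is exactly gradient descent on the negated gradients. This is just a minimal sketch, assuming `actor` and `critic` are already-built Keras models (with `actor` mapping states to actions and `critic` taking [state, action] as its inputs); those names are placeholders, not anything from your code:

```python
import tensorflow as tf

# Hypothetical models: `actor` maps states -> actions,
# `critic` maps (state, action) pairs -> Q-values.
actor_optimizer = tf.keras.optimizers.Adam(1e-4)

def update_actor(states):
    with tf.GradientTape() as tape:
        actions = actor(states, training=True)
        q_values = critic([states, actions], training=True)
        # Gradient ascent on Q == gradient descent on -Q,
        # so the actor "loss" is just the negative mean Q-value.
        actor_loss = -tf.reduce_mean(q_values)
    # Differentiate only w.r.t. the actor's weights; the critic is left untouched here.
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```

The tape backpropagates through the critic and into the actor automatically, which is exactly the dQ/da * da/dtheta chain; only the actor's variables are updated in this step.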
To derive the gradients, take dQ/da from the critic (differentiate its Q output with respect to its action input), backpropagate that through the actor so it gets chained with da/dtheta, and then average over the mini-batch by dividing every element of the resulting gradient J by the batch size, i.e.,

for j in J:
    j / batch_size

(see the sketch below for one way to do this in code).
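If you want to compute the two factors explicitly, here is a sketch under the same assumptions as above (`actor`, `critic`, and `actor_optimizer` are placeholders): one tape gives dQ/da from the critic, and a second tape backpropagates it through the actor via `output_gradients`; dividing dQ/da by the batch size performs the averaging, and the minus sign turns descent into ascent on Q:

```python
import tensorflow as tf

# Same hypothetical `actor`, `critic`, `actor_optimizer` as in the sketch above.
def update_actor_explicit(states):
    batch_size = tf.cast(tf.shape(states)[0], tf.float32)

    # dQ/da: gradient of the critic's Q output w.r.t. the action input.
    actions = actor(states, training=True)
    with tf.GradientTape() as tape_q:
        tape_q.watch(actions)
        q_values = critic([states, actions], training=True)
    dq_da = tape_q.gradient(q_values, actions)  # shape: (batch, action_dim)

    # da/dtheta chained with dQ/da via output_gradients.
    # tape.gradient sums over the batch, so dividing dQ/da by the batch size
    # gives the batch average, and negating it makes descent act as ascent on Q.
    with tf.GradientTape() as tape_a:
        actions = actor(states, training=True)
    dj_dtheta = tape_a.gradient(
        actions,
        actor.trainable_variables,
        output_gradients=-dq_da / batch_size,
    )
    actor_optimizer.apply_gradients(zip(dj_dtheta, actor.trainable_variables))
```

Both versions should give the same update; the first is shorter, while this one makes the dQ/da and da/dtheta pieces (and the batch-size division) visible.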
I hope this makes sense! I also had a hard time understanding this concept, and am still a little fuzzy on some parts to be completely honest. Let me know if I can clarify anything!