Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art method for reinforcement learning when the action space is continuous. Its core algorithm is the Deterministic Policy Gradient (DPG).
However, after reading the papers and listening to the talk (http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/), I still cannot figure out what the fundamental advantage of Deterministic PG over Stochastic PG is. The talk says it is more suitable for high-dimensional actions and easier to train, but why is that?
Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
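To make that concrete, here is a minimal sketch of the two DDPG updates, assuming PyTorch. The network sizes, learning rates, tau, and the helper name ddpg_update are illustrative choices of mine, not values from the papers.

```python
# Minimal DDPG update sketch (illustrative hyperparameters, not from the paper).
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One step on a batch of off-policy transitions; r and done have shape (B, 1)."""
    # Critic: regress Q(s, a) onto the Bellman target built from the target networks.
    with torch.no_grad():
        q_next = critic_targ(torch.cat([s2, actor_targ(s2)], dim=-1))
        target = r + gamma * (1 - done) * q_next
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, mu(s)), i.e. the deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the learned networks with the target networks (Polyak averaging).
    with torch.no_grad():
        for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
        for p, p_targ in zip(critic.parameters(), critic_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```

The critic update uses the Bellman equation on off-policy data, and the actor update only needs the gradient of the critic with respect to the action, which is exactly what makes the deterministic formulation convenient.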
A further advantage of policy gradient methods is that they remain effective in high-dimensional action spaces and with continuous actions.
In summary, a stochastic policy gradient algorithm repeatedly does two things: it updates the parameters θ of the actor π in the direction of the gradient of the performance J(θ), and it updates the parameters w of the critic with regular temporal-difference learning, which I introduced when covering deep Q-learning.
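As a rough illustration of those two updates, here is a sketch of one stochastic actor-critic step, again assuming PyTorch; the Gaussian policy, network sizes, and names like policy_net and value_net are my own illustrative choices.

```python
# One stochastic actor-critic step: TD(0) critic, score-function actor update.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent Gaussian std
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=3e-4)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s2, done):
    """Batch tensors: s, s2 are (B, obs_dim); a is (B, act_dim); r, done are (B,)."""
    # Critic update: one-step temporal-difference target.
    with torch.no_grad():
        td_target = r + gamma * (1 - done) * value_net(s2).squeeze(-1)
    td_error = td_target - value_net(s).squeeze(-1)
    critic_loss = (td_error ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: follow the gradient of J(theta) via the score function
    # grad log pi(a|s), with the TD error standing in for the critic's signal.
    dist = torch.distributions.Normal(policy_net(s), log_std.exp())
    logp = dist.log_prob(a).sum(-1)
    actor_loss = -(logp * td_error.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```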
A policy is a function that can be either deterministic or stochastic. It dictates what action to take in a given state. A stochastic policy is written as a distribution π(a∣s), while a deterministic policy is a mapping π:S→A, where S is the set of possible states and A is the set of possible actions.
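A toy sketch of the difference, assuming PyTorch; the 4-dimensional state and 1-dimensional action are made up for the example.

```python
# Deterministic vs. stochastic policy on a toy state space.
import torch
import torch.nn as nn

net = nn.Linear(4, 1)   # maps a 4-dim state to a 1-dim action (or action mean)

def deterministic_policy(s):
    # pi: S -> A, a single action for each state
    return torch.tanh(net(s))

def stochastic_policy(s):
    # pi(a|s), a distribution over actions for each state; here a Gaussian
    return torch.distributions.Normal(loc=net(s), scale=0.1)

s = torch.randn(4)
a_det = deterministic_policy(s)          # always the same action for this s
a_sto = stochastic_policy(s).sample()    # a different sample on each call
```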
The main motivation for policy gradient methods is to handle continuous action spaces, which are difficult for Q-learning because it requires a global maximization of Q over the action space at every step.
SPG can handle continuous action spaces because it represents the policy as a continuous probability distribution over actions. Since the policy is a distribution, the gradient of the expected return involves an integral over actions, and SPG resorts to importance sampling to estimate this integral.
DPG instead represents the policy as a deterministic mapping from states to actions. It can do this because it is not taking the action with the globally greatest Q; it selects actions according to the deterministic mapping (when on-policy) and shifts this mapping along the gradient of Q (both on- and off-policy). The gradient of the expected return then takes a form that does not require an integral over actions, which makes it easier to compute.
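For concreteness, these are the two gradients as stated in the Deterministic Policy Gradient Algorithms paper, where ρ^π and ρ^μ denote the discounted state distributions under the stochastic policy π_θ and the deterministic policy μ_θ:

\[
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi,\; a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
\]

\[
\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
\]

The stochastic gradient is an expectation over both states and actions, whereas the deterministic gradient is an expectation over states only, which is why it avoids the integral over actions and tends to need fewer samples in high-dimensional action spaces.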
One could say that moving from a stochastic policy to a deterministic policy seems like a step back. But the stochastic policy was introduced mainly to handle continuous action spaces in the first place, and the deterministic policy simply provides another way to handle them.
My observations are drawn from these papers:
Deterministic Policy Gradient Algorithms (Silver et al., 2014)
Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 2000)
Continuous Control with Deep Reinforcement Learning (Lillicrap et al., 2016)