Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art method for reinforcement learning when the action space is continuous. Its core algorithm is the Deterministic Policy Gradient (DPG).
However, after reading the papers and listening to the talk (http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/), I still cannot figure out what the fundamental advantage of Deterministic PG over Stochastic PG is. The talk says it is more suitable for high-dimensional actions and easier to train, but why is that?
Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
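To make that concrete, here is a minimal sketch of the two DDPG updates, assuming PyTorch. The network sizes, learning rates, tau, and the helper name ddpg_update are illustrative choices of mine, not values from the papers.

```python
# Minimal DDPG update sketch (illustrative hyperparameters, not from the paper).
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One step on a batch of off-policy transitions; r and done have shape (B, 1)."""
    # Critic: regress Q(s, a) onto the Bellman target built from the target networks.
    with torch.no_grad():
        q_next = critic_targ(torch.cat([s2, actor_targ(s2)], dim=-1))
        target = r + gamma * (1 - done) * q_next
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, mu(s)), i.e. the deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the learned networks with the target networks (Polyak averaging).
    with torch.no_grad():
        for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
        for p, p_targ in zip(critic.parameters(), critic_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```

The critic update uses the Bellman equation on off-policy data, and the actor update only needs the gradient of the critic with respect to the action, which is exactly what makes the deterministic formulation convenient.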
A further advantage of policy gradient methods is that they remain effective in high-dimensional action spaces and with continuous actions.
In summary, a stochastic policy gradient algorithm repeatedly does two things: it updates the parameters θ of the actor π in the direction of the gradient of the performance J(θ), and it updates the parameters w of the critic with regular temporal-difference learning, which I introduced when covering deep Q-learning.
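As a rough illustration of those two updates, here is a sketch of one stochastic actor-critic step, again assuming PyTorch; the Gaussian policy, network sizes, and names like policy_net and value_net are my own illustrative choices.

```python
# One stochastic actor-critic step: TD(0) critic, score-function actor update.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent Gaussian std
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=3e-4)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s2, done):
    """Batch tensors: s, s2 are (B, obs_dim); a is (B, act_dim); r, done are (B,)."""
    # Critic update: one-step temporal-difference target.
    with torch.no_grad():
        td_target = r + gamma * (1 - done) * value_net(s2).squeeze(-1)
    td_error = td_target - value_net(s).squeeze(-1)
    critic_loss = (td_error ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: follow the gradient of J(theta) via the score function
    # grad log pi(a|s), with the TD error standing in for the critic's signal.
    dist = torch.distributions.Normal(policy_net(s), log_std.exp())
    logp = dist.log_prob(a).sum(-1)
    actor_loss = -(logp * td_error.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```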
A policy is a function that can be either deterministic or stochastic. It dictates what action to take in a given state. A stochastic policy is written as a distribution π(a∣s), while a deterministic policy is a mapping π:S→A, where S is the set of possible states and A is the set of possible actions.
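A toy sketch of the difference, assuming PyTorch; the 4-dimensional state and 1-dimensional action are made up for the example.

```python
# Deterministic vs. stochastic policy on a toy state space.
import torch
import torch.nn as nn

net = nn.Linear(4, 1)   # maps a 4-dim state to a 1-dim action (or action mean)

def deterministic_policy(s):
    # pi: S -> A, a single action for each state
    return torch.tanh(net(s))

def stochastic_policy(s):
    # pi(a|s), a distribution over actions for each state; here a Gaussian
    return torch.distributions.Normal(loc=net(s), scale=0.1)

s = torch.randn(4)
a_det = deterministic_policy(s)          # always the same action for this s
a_sto = stochastic_policy(s).sample()    # a different sample on each call
```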
The main motivation for policy gradient methods is to handle continuous action spaces, which are difficult for Q-learning because it requires a global maximization of Q over the action space at every step.
SPG can handle continuous action spaces because it represents the policy as a continuous probability distribution over actions. Since the policy is a distribution, the gradient of the expected return involves an integral over actions, and SPG resorts to importance sampling to estimate this integral.
DPG instead represents the policy as a deterministic mapping from states to actions. It can do this because it is not taking the action with the globally greatest Q; it selects actions according to the deterministic mapping (when on-policy) and shifts this mapping along the gradient of Q (both on- and off-policy). The gradient of the expected return then takes a form that does not require an integral over actions, which makes it easier to compute.
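For concreteness, these are the two gradients as stated in the Deterministic Policy Gradient Algorithms paper, where ρ^π and ρ^μ denote the discounted state distributions under the stochastic policy π_θ and the deterministic policy μ_θ:

\[
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi,\; a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
\]

\[
\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
\]

The stochastic gradient is an expectation over both states and actions, whereas the deterministic gradient is an expectation over states only, which is why it avoids the integral over actions and tends to need fewer samples in high-dimensional action spaces.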
One could say that moving from a stochastic policy to a deterministic policy seems like a step back. But the stochastic policy was introduced mainly to handle continuous action spaces in the first place, and the deterministic policy simply provides another way to handle them.
My observations are drawn from these papers:
Deterministic Policy Gradient Algorithms (Silver et al., 2014)
Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 2000)
Continuous Control with Deep Reinforcement Learning (Lillicrap et al., 2016)