What is the advantage of Deterministic Policy Gradient over Stochastic Policy Gradient?

Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art method for reinforcement learning with continuous action spaces. Its core algorithm is the Deterministic Policy Gradient (DPG).

However, after reading the papers and watching the talk (http://techtalks.tv/talks/deterministic-policy-gradient-algorithms/61098/), I still cannot figure out what the fundamental advantage of Deterministic PG over Stochastic PG is. The talk says it is more suitable for high-dimensional actions and easier to train, but why is that?

asked Mar 13 '17 by DarkZero


People also ask

What is deterministic policy gradient?

Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.

What is advantage in policy gradient?

Policy gradients are more effective in high-dimensional action spaces, and when using continuous actions.

What is stochastic policy gradient?

In summary, a stochastic policy gradient algorithm tries to: update the parameters θ of the actor π towards the gradient of the performance J(θ), and update the parameters w of the critic with regular temporal-difference learning algorithms.

What is deterministic policy in reinforcement learning?

A policy is a function that can be either deterministic or stochastic. It dictates what action to take given a particular state. The distribution π(a∣s) is used for a stochastic policy and a mapping function π:S→A is used for a deterministic policy, where S is the set of possible states and A is the set of possible actions.


1 Answer

The main motivation for policy gradient methods is to handle continuous action spaces, which are difficult for Q-learning because its greedy step requires a global maximization of Q over all actions.
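To make that concrete, here is a minimal toy sketch (not from the papers; the 1-D critic `q_continuous` is purely illustrative) of why the greedy step max_a Q(s, a) is trivial for discrete actions but turns into an inner optimization problem for continuous ones:

```python
import numpy as np

# Discrete actions: the greedy step in Q-learning is a cheap argmax over a few values.
q_values = np.array([0.1, 0.7, 0.3])          # Q(s, a) for three discrete actions
best_discrete = int(np.argmax(q_values))       # trivial

# Continuous actions: max_a Q(s, a) is itself an optimization problem.
def q_continuous(a):
    """Toy 1-D critic with its maximum at a = 0.42 (illustrative only)."""
    return -(a - 0.42) ** 2

# Even in 1-D we need a search inside every greedy step; with high-dimensional
# actions a grid like this becomes hopeless, which is the difficulty noted above.
grid = np.linspace(-1.0, 1.0, 1001)
best_continuous = grid[np.argmax(q_continuous(grid))]
print(best_discrete, best_continuous)
```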

SPG can handle continuous action spaces because it represents the policy as a continuous probability distribution over actions. Since the policy is a distribution, the gradient of the expected return involves an integral over actions, which SPG approximates by sampling actions from the policy; off-policy variants additionally resort to importance sampling.
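For reference, the stochastic policy gradient theorem from the Sutton et al. paper cited below has (in the usual notation, with ρ^π the discounted state distribution) the form

```latex
\nabla_\theta J(\pi_\theta)
  = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right],
```

so the expectation runs over both the state and the action space, and sample-based estimates of it become harder as the action dimensionality grows.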

DPG represents the policy as a deterministic mapping from states to actions. It can do this because it does not take the action with the globally greatest Q; instead it selects actions according to the deterministic mapping (when on-policy) while shifting that mapping along the gradient of Q (both on- and off-policy). The gradient of the expected return then takes a form that needs no integral over actions, which makes it easier to compute.
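Concretely, the deterministic policy gradient theorem from the Silver et al. paper cited below reads

```latex
\nabla_\theta J(\mu_\theta)
  = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\, \mathrm{d}s
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \right],
```

i.e. an expectation over states only; the action integral disappears because the policy puts all of its mass on μ_θ(s).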

One could say that changing from a stochastic policy back to a deterministic one seems like a step backwards. But the stochastic policy was originally introduced precisely to handle continuous action spaces; the deterministic policy now provides another way to handle them.
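As a rough illustration of how this plays out in DDPG, here is a minimal PyTorch-style sketch of the actor update (the network shapes and variable names are my own illustrative choices, not from the papers below): the actor is pushed along ∇_a Q simply by backpropagating through the critic, with no sampling over actions.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 2
# Deterministic actor mu(s) and Q-critic Q(s, a); both are toy networks.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)            # stand-in for a replay-buffer batch

# Deterministic policy gradient step: differentiate Q(s, mu(s)) w.r.t. the actor
# parameters via the chain rule -- no integral/expectation over actions needed.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()                          # grad_a Q flows back into grad_theta mu
actor_opt.step()                               # only the actor's parameters are updated
```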

These observations are drawn from the following papers:

Deterministic Policy Gradient Algorithms (Silver et al., 2014)

Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton et al., 2000)

Continuous Control with Deep Reinforcement Learning (Lillicrap et al., 2015)

answered Oct 25 '22 by Ronald Ku