
Are off-policy learning methods better than on-policy methods?

I cannot understand the fundamental difference between on-policy methods (like A3C) and off-policy methods (like DDPG). As far as I know, off-policy methods can learn the optimal policy regardless of the behaviour policy; they can learn from any trajectory observed in the environment. Therefore, can I say that off-policy methods are better than on-policy methods?

I have read the cliff-walking example showing the difference between SARSA and Q-learning. It says that Q-learning learns the optimal policy of walking right along the cliff, while SARSA learns to choose a safer path when an epsilon-greedy policy is used. But since Q-learning has already told us the optimal policy, why don't we just follow that policy instead of continuing to explore?

Plus, are there situations where one of the two kinds of methods is better than the other? In which cases would one prefer on-policy algorithms?

asked Mar 05 '17 by DarkZero


1 Answer

As you already said, off-policy methods can learn the optimal policy regardless of the behaviour policy (strictly speaking, the behaviour policy must satisfy some conditions, such as continuing to visit all state-action pairs), while on-policy methods require that the agent acts with the very policy that is being learnt.

Imagine a situation where you have a data set of trajectories (i.e., data in the form of tuples (s, a, r, s')) that was stored previously. This data was collected by applying a given policy, and you cannot change it. In this case, which is common in medical problems, you can only apply off-policy methods.
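To make that batch setting concrete, here is a minimal sketch of off-policy Q-learning applied to a fixed set of stored (s, a, r, s') tuples; the toy dataset, the state/action counts and the hyperparameters are made up purely for illustration.

```python
# Minimal sketch: batch, off-policy Q-learning from previously stored transitions.
# Dataset, sizes and hyperparameters are hypothetical.
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99

# Transitions (s, a, r, s_next) collected earlier by *some* behaviour policy.
dataset = [(0, 1, 0.0, 1), (1, 2, 1.0, 3), (3, 0, -1.0, 0)]  # toy data

Q = np.zeros((n_states, n_actions))

for _ in range(100):                      # sweep the fixed data repeatedly
    for s, a, r, s_next in dataset:
        # Q-learning target: bootstrap from the greedy action in s_next,
        # regardless of which action the behaviour policy actually took there.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

greedy_policy = Q.argmax(axis=1)          # policy derived from the learned Q
```

Note that the learner never needs to interact with the environment or to know which policy produced the tuples; that is exactly what makes the method off-policy.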

Does this mean that off-policy methods are better? Not necessarily. We can say that off-policy methods are more flexible in the type of problems they can handle. However, from a theoretical point of view, the two families have different properties, and sometimes the on-policy ones are more convenient. For instance, if we compare Q-learning with the SARSA algorithm, a key difference between them is the max operator in the Q-learning update rule. This operator is highly non-linear, which can make it more difficult to combine the algorithm with function approximators.
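For concreteness, here is a minimal sketch of the two update rules side by side; Q, alpha and gamma are assumed to be a tabular value array and the step-size/discount parameters as in the snippet above, and a_next stands for the action the agent actually selects (e.g. epsilon-greedily) in s_next.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: the max operator bootstraps from the greedy action in s_next,
    # independent of the action the agent will actually take next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstraps from a_next, the action actually selected by
    # the current behaviour policy in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

The single line that differs (max over actions versus the action actually taken) is what makes Q-learning off-policy and SARSA on-policy, and it is also the source of the cliff-walking behaviour described in the question.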

When is it better to use on-policy methods? Well, if you are facing a problem with a continuous state space and you are interested in using a linear function approximator (an RBF network, for instance), then it is usually more stable to use on-policy methods. You can find more information on this topic in the section on off-policy bootstrapping in Sutton and Barto's book.
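As a rough illustration of that on-policy setting, here is a minimal sketch of semi-gradient SARSA with a linear approximator over RBF features. The environment interface (env.reset/env.step), the feature centres and the hyperparameters are all hypothetical; it only shows the shape of the on-policy update with function approximation.

```python
# Minimal sketch: on-policy, semi-gradient SARSA with linear RBF features.
import numpy as np

n_actions = 3
centres = np.linspace(-1.0, 1.0, 20)      # assumed 1-D continuous state space
sigma, alpha, gamma, epsilon = 0.2, 0.01, 0.99, 0.1

def rbf_features(state):
    return np.exp(-((state - centres) ** 2) / (2 * sigma ** 2))

w = np.zeros((n_actions, len(centres)))   # one weight vector per action

def q_value(state, action):
    return w[action] @ rbf_features(state)

def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax([q_value(state, a) for a in range(n_actions)]))

def sarsa_episode(env):
    # env.reset()/env.step() follow a hypothetical interface:
    # reset() -> state, step(action) -> (next_state, reward, done)
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(s_next)
        # On-policy target: uses the action the agent will actually take next.
        target = r if done else r + gamma * q_value(s_next, a_next)
        td_error = target - q_value(s, a)
        w[a] += alpha * td_error * rbf_features(s)   # semi-gradient update
        s, a = s_next, a_next
```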

answered Oct 30 '22 by Pablo EM