I cannot understand the fundamental difference between on-policy methods (like A3C) and off-policy methods (like DDPG). As far as I know, off-policy methods can learn the optimal policy regardless of the behavior policy; they can learn by observing any trajectory in the environment. Can I therefore say that off-policy methods are better than on-policy methods?
I have read about the cliff-walking example showing the difference between SARSA and Q-learning. It says that Q-learning learns the optimal policy of walking along the cliff, while SARSA learns to choose a safer path when acting with an epsilon-greedy policy. But since Q-learning has already told us the optimal policy, why don't we just follow that policy instead of continuing to explore?
Also, are there situations where one kind of method is better than the other? In which cases would one prefer on-policy algorithms?
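(For concreteness, here is a minimal tabular sketch of the two update rules the cliff-walking example contrasts; the names `Q`, `alpha`, `gamma` and the epsilon-greedy helper are just illustrative, not taken from any particular implementation. The only difference is the bootstrap target.)

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Behaviour policy used by both algorithms in the cliff-walking example."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=0.99):
    # Off-policy target: bootstrap from the *greedy* action in s2,
    # regardless of what the behaviour policy will actually do there.
    target = r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.99):
    # On-policy target: bootstrap from a2, the action the epsilon-greedy
    # behaviour policy really takes in s2, so exploration risk is "priced in".
    target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (target - Q[s, a])
```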
Q-learning is an off-policy algorithm (Sutton & Barto, 1998), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy reinforcement learning algorithms are able to learn from data collected by any behavioral policy.
In practice, if you want to learn fast in a fast-iterating environment, Q-learning should be your choice. However, if mistakes are costly (unexpected failures are unacceptable, e.g. on a robot), then SARSA is the better option. If your state space is too large, try a deep Q-network (DQN).
Mixing on-policy targets with approximate off-policy targets may reduce the negative effects of the approximate update. In contrast, DQN implements a true off-policy update in a discrete action space and shows no benefit from mixed updates.
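To make "the target can be computed without consideration of how the experience was generated" concrete, here is a minimal sketch of a DQN-style target computed over a minibatch of replayed transitions; the function name, dictionary keys and shapes are assumptions for illustration, not from any specific library.

```python
import numpy as np

def dqn_targets(batch, q_target_values, gamma=0.99):
    """One-step Q-learning targets for a minibatch of replayed transitions.

    `batch` is a dict of arrays (keys "r", "s2", "done") sampled from a
    replay buffer that may have been filled by *any* behaviour policy;
    the target never asks which policy produced the data.
    `q_target_values` maps a batch of next states to an array of shape
    (batch_size, n_actions) from the (frozen) target network.
    """
    q_next = q_target_values(batch["s2"])        # (B, n_actions)
    max_q_next = q_next.max(axis=1)              # greedy bootstrap over actions
    return batch["r"] + gamma * (1.0 - batch["done"]) * max_q_next
```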
As you already said, off-policy methods can learn the optimal policy regardless of the behaviour policy (in fact, the behaviour policy must have some properties, e.g. sufficient exploration), while on-policy methods require that the agent act with the very policy that is being learnt.
Imagine a situation where you have a dataset of trajectories (i.e., data in the form of tuples (s, a, r, s')) that was stored previously. This data has been collected by applying a given policy, and you cannot change that. In such a case, which is common for medical problems, you can apply only off-policy methods.
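As a rough sketch of that setting (names and shapes are assumed for illustration), a tabular Q-learning agent can simply sweep over the stored tuples, never interacting with the environment:

```python
import numpy as np

def q_learning_from_log(dataset, n_states, n_actions,
                        alpha=0.1, gamma=0.99, sweeps=50):
    """Tabular Q-learning on a fixed dataset of (s, a, r, s2) tuples
    collected by some unknown behaviour policy (e.g. historical records).
    No new interaction with the environment is needed."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s2 in dataset:
            target = r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Note that the logged behaviour policy still has to cover the state-action pairs you care about (the "properties" mentioned above); pairs that never appear in the log are never updated.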
Does this mean that off-policy methods are better? Not necessarily. We can say that off-policy methods are more flexible in the type of problems they can face. However, from a theoretical point of view, the two families have different properties, and sometimes those of the on-policy methods are convenient. For instance, if we compare Q-learning with SARSA, a key difference between them is the max operator used in the Q-learning update rule. This operator is highly non-linear, which can make it more difficult to combine the algorithm with function approximators.
When is it better to use on-policy methods? Well, if you are facing a problem with a continuous state space and you are interested in using a linear function approximator (an RBF network, for instance), then it is more stable to use on-policy methods. You can find more information on this topic in the section on off-policy bootstrapping in Sutton and Barto's book.
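As a minimal sketch of that setting (the RBF centres, widths and the linear parameterisation here are assumptions for illustration), semi-gradient SARSA with a linear approximator over RBF features looks like this:

```python
import numpy as np

def rbf_features(state, centers, width=0.5):
    """Linear function approximation: one Gaussian RBF per centre."""
    d = np.linalg.norm(centers - state, axis=1)
    return np.exp(-(d / width) ** 2)

def semi_gradient_sarsa_update(w, phi, a, r, phi2, a2, alpha=0.01, gamma=0.99):
    """One semi-gradient SARSA(0) update with a linear approximator:
    Q(s, a) = w[a] . phi(s). The target uses a2, the action actually
    selected in s2 by the current (e.g. epsilon-greedy) policy, which is
    what keeps the update on-policy."""
    td_error = r + gamma * w[a2] @ phi2 - w[a] @ phi
    w[a] += alpha * td_error * phi
    return w
```

Here `w` is an (n_actions, n_features) weight matrix and `phi`, `phi2` are the RBF feature vectors of the current and next state; note there is no max over actions in the target, which is part of why this combination tends to be better behaved with linear features.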