I cannot understand the fundamental difference between on-policy methods (like A3C) and off-policy methods (like DDPG). As far as I know, off-policy methods can learn the optimal policy regardless of the behavior policy; they can learn by observing any trajectory in the environment. Can I therefore say that off-policy methods are better than on-policy methods?
I have read about the cliff-walking example showing the difference between SARSA and Q-learning. It says that Q-learning learns the optimal policy of walking along the cliff, while SARSA learns to choose a safer path when acting with an epsilon-greedy policy. But since Q-learning has already told us the optimal policy, why don't we just follow that policy instead of continuing to explore?
Also, are there situations where one kind of method is better than the other? In which cases would one prefer on-policy algorithms?
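(For concreteness, here is a minimal tabular sketch of the two update rules the cliff-walking example contrasts; the names `Q`, `alpha`, `gamma` and the epsilon-greedy helper are just illustrative, not taken from any particular implementation. The only difference is the bootstrap target.)

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Behaviour policy used by both algorithms in the cliff-walking example."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=0.99):
    # Off-policy target: bootstrap from the *greedy* action in s2,
    # regardless of what the behaviour policy will actually do there.
    target = r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.99):
    # On-policy target: bootstrap from a2, the action the epsilon-greedy
    # behaviour policy really takes in s2, so exploration risk is "priced in".
    target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (target - Q[s, a])
```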
Q-learning is an off-policy algorithm (Sutton & Barto, 1998), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy reinforcement learning algorithms are able to learn from data collected by any behavioral policy.
In practice, if you want to learn fast in a fast-iterating environment, Q-learning should be your choice. However, if mistakes are costly (unexpected failures are unacceptable, e.g. on a robot), then SARSA is the better option. If your state space is too large, try a deep Q-network (DQN).
Mixing on-policy targets with approximate off-policy targets may reduce the negative effects of the approximate update. In contrast, DQN implements a true off-policy update in a discrete action space and shows no benefit from mixed updates.
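To make "the target can be computed without consideration of how the experience was generated" concrete, here is a minimal sketch of a DQN-style target computed over a minibatch of replayed transitions; the function name, dictionary keys and shapes are assumptions for illustration, not from any specific library.

```python
import numpy as np

def dqn_targets(batch, q_target_values, gamma=0.99):
    """One-step Q-learning targets for a minibatch of replayed transitions.

    `batch` is a dict of arrays (keys "r", "s2", "done") sampled from a
    replay buffer that may have been filled by *any* behaviour policy;
    the target never asks which policy produced the data.
    `q_target_values` maps a batch of next states to an array of shape
    (batch_size, n_actions) from the (frozen) target network.
    """
    q_next = q_target_values(batch["s2"])        # (B, n_actions)
    max_q_next = q_next.max(axis=1)              # greedy bootstrap over actions
    return batch["r"] + gamma * (1.0 - batch["done"]) * max_q_next
```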
As you already said, off-policy methods can learn the optimal policy regardless of the behaviour policy (in fact, the behaviour policy must have some properties, e.g. sufficient exploration), while on-policy methods require that the agent act with the very policy that is being learnt.
Imagine a situation where you have a dataset of trajectories (i.e., data in the form of tuples (s, a, r, s')) that was stored previously. This data has been collected by applying a given policy, and you cannot change that. In such a case, which is common for medical problems, you can apply only off-policy methods.
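As a rough sketch of that setting (names and shapes are assumed for illustration), a tabular Q-learning agent can simply sweep over the stored tuples, never interacting with the environment:

```python
import numpy as np

def q_learning_from_log(dataset, n_states, n_actions,
                        alpha=0.1, gamma=0.99, sweeps=50):
    """Tabular Q-learning on a fixed dataset of (s, a, r, s2) tuples
    collected by some unknown behaviour policy (e.g. historical records).
    No new interaction with the environment is needed."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s2 in dataset:
            target = r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Note that the logged behaviour policy still has to cover the state-action pairs you care about (the "properties" mentioned above); pairs that never appear in the log are never updated.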
Does this mean that off-policy methods are better? Not necessarily. We can say that off-policy methods are more flexible in the type of problems they can face. However, from a theoretical point of view, the two families have different properties, and sometimes those of the on-policy methods are convenient. For instance, if we compare Q-learning with SARSA, a key difference between them is the max operator used in the Q-learning update rule. This operator is highly non-linear, which can make it more difficult to combine the algorithm with function approximators.
When is it better to use on-policy methods? Well, if you are facing a problem with a continuous state space and you are interested in using a linear function approximator (an RBF network, for instance), then it is more stable to use on-policy methods. You can find more information on this topic in the section on off-policy bootstrapping in Sutton and Barto's book.
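As a minimal sketch of that setting (the RBF centres, widths and the linear parameterisation here are assumptions for illustration), semi-gradient SARSA with a linear approximator over RBF features looks like this:

```python
import numpy as np

def rbf_features(state, centers, width=0.5):
    """Linear function approximation: one Gaussian RBF per centre."""
    d = np.linalg.norm(centers - state, axis=1)
    return np.exp(-(d / width) ** 2)

def semi_gradient_sarsa_update(w, phi, a, r, phi2, a2, alpha=0.01, gamma=0.99):
    """One semi-gradient SARSA(0) update with a linear approximator:
    Q(s, a) = w[a] . phi(s). The target uses a2, the action actually
    selected in s2 by the current (e.g. epsilon-greedy) policy, which is
    what keeps the update on-policy."""
    td_error = r + gamma * w[a2] @ phi2 - w[a] @ phi
    w[a] += alpha * td_error * phi
    return w
```

Here `w` is an (n_actions, n_features) weight matrix and `phi`, `phi2` are the RBF feature vectors of the current and next state; note there is no max over actions in the target, which is part of why this combination tends to be better behaved with linear features.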