I think I am messing something up.
I always thought that:
- 1-step TD on-policy = Sarsa
- 1-step TD off-policy = Q-learning
Thus I conclude:
- n-step TD on-policy = n-step Sarsa
- n-step TD off-policy = n-step Q-learning
In Sutton's book, however, he never introduces n-step Q-learning, but he does introduce Off-policy n-step Sarsa. Now I feel confused.
Can someone help me with the naming?
Link to Sutton's book (Off-Policy n-step Sarsa on page 149)
The n-step Q-learning algorithm works in a similar manner to DQN, except for the following changes: no replay buffer is used, and instead of sampling random batches of transitions, the network is trained every n steps using the latest n steps played by the agent.
Another fundamental idea is the use of n-step returns rather than the single-step returns of basic Q-learning or SARSA implementations: rather than looking only one step into the future before bootstrapping, you look several steps ahead.
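To make the n-step return concrete, here is a minimal Python sketch (the function and variable names are illustrative, not taken from any particular library) of how a target can be built from the latest n rewards plus a bootstrap value:

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """Compute an n-step return target.

    rewards:         the n most recent rewards [R_{t+1}, ..., R_{t+n}]
    gamma:           discount factor
    bootstrap_value: value used to bootstrap after the last step,
                     e.g. max_a Q(S_{t+n}, a) for a Q-learning-style target
    """
    target = bootstrap_value
    for r in reversed(rewards):   # fold the rewards in, back to front
        target = r + gamma * target
    return target

# Example: 3-step target with gamma = 0.9 and bootstrap value 10.0
# = 1.0 + 0.9*0.0 + 0.9**2 * 2.0 + 0.9**3 * 10.0 = 9.91
print(n_step_target([1.0, 0.0, 2.0], gamma=0.9, bootstrap_value=10.0))
```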
Note that Q-function training in offline RL does not suffer from state distribution shift, as the Bellman backup never queries the Q-function on out-of-distribution states. However, the policy may suffer from state distribution shift at test time.
Q-learning is called off-policy because the policy it learns about (the greedy policy) is different from the behaviour policy that generates the data. In other words, it bootstraps from the estimated value of the greedy action in the next state without requiring the agent to actually follow that greedy policy.
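To illustrate that difference, here is a minimal sketch (assuming a tabular Q stored as a NumPy array; the function names are mine) of the two 1-step targets. Sarsa bootstraps from the next action the behaviour policy actually takes, whereas Q-learning bootstraps from the greedy action regardless of what is done next:

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    # on-policy: bootstrap from the action the behaviour policy actually takes next
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    # off-policy: bootstrap from the greedy action, whatever the agent does next
    return r + gamma * np.max(Q[s_next])

Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]
print(sarsa_target(Q, r=1.0, s_next=1, a_next=0, gamma=0.9))  # 1.0 + 0.9*0.5 = 1.45
print(q_learning_target(Q, r=1.0, s_next=1, gamma=0.9))       # 1.0 + 0.9*2.0 = 2.8
```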
> I always thought that:
> - 1-step TD on-policy = Sarsa
> - 1-step TD off-policy = Q-learning
That's mostly correct, but not the full story. Q-learning is a version of off-policy 1-step temporal-difference learning, but it is more specific than that: it updates Q-values for the policy that is greedy with respect to the current estimates. Off-policy value learning in general can be about learning for any target policy; Q-learning is the special case in which the target policy is the greedy policy.
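One standard way to make "learning about an arbitrary target policy" concrete in the 1-step case is the expected-update form used by Expected Sarsa (Sutton & Barto, Section 6.6). The sketch below is my own illustration, not code from the book; note how a greedy target policy reduces it exactly to the Q-learning target:

```python
import numpy as np

def expected_target(Q, r, s_next, target_policy_probs, gamma):
    """1-step target when learning about an arbitrary target policy pi.

    target_policy_probs: pi(a | s_next) for every action, shape (n_actions,)
    """
    return r + gamma * np.dot(target_policy_probs, Q[s_next])

def greedy_probs(Q, s):
    # greedy target policy: all probability mass on the argmax action
    probs = np.zeros(Q.shape[1])
    probs[np.argmax(Q[s])] = 1.0
    return probs

Q = np.array([[0.0, 0.0], [0.5, 2.0]])
# With a greedy target policy this equals the Q-learning target r + gamma * max_a Q(s', a):
print(expected_target(Q, r=1.0, s_next=1,
                      target_policy_probs=greedy_probs(Q, 1), gamma=0.9))  # 2.8
print(1.0 + 0.9 * np.max(Q[1]))                                            # 2.8
```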
A naive extension of Q-learning to n steps would no longer be correct, because the intermediate actions in the n-step trajectory are selected by the behaviour policy rather than by the (greedy) target policy. You'd have to correct for that "off-policyness" in some way; one way to do so is importance sampling. When you introduce that correction in a general manner (for any possible target policy), you get the algorithm on the page you mentioned, which the book refers to as Off-policy n-step Sarsa. I suppose that a specific instance of this algorithm, with the target policy π being the greedy policy with respect to Q, could intuitively be understood as a "correct" version of n-step Q-learning.
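For concreteness, here is a rough Python sketch of such an importance-sampling-corrected n-step update, following the structure of the book's Section 7.3; the function and variable names are mine, and the exact index range of the importance-sampling ratio should be checked against the book's pseudocode:

```python
import numpy as np

def off_policy_n_step_sarsa_update(Q, states, actions, rewards,
                                   target_probs, behaviour_probs,
                                   gamma, alpha):
    """One importance-sampling-corrected n-step Sarsa update of Q(S_t, A_t).

    states, actions: S_t..S_{t+n} and A_t..A_{t+n} (length n+1 each)
    rewards:         R_{t+1}..R_{t+n} (length n)
    target_probs, behaviour_probs: pi(A_k|S_k) and b(A_k|S_k) for the actions
                     A_{t+1}..A_{t+n} that enter the importance-sampling ratio
    """
    n = len(rewards)

    # importance-sampling ratio over the actions taken after A_t
    rho = float(np.prod(np.asarray(target_probs) / np.asarray(behaviour_probs)))

    # n-step return, bootstrapping from Q(S_{t+n}, A_{t+n})
    G = Q[states[n], actions[n]]
    for r in reversed(rewards):
        G = r + gamma * G

    s, a = states[0], actions[0]
    Q[s, a] += alpha * rho * (G - Q[s, a])
    return Q

# Toy usage: a 2-step update on a 4-state, 2-action table, with a deterministic
# target policy and an epsilon-soft behaviour policy
Q = np.zeros((4, 2))
off_policy_n_step_sarsa_update(Q, states=[0, 1, 2], actions=[1, 0, 1],
                               rewards=[1.0, 0.5],
                               target_probs=[1.0, 1.0], behaviour_probs=[0.5, 0.5],
                               gamma=0.9, alpha=0.1)
```

Setting the target policy to the greedy policy with respect to Q (so each target_probs entry is 1 for greedy actions and 0 otherwise) gives the "corrected n-step Q-learning" instance described above.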