 

Are Q-learning and SARSA with greedy selection equivalent?

The difference between Q-learning and SARSA is that Q-learning compares the current state and the best possible next state, whereas SARSA compares the current state against the actual next state.

If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?

asked Sep 29 '15 by Mouscellaneous


People also ask

Is there a difference between SARSA and Q-learning if greedy action selection is used?

Can you spot the difference? In Q-learning, we take an action using an epsilon-greedy policy, but when updating the Q-value we simply take the maximum Q-value over the next state's actions. In SARSA, we take the action using the epsilon-greedy policy and, when updating the Q-value, we also pick the next action using the epsilon-greedy policy.
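That epsilon-greedy selection step can be sketched in a few lines of Python (a minimal illustration with a tabular Q stored as a dict keyed by (state, action) pairs; all names here are illustrative, not from the original post):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

With epsilon=0 this degenerates into the pure greedy selection the question asks about.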

What is the difference between Q-learning and SARSA?

SARSA vs Q-learning: the difference between these two algorithms is that SARSA chooses its next action by following the same current policy and updates its Q-values with that action, whereas Q-learning updates with the greedy action, that is, the action that gives the maximum Q-value in the next state, so its update follows an optimal policy.
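Side by side, the two update rules differ only in the target term. A minimal sketch, assuming a tabular Q as a dict and hypothetical parameter values (alpha, gamma chosen only for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Off-policy target: the maximum Q-value in the next state,
    # regardless of which action the agent actually takes next.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    # On-policy target: the Q-value of the action a_next that the
    # behaviour policy actually selected in the next state.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

If the behaviour policy happens to pick the argmax action as a_next, the two targets coincide.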

Is Q-learning a greedy algorithm?

Q-learning is an off-policy algorithm. It estimates the reward for state-action pairs based on the optimal (greedy) policy, independent of the agent's actions. An off-policy algorithm approximates the optimal action-value function, independent of the policy.

Why is SARSA faster than Q-learning?

SARSA learns the safe path along the top row of the grid because it takes its action-selection method (including exploration) into account when learning. Because SARSA learns the safe path, it actually receives a higher average reward per trial than Q-learning, even though it does not walk the optimal path.


1 Answer

If an optimal policy has already been found, SARSA with pure greedy selection and Q-learning are the same.

However, during training we only have a policy or a sub-optimal policy. SARSA with pure greedy selection will only converge to the "best" sub-optimal policy available without trying to explore the optimal one, while Q-learning will, because of the max operator in its update, Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)], which means it considers all actions available in the next state and takes the maximum one.
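The equivalence in the greedy case can be checked directly: when the next action is chosen by pure greedy selection (epsilon = 0), SARSA's target term Q(s', a') equals Q-learning's max over actions. A small self-contained check (toy Q-values chosen only for illustration):

```python
def greedy(Q, state, actions):
    """Pure greedy selection: always the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

# With epsilon = 0, SARSA's next action is always the argmax, so
# Q[(s_next, a_next)] equals max_b Q[(s_next, b)] and both targets coincide.
Q = {(1, 'left'): 0.3, (1, 'right'): 0.7}
a_next = greedy(Q, 1, ['left', 'right'])
sarsa_target_term = Q[(1, a_next)]
q_learning_target_term = max(Q[(1, b)] for b in ['left', 'right'])
assert sarsa_target_term == q_learning_target_term
```

The answer's caveat still applies: the updates coincide step by step, but without exploration neither trajectory may ever visit the states needed to discover a better policy.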

answered Sep 18 '22 by Alan Yang