The difference between Q-learning and SARSA is that Q-learning's update bootstraps from the best possible action in the next state, whereas SARSA's update bootstraps from the action actually taken in the next state.
If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?
Can you spot the difference? In Q-learning, we take the action using an epsilon-greedy policy, but when updating the Q-value we simply use the maximum-valued action in the next state. In SARSA, we take the action using the epsilon-greedy policy and, when updating the Q-value, we also use the action selected by that same epsilon-greedy policy.
SARSA vs Q-learning: the difference between these two algorithms is that SARSA updates its Q-values using the action chosen by the same current (behaviour) policy, whereas Q-learning updates using the greedy action, i.e. the action that gives the maximum Q-value in the next state, so its update target follows the current estimate of the optimal policy.
Q-learning is an off-policy algorithm. It estimates the return for state-action pairs under the optimal (greedy) policy, independently of the actions the agent actually takes. An off-policy algorithm approximates the optimal action-value function independently of the policy being followed.
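To make the contrast concrete, here is a minimal tabular sketch of the two update rules. The Q-table size, learning rate, discount, and exploration rate below are illustrative assumptions, not values from the original text:

```python
import numpy as np

# Hypothetical tabular setup (all sizes and hyperparameters are assumptions).
n_states, n_actions = 48, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Behaviour policy used by BOTH algorithms to pick the action to execute."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))     # explore
    return int(np.argmax(Q[state]))             # exploit

def q_learning_update(s, a, r, s_next):
    # Off-policy target: bootstrap from the BEST action in the next state,
    # regardless of which action the behaviour policy will actually take.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy target: bootstrap from the action the epsilon-greedy policy
    # actually selected for the next step.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```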
In the cliff-walking gridworld, SARSA learns the safe path along the top row of the grid because it takes the action-selection method into account when learning. Because it learns the safe path, SARSA actually receives a higher average reward per trial than Q-learning during training, even though it does not walk the optimal path.
If an optimal policy has already been formed, SARSA with pure greedy action selection and Q-learning are the same.
However, during training we only have a partial or sub-optimal policy, so SARSA with pure greedy action selection will only converge to the "best" sub-optimal policy it has found, without trying to explore the optimal one, while Q-learning can still improve because of the max operator in its update, which evaluates all available actions in the next state and bootstraps from the best one.
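As a quick illustration of the pure-greedy case, the following self-contained sketch uses an arbitrary, made-up Q-table to check that when the next action is chosen greedily (epsilon = 0), SARSA's bootstrap term Q[s', a'] equals Q-learning's max over Q[s', a], so the two updates coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((48, 4))    # arbitrary Q-table, purely for illustration

def greedy(state):
    # Pure greedy selection, i.e. epsilon-greedy with epsilon = 0.
    return int(np.argmax(Q[state]))

s_next = 7
a_next = greedy(s_next)
# SARSA's bootstrap term Q[s_next, a_next] equals Q-learning's max_a Q[s_next, a],
# so with a purely greedy behaviour policy the two update targets are identical.
assert Q[s_next, a_next] == np.max(Q[s_next])
```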