My question is about using the SARSA algorithm in reinforcement learning for an undiscounted, continuing (non-episodic) problem: can SARSA be used for such a problem at all?
I have been studying the textbook by Sutton and Barto, and they show how to modify the Q-learning algorithm so that it can be used for undiscounted problems. They refer to this new algorithm as R-learning (Section 6.7). Basically, in R-learning, the update rule for Q(s,a) on each iteration is:
Q(s,a) = Q(s,a) + alpha * [r - rho + max_{a'}{Q(s',a')} - Q(s,a)]
Here, rho is updated on each iteration only if a greedy action is chosen at state s. The update rule for rho is:
rho = rho + beta * [r - rho + max_{a'}{Q(s',a')} - max_{a}{Q(s,a)}]
(Here, alpha and beta are step-size parameters, i.e. learning rates.)
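For concreteness, here is a minimal tabular sketch of these two updates as I understand them; the problem size, step-size values, and function name are placeholders of my own, not from the book:

```python
import numpy as np

# A minimal tabular sketch of the R-learning updates above.
# n_states, n_actions, alpha and beta are illustrative placeholders.
n_states, n_actions = 10, 4
alpha, beta = 0.1, 0.01

Q = np.zeros((n_states, n_actions))   # action-value estimates Q(s, a)
rho = 0.0                             # running estimate of the average reward

def r_learning_update(Q, rho, s, a, r, s_next):
    """Apply one R-learning update for the transition (s, a, r, s_next)."""
    greedy = Q[s, a] == Q[s].max()    # was the action chosen in s greedy?
    Q[s, a] += alpha * (r - rho + Q[s_next].max() - Q[s, a])
    if greedy:
        # rho is updated only when a greedy action was taken in s
        rho += beta * (r - rho + Q[s_next].max() - Q[s].max())
    return rho
```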
Now, my question is about SARSA rather than Q-learning. I want to modify the SARSA algorithm so that it is suitable for average-reward (undiscounted) problems, in the same way that Q-learning was modified for average-reward problems (I don't know whether this is even possible). However, in the literature I cannot find an explanation of exactly how SARSA should be modified for an average-reward problem.
Here is my guess for how SARSA should be used in an undiscounted problem. I would guess that the update rule should be:
Q(s,a) = Q(s,a) + alpha * [r - rho + Q(s',a') - Q(s,a)],
where a' is the action actually chosen at the next state s'. This seems fairly obvious. But how should rho be updated? My guess is that, since SARSA is an on-policy algorithm, rho should always be updated on each iteration - regardless of whether or not a greedy action is chosen at s - and the update rule should simply be:
rho = rho + beta * [r - rho + Q(s',a') - Q(s,a)].
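In code, my proposal would look roughly like this (again, this is only my guess, and the sizes and names are placeholders):

```python
import numpy as np

# A minimal sketch of my proposed average-reward SARSA updates.
# n_states, n_actions, alpha and beta are illustrative placeholders.
n_states, n_actions = 10, 4
alpha, beta = 0.1, 0.01

Q = np.zeros((n_states, n_actions))
rho = 0.0

def sarsa_avg_reward_update(Q, rho, s, a, r, s_next, a_next):
    """One update for (s, a, r, s', a'), where a_next is the action actually
    chosen at s_next by the behaviour policy."""
    delta = r - rho + Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    rho += beta * delta   # my guess: update rho on every step (on-policy)
    return rho
```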
Could somebody tell me if this is correct? Or should rho still be updated based on the optimal actions at the states s and s'?
First of all, an undiscounted, non-episodic task is in general an ill-posed problem, because the expected cumulative reward diverges (unless the rewards have some property that makes them diminish over time).
EDIT: I'm sorry - I've looked up the referenced chapter in the book and noticed that R-learning is indeed a method designed to tackle undiscounted, non-episodic tasks.
To the point: I think the idea behind updating rho in this manner is to estimate the average reward of the current policy. Therefore I'm guessing that, even though SARSA is an on-policy method, you should update rho only if a greedy action has been chosen. That's because, if you want to accurately estimate the average reward of the current policy, you should take into account only events that would occur while following that policy. Events that occur as a result of exploration steps do not represent an accurate sample of what the policy is worth. This is, of course, only an intuitive argument - I don't have any experience with R-learning, nor have I analysed this issue formally.
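To make that concrete, here is a rough, untested sketch of what I mean - your SARSA-style Q update, but with rho refreshed only on greedy steps (the sizes and names are my own placeholders):

```python
import numpy as np

# A rough sketch: SARSA-style Q update, rho updated only on greedy steps.
# n_states, n_actions, alpha and beta are illustrative placeholders.
n_states, n_actions = 10, 4
alpha, beta = 0.1, 0.01

Q = np.zeros((n_states, n_actions))
rho = 0.0

def sarsa_r_update(Q, rho, s, a, r, s_next, a_next):
    """SARSA-style update in which rho only tracks non-exploratory steps."""
    greedy = Q[s, a] == Q[s].max()    # did the behaviour policy act greedily in s?
    delta = r - rho + Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    if greedy:
        # exploratory steps are excluded, so rho estimates what the current
        # policy is worth rather than what exploration happens to yield
        rho += beta * delta
    return rho
```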