I'm looking at this SARSA-Lambda implementation (i.e., SARSA with eligibility traces) and there's a detail which I still don't get.
(Image from http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node77.html)
So I understand that all Q(s,a) are updated rather than only the one the agent has chosen for the given time-step. I also understand the E matrix is not reset at the start of each episode.
Let's assume for a minute that panel 3 of Figure 7.12 was the end-state of episode 1.
At the start of episode 2, the agent moves north instead of east, and let's assume this gives it a reward of -500. Wouldn't this also affect all states that were visited in the previous episode?
If the idea is to reward those states which have been visited in the current episode, then why isn't the matrix containing all e(s,a) values reset at the beginning of each episode? With this implementation, it seems that states visited in the previous episode are 'punished' or 'rewarded' for actions the agent takes in the new episode.
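To make the concern concrete, here is a minimal sketch of a single update at the first step of episode 2. The state names, the leftover trace values, the step size, and the assumption that every Q value starts at 0 (so the TD error equals the reward of -500) are all made up for illustration; only the update rule Q(s,a) += alpha * delta * e(s,a) comes from the pseudocode.

```python
# Toy illustration only: state names, trace values and alpha are assumptions.
alpha = 0.1  # step size (assumed)

# Traces left over from the end of episode 1 (hypothetical values).
stale_traces = {("s1", "east"): 0.5, ("s2", "east"): 0.7, ("s3", "east"): 1.0}

# All Q values start at 0 in this toy example.
q = {key: 0.0 for key in stale_traces}
q[("s0", "north")] = 0.0


def apply_td_update(q, traces, delta, alpha):
    """Apply Q(s,a) += alpha * delta * e(s,a) to every pair that has a trace."""
    for key, e in traces.items():
        q[key] += alpha * delta * e
    return q


# First step of episode 2: the agent moves north from s0 and receives -500.
# With all Q values at 0, the TD error is just the reward.
delta = -500.0

# Case 1: traces are NOT reset, so the old pairs are still eligible.
traces_no_reset = dict(stale_traces)
traces_no_reset[("s0", "north")] = 1.0
q_no_reset = apply_td_update(dict(q), traces_no_reset, delta, alpha)

# Case 2: traces are reset, so only the pair just visited is eligible.
traces_reset = {("s0", "north"): 1.0}
q_reset = apply_td_update(dict(q), traces_reset, delta, alpha)

print("no reset:", q_no_reset)
print("reset:   ", q_reset)
```

In the first case the pairs from episode 1 are pushed down by the -500 even though they had nothing to do with the move north; in the second case only ("s0", "north") changes.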
I agree with you 100%. Failing to reset the e-matrix at the start of every episode has exactly the problems that you describe. As far as I can tell, this is an error in the pseudocode. The reference that you cite is very popular, so the error has been propagated to many other references. However, this well-cited paper very clearly states that the e-matrix should be reinitialized between episodes:
The eligibility traces are initialized to zero, and in episodic tasks they are reinitialized to zero after every episode.
As further evidence, the methods of this paper:
The trace, e, is set to 0 at the beginning of each episode.
and footnote #3 from this paper:
...eligibility traces were reset to zero at the start of each trial.
suggest that this is common practice, as both refer to reinitialization between episodes. I expect that there are many more such examples.
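For what it's worth, here is a rough sketch of how the reset fits into tabular SARSA(lambda) with accumulating traces. The corridor environment, the `step` and `epsilon_greedy` helpers, and all hyperparameter values are my own assumptions for illustration, not something taken from the book or the papers above; the only point is where the trace table `e` gets zeroed.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)  # move left / move right


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the right end, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def sarsa_lambda(episodes=200, alpha=0.1, gamma=0.9, lam=0.8,
                 epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        # The fix under discussion: zero all traces at the start of each episode.
        e = {key: 0.0 for key in q}
        state = 0
        action = epsilon_greedy(q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(q, next_state, epsilon, rng)
            # No bootstrapping from the terminal state.
            target = reward if done else reward + gamma * q[(next_state, next_action)]
            delta = target - q[(state, action)]
            e[(state, action)] += 1.0  # accumulating trace
            for key in q:
                q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            state, action = next_state, next_action
    return q


q = sarsa_lambda()
print(q[(3, +1)], q[(3, -1)])
```

If the line that rebuilds `e` were moved outside the episode loop, you would get the behaviour of the pseudocode as written, where the first updates of a new episode also adjust pairs visited at the end of the previous one.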
In practice, many uses of this algorithm don't involve multiple episodes, or have such long episodes relative to their decay rates that this doesn't end up being a problem. I expect that is why it hasn't been clarified more explicitly elsewhere on the internet yet.
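As a back-of-the-envelope check on the "long episodes relative to decay rates" point: each step multiplies every trace by gamma * lambda, so a trace of 1.0 is worth (gamma * lambda)^t after t steps. The sketch below estimates how many steps it takes to fall to a threshold; the function name, the 0.01 threshold, and the example values gamma = 0.9 and lambda = 0.8 are arbitrary choices of mine.

```python
import math


def steps_until_negligible(gamma, lam, threshold=0.01):
    """Steps for a trace of 1.0 to decay to `threshold` or less under e <- gamma*lam*e."""
    decay = gamma * lam
    if not 0.0 < decay < 1.0:
        raise ValueError("expected 0 < gamma * lam < 1")
    return math.ceil(math.log(threshold) / math.log(decay))


steps = steps_until_negligible(gamma=0.9, lam=0.8)
print(steps)
```

If episodes are much longer than this number of steps, whatever was left over from the previous episode has mostly decayed before it can do much damage, which is presumably why the missing reset often goes unnoticed.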