In Sutton & Barto's RL book (link), in the Watkins's Q(λ) learning algorithm presented in Figure 7.14: line 10 reads "For all s, a:", so the (s, a) there ranges over all state-action pairs, while the (s, a) in lines 8 and 9 refers to the current pair. Is that right?
In lines 12 and 13, whenever a' != a*, line 13 is executed and every e(s, a) is set to 0. So what is the point of the eligibility trace if all the traces get reset to 0, given that a' != a* will happen very often? Even if a' != a* happened only rarely, once it does the eligibility trace loses its meaning entirely, and Q would never be updated again, since all e(s, a) = 0; and with replacing traces, e(s, a) would still be 0 in every subsequent update.
So, is this an error here?
Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.
The 'Q' in Q-learning stands for quality. Quality here represents how useful a given action is in gaining some future reward.
Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the agent's current state. Depending on where the agent is in the environment, it decides the next action to take.
- α, the learning rate, set between 0 and 1. Setting it to 0 means the Q-values are never updated, so nothing is learned; a high value such as 0.9 means learning can occur quickly (see the sketch below).
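As a rough sketch of where the learning rate enters the one-step Q-learning update (the table `Q` and the names `alpha`/`gamma` here are just illustrative, not the book's notation):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning update on a (n_states, n_actions) table Q."""
    td_target = r + gamma * np.max(Q[s_next])   # best value reachable from s'
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error                 # alpha = 0: no learning; alpha near 1: fast updates
    return Q
```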
The idea of eligibility traces is to give credit or blame only to the eligible state-action pairs. The book by Sutton & Barto has a nice illustration of the idea in its figure on the backward view of eligibility traces.
In Watkins's Q(λ) you want to give credit/blame only to the state-action pairs you would actually have visited had you followed the policy derived from Q deterministically (always choosing the best action).
So the answer to your question is in line 6:
Choose a' from s' using policy derived from Q (e.g. epsilon-greedy)
Because a' is chosen epsilon-greedily, there is a small chance (with probability epsilon) that you take an exploratory random step instead of a greedy step. In that case the whole eligibility trace is set to zero, because it makes no sense to keep giving credit/blame to the state-action pairs visited before: the pairs visited before the random exploratory step deserve no credit/blame for future rewards, hence you delete the whole eligibility trace. In the time steps afterwards you begin to build up a new eligibility trace.
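To make this concrete, here is a minimal Python sketch of one episode of tabular Watkins's Q(λ), with comments keyed to the line numbers of Figure 7.14. The environment interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) and the hyperparameter names are assumptions for the sketch, not the book's notation:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def watkins_q_lambda_episode(Q, env, alpha=0.1, gamma=0.99, lam=0.9,
                             epsilon=0.1, seed=0):
    """One episode of tabular Watkins's Q(lambda) with accumulating traces.

    Q is a (n_states, n_actions) float array; env is an assumed interface
    with reset() -> s and step(a) -> (s', r, done).
    """
    rng = np.random.default_rng(seed)
    e = np.zeros_like(Q)                                     # traces start at zero
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)                        # line 5
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)     # line 6: behaviour action
        a_star = int(np.argmax(Q[s_next]))                   # line 7: greedy action
        if Q[s_next, a_next] == Q[s_next, a_star]:
            a_star = a_next                                  # ties count as greedy
        bootstrap = 0.0 if done else gamma * Q[s_next, a_star]
        delta = r + bootstrap - Q[s, a]                      # line 8
        e[s, a] += 1.0                                       # line 9 (use e[s, a] = 1.0 for replacing traces)
        Q += alpha * delta * e                               # line 11: update all (s, a) at once
        if a_next == a_star:
            e *= gamma * lam                                 # line 12: decay the whole trace
        else:
            e[:] = 0.0                                       # line 13: exploratory step, cut the trace
        s, a = s_next, a_next                                # line 14
    return Q
```

Note that zeroing the trace on an exploratory step does not stop learning: at the very next step, line 9 bumps the trace of the newly visited pair again (sets it to 1 if you use replacing traces), so Q keeps being updated with a freshly started trace.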
Hope that helped.