
How to understand Watkins's Q(λ) learning algorithm in Sutton & Barto's RL book?

In Sutton & Barto's RL book (link), Watkins's Q(λ) learning algorithm is presented in Figure 7.14. In line 10, "For all s, a:", the (s, a) ranges over all state-action pairs, whereas the (s, a) in lines 8 and 9 refers to the current pair. Is that right?

In lines 12 and 13, whenever a' != a*, line 13 is executed and every e(s, a) is set to 0. What is the point of an eligibility trace if all the traces get set to 0? The case a' != a* will happen very often. And even if it didn't happen often, once it does, the eligibility trace loses its meaning entirely: all e(s, a) = 0, and with replacing traces e(s, a) stays 0 on every subsequent update, so Q would never be updated again.

So, is this an error here?
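For readers without the book at hand, the relevant part of the figure reads roughly like this (paraphrased, not an exact transcription):

    Choose a' from s' using policy derived from Q (e.g. epsilon-greedy)
    a* <- argmax_b Q(s', b)   (if a' ties for the max, then a* <- a')
    delta <- r + gamma * Q(s', a*) - Q(s, a)                    (line 8)
    e(s, a) <- e(s, a) + 1                                      (line 9)
    For all s, a:                                               (line 10)
        Q(s, a) <- Q(s, a) + alpha * delta * e(s, a)
        If a' = a*, then e(s, a) <- gamma * lambda * e(s, a)    (line 12)
        else e(s, a) <- 0                                       (line 13)
    s <- s'; a <- a'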

asked Nov 29 '16 by user186199

People also ask

What is Q-learning explain with example?

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.
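For a concrete picture, here is a minimal sketch of the tabular Q-learning update in Python; the table sizes and hyperparameters are arbitrary placeholders, not from any particular source:

    import numpy as np

    n_states, n_actions = 5, 2           # toy sizes, just for illustration
    Q = np.zeros((n_states, n_actions))  # action-value table
    alpha, gamma = 0.1, 0.9              # learning rate and discount factor

    def q_update(s, a, r, s_next):
        # One Q-learning step: move Q(s, a) toward the TD target
        # r + gamma * max_a' Q(s', a').
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])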

What does Q represent in Q-learning?

The 'q' in q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.

How does Q-learning work?

Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the agent's current state. Depending on where the agent is in the environment, it decides the next action to take.
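Since the "policy derived from Q" mentioned in the question is typically ε-greedy, a small sketch of that action choice (again with made-up names) might look like:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1):
        # With probability epsilon take a random exploratory action,
        # otherwise take the greedy action argmax_a Q(s, a).
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[s]))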

What is a good learning rate for Q-learning?

The learning rate α is set between 0 and 1. Setting it to 0 means the Q-values are never updated, so nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.
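For example, if Q(s, a) = 0 and the TD target r + γ·max Q(s', ·) is 1.0, one update with α = 0.1 gives Q(s, a) = 0 + 0.1·(1.0 − 0) = 0.1, with α = 0.9 it jumps to 0.9 in a single step, and with α = 0 it stays at 0.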


1 Answer

The idea of eligibility traces is to give credit or blame only to the eligible state-action pairs. Sutton & Barto's book has a nice illustration of the idea: the backward view of eligibility traces.

In Watkins's Q(λ) you want to give credit or blame to the state-action pairs you would actually have visited had you followed the policy derived from Q deterministically (always choosing the best action).

So the answer to your question is in line 5:

Choose a' from s' using policy derived from Q (e.g. epsilon-greedy)

Because a' is chosen ε-greedily, there is a small chance (with probability ε) that you take an exploratory random step instead of a greedy one. In that case the whole eligibility trace is set to zero, because it makes no sense to give credit or blame to state-action pairs visited before: they deserve no credit or blame for the rewards that come after the random exploratory step, so the whole trace is deleted. In the time steps afterwards you start building up a new eligibility trace...
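Here is a rough Python sketch of that inner loop, assuming a hypothetical gym-style environment whose step() returns (next state, reward, done); the sizes and hyperparameters are made up for illustration, so treat this as a sketch of the trace handling rather than a reference implementation:

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, lam, epsilon = 0.1, 0.9, 0.8, 0.1

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)   # exploratory step
        return int(np.argmax(Q[s]))               # greedy step

    def run_episode(env):
        global Q
        e = np.zeros_like(Q)                      # eligibility traces
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)       # behaviour action a'
            a_star = int(np.argmax(Q[s_next]))    # greedy action a*
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                   # ties count as greedy
            delta = r + gamma * Q[s_next, a_star] - Q[s, a]
            e[s, a] += 1.0                        # accumulate trace for the current pair
            Q += alpha * delta * e                # update every eligible pair at once
            if a_next == a_star:
                e *= gamma * lam                  # greedy step: decay the traces
            else:
                e[:] = 0.0                        # exploratory step: cut the traces
            s, a = s_next, a_next

Note how the only special case is that cut at the end: after an exploratory action the traces restart from zero and are rebuilt on the following steps, exactly as described above.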

Hope that helped.

answered Oct 22 '22 by tom1139