I'm in the process of developing a simple Q-learning implementation for a trivial application, but there's something that keeps puzzling me.
Let's consider the standard formulation of Q-learning:

Q(S, A) <-- Q(S, A) + alpha * [R + max Q(S', A') - Q(S, A)]
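In code, a minimal dict-based sketch of that update might look like the following (the actions_in argument is a placeholder for whatever returns a state's legal actions, and the discount factor is left out to match the formula above):

def q_update(Q, s, a, r, s_next, actions_in, alpha=0.1):
    # One tabular Q-learning step: move Q(s, a) towards r + max_a' Q(s_next, a').
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions_in(s_next))
    Q[(s, a)] = q_sa + alpha * (r + best_next - q_sa)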
Let's assume there's a state K with two possible actions, A and A', which award our agent rewards R and R' respectively.
If we follow an almost-totally-greedy approach (say, epsilon = 0.1), at first I'll randomly choose one of the actions, for instance A. The next time, I'll probably (90% of the time) choose A again, and that causes Q(K, A) to keep growing. Even if by chance I try A', since its reward is probably of the same magnitude as that of A, we end up in a situation where it's practically impossible to "recover" from our first guess for the rest of the learning.
I guess this must not be so, otherwise the agent would basically not learn -- it would just follow a simple recipe: do everything the way you did it the first time.
Am I missing something? I know I can tweak the alpha value (generally, decreasing it over time), but that in no way improves our situation.
From the convergence theory of Q-learning, we know:

The convergence of Q-learning holds using any exploration policy, and only requires that each state-action pair (s, a) is executed infinitely often.

The epsilon-greedy policy is a balance between exploration and exploitation, which both guarantees convergence and often gives good performance. But in practical problems, we often need some heuristics to change the learning rate alpha to get a better return; otherwise, the "infinitely often" requirement is hard to meet.
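For concreteness, the epsilon-greedy selection rule can be sketched like this (a toy sketch; the function and argument names are illustrative):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore uniformly at random,
    # otherwise exploit the currently greedy action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))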
I list an example below. This is a classical problem in which you have a grid with a possibly different reward amount in each cell. For instance, a 4x4 grid is shown below, in which every cell contains a reward of 1, except the top-left cell, which has a bigger reward of 10. A robot moves around the grid. The legal actions are LEFT, RIGHT, UP and DOWN, but the robot cannot move out of the grid.
So our state space contains 16 distinct states, corresponding to the 16 cells. Each state has a different number of legal actions because of the border constraint. Our goal is to compute the optimal policy (given any state s, output an optimal action a). A minimal sketch of this environment is given after the grid.
+++++++++++++++++++++
+ 10 + 1 + 1 + 1 +
+++++++++++++++++++++
+ 1 + 1 + 1 + 1 +
+++++++++++++++++++++
+ 1 + 1 + 1 + 1 +
+++++++++++++++++++++
+ 1 + 1 + 1 + 1 +
+++++++++++++++++++++
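Here is that minimal sketch of the 4x4 environment, just for concreteness (the reward placement follows the description above; the names and the convention that the reward belongs to the cell you move into are assumptions):

N = 4
REWARD = {(r, c): 10 if (r, c) == (0, 0) else 1 for r in range(N) for c in range(N)}
MOVES = {'LEFT': (0, -1), 'RIGHT': (0, 1), 'UP': (-1, 0), 'DOWN': (1, 0)}

def legal_actions(state):
    # Only moves that keep the robot inside the grid are legal.
    r, c = state
    return [a for a, (dr, dc) in MOVES.items() if 0 <= r + dr < N and 0 <= c + dc < N]

def step(state, action):
    # The reward is assumed to be that of the cell the robot moves into.
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt, REWARD[nxt]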
Suppose we use an epsilon-greedy policy with epsilon = 0.1 and a constant learning rate alpha = 0.1. We start at a random position on the grid. Whenever we reach the top-left corner, we restart at a random position again.
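Putting the pieces above together, the simulation loop could be sketched roughly as follows (it reuses the q_update, epsilon_greedy, legal_actions and step helpers sketched earlier; those names are my own, not the original code):

import random

Q = {}
state = (random.randrange(N), random.randrange(N))
for t in range(200_000):
    action = epsilon_greedy(Q, state, legal_actions(state), epsilon=0.1)
    nxt, reward = step(state, action)
    q_update(Q, state, action, reward, nxt, legal_actions, alpha=0.1)
    # Restart from a random position once the big-reward corner is reached.
    state = (random.randrange(N), random.randrange(N)) if nxt == (0, 0) else nxt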
Below is the result of running a simulation of 200,000 moves. The left block shows visually the current greedy policy at each cell:

-->  moving right
<--  moving left
^    moving up
v    moving down

So you see this is far from an optimal policy. Obviously, in an optimal policy every cell should point either left or up, because we have a significantly bigger reward at position (0, 0).
v v v v | 2 9 5 4
v v v v | 14 98 75 14
--> v v <-- | 258 3430 3312 245
--> --> <-- <-- | 3270 93143 92978 3191
The right block shows how many times we have visited each cell so far. You see that we spend most of the visits at the bottom but visit the top row very rarely. This is why we have not reached the optimal policy yet.
If we change the learning rate to alpha = 1 / (number of times (s, a) has been visited so far), we are able to reach the optimal policy (shown below) within 20,000 steps. Also, the visits to each cell are distributed more uniformly, though not perfectly. (A sketch of this count-based learning rate is given after the table.)
--> <-- <-- <-- | 34 7997 7697 294
^ ^ ^ <-- | 731 898 524 132
^ ^ ^ ^ | 709 176 88 94
^ ^ ^ ^ | 245 256 96 77
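That count-based learning rate can be sketched like this (the visit counter and function name are illustrative):

from collections import defaultdict

N_sa = defaultdict(int)  # visit counts per (state, action) pair

def q_update_counted(Q, s, a, r, s_next, actions_in):
    N_sa[(s, a)] += 1
    alpha = 1.0 / N_sa[(s, a)]  # the learning rate decays with the number of visits
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions_in(s_next))
    Q[(s, a)] = q_sa + alpha * (r + best_next - q_sa)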
For a larger problem with more states, e.g., a 10x10 grid, I find it's better to use a larger epsilon. For example, below is the result of a simulation after 80,000 moves on a 10x10 grid with epsilon = 0.5. It's almost optimal, except for the bottom-right corner. There is also the idea of using simulated annealing to help improve the convergence rate of Q-learning (a sketch of one such annealed-exploration scheme is given after the table).
v <-- <-- <-- <-- <-- <-- <-- <-- <-- | 19 2500 1464 716 386 274 216 159 121 71
^ <-- <-- <-- <-- v <-- <-- <-- <-- | 9617 11914 3665 1071 580 410 319 225 207 131
^ ^ ^ <-- <-- <-- <-- v <-- <-- | 5355 5716 2662 1675 1465 611 302 183 162 101
^ ^ ^ ^ ^ <-- <-- <-- <-- <-- | 1604 1887 1192 621 1056 882 693 403 206 100
^ ^ ^ ^ ^ ^ ^ <-- <-- <-- | 639 735 731 333 412 399 480 294 172 114
^ ^ ^ <-- ^ ^ ^ <-- <-- ^ | 373 496 640 454 272 266 415 219 107 98
^ ^ ^ ^ ^ ^ ^ ^ <-- ^ | 251 311 402 428 214 161 343 176 114 99
^ ^ ^ ^ <-- --> ^ <-- <-- <-- | 186 185 271 420 365 209 359 200 113 70
^ ^ ^ ^ ^ ^ ^ ^ v v | 129 204 324 426 434 282 235 131 99 74
^ ^ ^ ^ ^ <-- ^ <-- <-- <-- | 100 356 1020 1233 703 396 301 216 152 78
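One common way to apply the simulated-annealing idea is Boltzmann (softmax) exploration with a temperature that is annealed over time; a hedged sketch follows (the schedule and the temperature values are illustrative, not taken from the experiments above):

import math
import random

def softmax_action(Q, s, actions, temperature):
    # Boltzmann exploration: high temperature ~ nearly random, low temperature ~ nearly greedy.
    qs = [Q.get((s, a), 0.0) for a in actions]
    m = max(qs)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp((q - m) / temperature) for q in qs]
    return random.choices(actions, weights=weights)[0]

def temperature_at(t, t_start=5.0, t_end=0.1, decay=1e-4):
    # Exponential annealing from t_start down towards t_end.
    return max(t_end, t_start * math.exp(-decay * t))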
BTW, my Python code (~100 lines) for the toy problem is here.
Q(K, A) does not just keep growing infinitely, due to the minus Q(S, A) term. This is more apparent if you rewrite the update rule as:

Q(S, A) <-- Q(S, A) * (1 - alpha) + alpha * (R + max Q(S', A'))

This shows that Q(K, A) slowly moves towards its "actual" value of R + max Q(S', A'). It only grows to approach that value, not infinitely. When it stops growing (i.e., has approximated its actual value), the Q(K, A') values for the other actions can catch up.
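A quick numeric check of that point, with the target R + max Q(S', A') frozen at an arbitrary value (a toy sketch, purely to show the fixed point):

alpha, target = 0.1, 5.0  # pretend R + max Q(S', A') stays fixed at 5.0
q = 0.0
for _ in range(100):
    q = (1 - alpha) * q + alpha * target
print(q)  # ~4.9999: Q approaches the target, it does not grow without bound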
Anyway, the whole point of epsilon is to control whether you want the learning process to be more greedy (exploitative) or more explorative (random), so increase it if the learning process is too narrow.
Also note that one of the formal conditions for Q-learning convergence is that each (S, A) pair is visited an infinite number of times (paraphrased)! So yes, by the end of the training process, you want each pair to have been visited a decent number of times.
Good luck!