Let's assume we're in a room where our agent can move along the x and y axes. At each point he can move up, down, right and left. So our state space can be defined by (x, y) and our actions at each point are given by (up, down, right, left). Let's assume that whenever our agent takes an action that makes him hit a wall, we give him a negative reward of -1 and put him back in the state he was in before. If he finds the puppet in the center of the room, he wins a reward of +10.
When we update the QValue for a given state/action pair, we look at the actions that can be taken in the new state and compute the maximum QValue that can be obtained there, so that we can update Q(s, a) for our current state/action. What this means is that if we have a goal state at the point (10, 10), the states around it will have QValues that get smaller and smaller the farther away they are. Now, when it comes to the walls, it seems to me the same is not true.
When the agent hits a wall (let's assume he's in position (0, 0) and takes the action UP), he receives a reward of -1 for that state/action, which therefore gets a QValue of -1.
Now, if I am later in the state (0, 1), and assuming the QValues of all the other actions of state (0, 0) are zero, then the QValue of (0, 1) for the action LEFT will be computed the following way:
Q([0,1], LEFT) = 0 + gamma * (max { 0, 0, 0, -1 } ) = 0 + 0 = 0
That is, having hit the wall doesn't propagate to nearby states, contrary to what happens with positive reward states.
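To make the arithmetic above easy to reproduce, here is a minimal sketch of the two updates in Python. It is only an illustration, not anyone's reference implementation: the helper `q_update`, the learning rate `alpha = 1.0` (which matches the direct assignment used in the calculation above) and `gamma = 0.9` are all assumptions, and the transition from (0, 1) to (0, 0) via LEFT is taken from the example as written.

```python
from collections import defaultdict

ACTIONS = ("UP", "DOWN", "RIGHT", "LEFT")

def q_update(Q, state, action, reward, next_state, alpha=1.0, gamma=0.9):
    """One tabular Q-learning update; returns the new Q(state, action).

    alpha and gamma are assumed values, chosen to mirror the example.
    """
    # Max over the actions available in the next state.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)  # every QValue starts at zero

# Hitting the wall from (0, 0): reward -1, and the agent stays in (0, 0).
wall_q = q_update(Q, (0, 0), "UP", -1.0, (0, 0))

# Moving from (0, 1) into (0, 0): reward 0, and the max over
# {0, 0, 0, -1} in the next state is 0, so the -1 is ignored.
neighbour_q = q_update(Q, (0, 1), "LEFT", 0.0, (0, 0))

print(wall_q, neighbour_q)
```

Under these assumptions the wall action ends up with a QValue of -1 while the neighbouring state's QValue stays at 0, which is exactly the behaviour described in the question.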
To me this seems odd. At first I thought that finding state/action pairs that give negative rewards would be as useful for learning as finding positive rewards, but from the example I have shown above, that doesn't seem to hold true. There seems to be a bias in the algorithm towards taking positive rewards into consideration far more than negative ones.
Is this the expected behavior of QLearning? Shouldn't bad rewards be just as important as positive ones? What are "work-arounds" for this?
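Negative feedback only propagates when it is the only possible outcome from a particular move. Because the update takes the maximum over the actions available in the next state, a negative QValue is passed back only if every action from that state has a negative QValue.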
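A small sketch of that point, assuming the standard Q-learning target r + gamma * max Q(s', a'). The function name `backed_up_value`, the discount `GAMMA = 0.9` and the sample QValues are made up for illustration.

```python
GAMMA = 0.9  # assumed discount factor

def backed_up_value(reward, next_action_values, gamma=GAMMA):
    """Q-learning target: r + gamma * max over the next state's QValues."""
    return reward + gamma * max(next_action_values)

# Next state has one bad action and three neutral ones:
# the max picks a neutral action, so nothing negative is passed back.
one_bad_action = backed_up_value(0.0, [0.0, 0.0, 0.0, -1.0])

# Next state where every action is bad: the max is itself negative,
# so the negative value does reach the previous state.
all_bad_actions = backed_up_value(0.0, [-1.0, -2.0, -1.0, -3.0])

print(one_bad_action, all_bad_actions)
```

In the first case the target is zero. In the second it is negative, because the least bad action is still bad.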
Whether this is deliberate or unintentional I do not know.