 

How to get out of 'sticky' states? [closed]

The problem:

I've trained an agent to perform a simple task in a grid world (go to the top of the grid without hitting obstacles), but the following situation keeps occurring. It finds itself in an easy part of the state space (no obstacles) and so continually gets a strong positive reinforcement signal. Then, when it does find itself in a difficult part of the state space (wedged next to two obstacles), it simply chooses the same action as before, to no effect (it goes up and hits the obstacle). Eventually the Q value for that action matches the negative reward, but by this time the other actions have even lower Q values from being useless in the easy part of the state space, so the error signal drops to zero and the incorrect action is still always chosen.

How can I prevent this from happening? I've thought of a few solutions, but none seem viable:

  • Use a policy that is always exploration heavy. As the obstacles take ~5 actions to get around, a single random action every now and then seems ineffective.
  • Make the reward function such that bad actions are worse when they are repeated. This makes the reward function break the Markov property. Maybe this isn't a bad thing, but I simply don't have a clue.
  • Only reward the agent for completing the task. The task takes over a thousand actions to complete, so the training signal would be way too weak.

Some background on the task:

So I've made a little testbed for trying out RL algorithms -- something like a more complex version of the grid-world described in the Sutton book. The world is a large binary grid (300 by 1000) populated by 1's in the form of randomly sized rectangles on a backdrop of 0's. A band of 1's surrounds the edges of the world.
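For concreteness, here is a minimal sketch of how such a world might be generated. The rectangle count and size range are placeholder guesses, not values from the question:

```python
import numpy as np

def make_world(height=300, width=1000, n_rects=200, max_side=30, seed=0):
    """Binary grid: 0 = free space, 1 = obstacle.
    n_rects and max_side are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    world = np.zeros((height, width), dtype=np.int8)
    for _ in range(n_rects):
        h = rng.integers(1, max_side + 1)
        w = rng.integers(1, max_side + 1)
        r = rng.integers(0, height - h)
        c = rng.integers(0, width - w)
        world[r:r + h, c:c + w] = 1      # randomly sized rectangle of 1's
    world[0, :] = world[-1, :] = 1       # band of 1's around the edges
    world[:, 0] = world[:, -1] = 1
    return world
```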

An agent occupies a single space in this world and observes only a fixed window around it (a 41 by 41 window with the agent at the center). The agent's actions consist of moving by 1 space in any of the four cardinal directions. The agent can only move through spaces marked with a 0; 1's are impassable. A sketch of the observation below.
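Something like the following would extract that window (a 41 by 41 window corresponds to a radius of 20; treating out-of-world cells as obstacles is my assumption):

```python
import numpy as np

def observe(world, pos, radius=20):
    """Return the (2*radius+1) x (2*radius+1) window centred on the agent."""
    r, c = pos
    padded = np.pad(world, radius, constant_values=1)  # pad with obstacles (assumption)
    return padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1]
```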

The current task in this environment is to make it to the top of the grid world, starting from a random position along the bottom. A reward of +1 is given for successfully moving upwards, and a reward of -1 for any move that would hit an obstacle or the edge of the world. All other moves receive a reward of 0.
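A minimal step function with that reward scheme might look like this (treating row 0 as the top of the grid is an assumption about the coordinate convention):

```python
# Actions as (row, column) offsets; row 0 is taken to be the top of the grid.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(world, pos, action):
    """Apply an action and return (new_pos, reward):
    +1 for successfully moving up, -1 for bumping into an obstacle/edge, 0 otherwise."""
    dr, dc = ACTIONS[action]
    r, c = pos
    nr, nc = r + dr, c + dc
    if world[nr, nc] == 1:          # blocked: stay put and take the penalty
        return pos, -1.0            # (the border of 1's keeps indices in bounds)
    reward = 1.0 if dr == -1 else 0.0
    return (nr, nc), reward
```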

The agent uses the basic SARSA algorithm with a neural-net value function approximator (as discussed in the Sutton book). For the policy I've tried both e-greedy and softmax.
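For reference, the core pieces the question describes look roughly like this; the network itself, its update rule, and the learning rate are omitted, and `q_values`/`q_next` are assumed to come from the approximator:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Softmax (Boltzmann) action selection over approximate Q-values."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

def sarsa_target(reward, q_next, gamma=0.99):
    """SARSA bootstraps on the action actually taken next: r + gamma * Q(s', a')."""
    return reward + gamma * q_next
```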

Asked by zergylord

1 Answer

The typical way of teaching such tasks is to give the agent a negative reward on each step and then a big payout on completion. You can compensate for the long delay by using eligibility traces and by placing the agent close to the goal at first, and later close to the area it has already explored.
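A rough sketch of that suggestion, assuming placeholder values for the step cost, completion bonus, and curriculum schedule (eligibility traces are not shown; they would be a trace-decay term added to the SARSA update):

```python
def shaped_reward(reached_top, hit_obstacle,
                  step_cost=-0.01, goal_bonus=100.0, bump_penalty=-1.0):
    """Small negative reward every step, large payout on completion.
    All three magnitudes are illustrative assumptions."""
    if reached_top:
        return goal_bonus
    if hit_obstacle:
        return bump_penalty + step_cost
    return step_cost

def start_row(episode, world_height=300, warmup_episodes=500):
    """Curriculum over the start state: begin just below the top (the goal)
    and gradually move the starting row back toward the bottom."""
    frac = min(1.0, episode / warmup_episodes)
    return int(1 + frac * (world_height - 2))   # row 1 is just below the top border
```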

Answered by Don Reba