 

Exploration Algorithm

Massively edited this question to make it easier to understand.

Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions (N, S, E, W), one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (the yellow guy's) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.

http://imagizer.imageshack.us/a/img913/9274/qRsazT.jpg

What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of the utilities of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length, but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly.)
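To make the objective concrete, here is a rough sketch of the score I want to maximise, assuming a Manhattan-distance sight radius (the helper names are purely illustrative, not part of my actual code):

    from typing import Dict, List, Set, Tuple

    Cell = Tuple[int, int]

    def seen_from(cell: Cell, sight_range: int) -> Set[Cell]:
        """Cells within Manhattan distance sight_range of cell (sight is not blocked)."""
        r, c = cell
        return {(r + dr, c + dc)
                for dr in range(-sight_range, sight_range + 1)
                for dc in range(-sight_range, sight_range + 1)
                if abs(dr) + abs(dc) <= sight_range}

    def path_score(path: List[Cell], utility: Dict[Cell, float], sight_range: int) -> float:
        """Sum of utilities of all distinct cells seen along the path, divided by its length."""
        seen: Set[Cell] = set()
        for cell in path:
            seen |= seen_from(cell, sight_range)
        return sum(utility.get(c, 0.0) for c in seen) / max(len(path), 1)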

I consider BFS and DFS to be intractable, A* to be non-optimal given the lack of a suitable heuristic, and Dijkstra's inappropriate for generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection: this is the first time I've allowed revisits, so I've never had to handle loops before.

One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:

http://imagizer.imageshack.us/a/img910/3050/HGu40d.jpg

In the resultant tree, the agent can go from a node to any adjacent node that is 0-1 turn away at an intersection. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal, with far fewer cells for the algorithm to process, so if that makes the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. However, I'm still stuck on how exactly to generate a path from this tree.

asked Nov 01 '22 by thegreatjedi


1 Answer

Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.

The formalization would be:

  • States s: your NxM discrete grid.
  • Actions a: UP, DOWN, LEFT, RIGHT.
  • Reward r: the summed value of the cells that the agent can see from the destination cell s', i.e. r(s, a, s') = sum(value(seen(s'))).
  • Transition function: P(s' | s, a) = 1 if s' is the cell reached by taking action a from s and it is neither out of the boundaries nor a black cell, 0 otherwise.
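A minimal sketch of this MDP as a deterministic grid world might look as follows (the sight radius, the stay-in-place rule for illegal moves, and all names here are my own assumptions, not something prescribed by the problem):

    import numpy as np

    class GridWorldMDP:
        """Grid world where the reward for reaching s' is the summed value of the cells visible from s'."""
        ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT

        def __init__(self, values, obstacles, sight_range=2):
            self.values = values          # NxM numpy array of cell utilities
            self.obstacles = obstacles    # set of (row, col) black cells
            self.sight_range = sight_range
            self.n_rows, self.n_cols = values.shape

        def seen(self, state):
            """In-bounds cells within Manhattan distance sight_range (sight is not blocked)."""
            r, c = state
            return {(r + dr, c + dc)
                    for dr in range(-self.sight_range, self.sight_range + 1)
                    for dc in range(-self.sight_range, self.sight_range + 1)
                    if abs(dr) + abs(dc) <= self.sight_range
                    and 0 <= r + dr < self.n_rows and 0 <= c + dc < self.n_cols}

        def step(self, state, action):
            """Deterministic transition; an illegal move leaves the agent where it is."""
            dr, dc = self.ACTIONS[action]
            nxt = (state[0] + dr, state[1] + dc)
            if (not (0 <= nxt[0] < self.n_rows and 0 <= nxt[1] < self.n_cols)
                    or nxt in self.obstacles):
                nxt = state  # P(s' | s, a) = 0 for out-of-bounds or black cells
            reward = sum(self.values[cell] for cell in self.seen(nxt))
            return nxt, reward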

Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has a cost of one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything, since you will already be averaging over the number of steps.

Since the problem is discrete, the policy could be a simple softmax (or Gibbs) distribution.
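For example, a softmax policy over tabular action values could be as simple as the following sketch (the temperature tau is a free parameter you tune, higher meaning more exploration):

    import numpy as np

    def softmax_policy(q_values, tau=1.0):
        """Turn a row of Q-values into action probabilities; higher tau = more exploration."""
        prefs = np.asarray(q_values, dtype=float) / tau
        prefs -= prefs.max()              # shift for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    # sampling an action for the current state:
    # action = np.random.choice(len(q_row), p=softmax_policy(q_row, tau=1.0))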

As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no upper bound), I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find material on Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
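As a rough illustration, a tabular Q-learning loop on top of the GridWorldMDP and softmax_policy sketches above could look like this (the learning rate, horizon, episode count and temperature schedule are arbitrary placeholders):

    import numpy as np

    def q_learning(env, start, n_episodes=500, horizon=200, alpha=0.1, gamma=1.0):
        """Tabular Q-learning with a softmax behaviour policy whose temperature decays over episodes."""
        q = np.zeros((env.n_rows, env.n_cols, len(env.ACTIONS)))
        for episode in range(n_episodes):
            tau = max(0.05, 1.0 - episode / n_episodes)   # start exploratory, become greedier
            state = start
            for _ in range(horizon):
                probs = softmax_policy(q[state], tau)
                action = np.random.choice(len(env.ACTIONS), p=probs)
                nxt, reward = env.step(state, action)
                # standard Q-learning update towards the greedy one-step target
                q[state][action] += alpha * (reward + gamma * q[nxt].max() - q[state][action])
                state = nxt
        return q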

The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long-term reward as described by your MDP. So you usually initialize your policy with high exploration (e.g., a completely random policy) and, by trial and error, the algorithm will make it more deterministic and converge to the optimal one (however, sometimes a stochastic policy is also optimal). The main difference between the various RL algorithms is how they update the policy at each iteration and manage the exploration-exploitation tradeoff (how much should I explore vs. how much should I exploit the information I already have).

As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).

I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the maximum number of steps, and whether the initial position is fixed or random), I can test them more precisely.

answered Nov 15 '22 by Simon