alt text http://img693.imageshack.us/img693/724/markov.png
I'm a bit confused about some points here:
There is a pattern to dealing with most MDP problems, but I think you've probably omitted some information from the problem description. Most likely it has to do with the state you're trying to reach, or the way an episode ends (what happens if you run off the edge of the grid?). I've done my best to answer your questions, and I've appended a primer on the process I use to deal with these types of problems.
Firstly, utility is a fairly abstract measure of how much you want to be in a given state. It's definitely possible to have two states with equal utility, even when you measure utility with simple heuristics (Euclidean or Manhattan distance). In this case, I assume that the utility value and reward are interchangeable.
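To make the point about ties concrete, here is a minimal sketch that treats negative Manhattan distance to a goal as the utility. The goal (3,3) is borrowed from the target state discussed further down; the function name `manhattan_utility` and the choice of negative distance are my own assumptions for illustration, not something given in the problem.

```python
def manhattan_utility(state, goal=(3, 3)):
    """Hypothetical heuristic utility: the closer to the goal, the higher the value."""
    (x, y), (gx, gy) = state, goal
    return -(abs(x - gx) + abs(y - gy))

# Two different states, each one step away from the goal, tie on utility.
print(manhattan_utility((2, 3)), manhattan_utility((3, 2)))
```

Any distance-based heuristic will produce ties like this, so equal utilities are not a sign that something has gone wrong.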
In the long term, the objective in these types of problems tends to be: how do you maximise your expected (long-term) reward? The discount factor, gamma, controls how much emphasis you place on the current state versus where you would like to end up. Effectively, you can think of gamma as a spectrum going from 'do the thing that benefits me most in this timestep' at one extreme to 'explore all my options, and go back to the best one' at the other. Sutton and Barto, in their book on reinforcement learning, have some really nice explanations of how this works.
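As a rough illustration of the spectrum gamma covers, the sketch below weights a reward received t steps in the future by gamma**t and sums the result. The reward sequence and the function name `discounted_return` are made up for the example; they are not part of the original problem.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward at step t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episode: nothing for three steps, then a payoff of 10.
rewards = [0, 0, 0, 10]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma at 0 the delayed payoff is ignored entirely, with gamma at 1 it counts in full, and values in between trade the two off.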
Before you get started, go back through the question and make sure that you can confidently answer the following questions.
So the answers to the questions?
Start State   Action   Final State   Probability
---------------------------------------------------
(0,0)         E        (0,0)         0.3
(0,0)         E        (1,0)         0.7
(0,0)         E        (2,0)         0
...
(0,0)         E        (0,1)         0
...
(0,0)         E        (4,4)         0
(0,0)         N        (0,0)         0.3
...
(4,4)         W        (3,4)         0.7
(4,4)         W        (4,4)         0.3
How can we check that this makes sense for this problem?
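One check that always applies: for every (start state, action) pair, the probabilities over all final states must sum to 1. The sketch below rebuilds the transition model and asserts exactly that. It assumes a 5x5 grid with states (0,0) to (4,4), that N increases the second coordinate and E increases the first, and that a move which would leave the grid keeps you where you are. That last rule is a guess on my part, since the problem description doesn't say what happens at the edges. The names `transitions`, `MOVES` and `SIZE` are illustrative.

```python
from itertools import product

SIZE = 5                       # assumption: states run from (0,0) to (4,4)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
P_MOVE, P_STAY = 0.7, 0.3      # values taken from the table

def transitions(state, action):
    """Return {final_state: probability} for one (start state, action) pair."""
    x, y = state
    dx, dy = MOVES[action]
    target = (x + dx, y + dy)
    # Assumption: a move that would leave the grid keeps the agent in place.
    if not (0 <= target[0] < SIZE and 0 <= target[1] < SIZE):
        target = state
    probs = {state: P_STAY}
    probs[target] = probs.get(target, 0.0) + P_MOVE
    return probs

# Sanity check: each row group of the table must sum to 1.
for state in product(range(SIZE), repeat=2):
    for action in MOVES:
        total = sum(transitions(state, action).values())
        assert abs(total - 1.0) < 1e-9, (state, action, total)
```

If the real problem ends the episode or applies a penalty at the edge, only the off-grid branch needs to change; the sum-to-one check stays the same.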
Edit: answering the request for the transition probabilities to the target state. The notation below assumes that u is the start state, a is the action and v is the final state.
P( v=(3,3) | u=(2,3), a=E ) = 0.7
P( v=(3,3) | u=(4,3), a=W ) = 0.7
P( v=(3,3) | u=(3,2), a=N ) = 0.7
P( v=(3,3) | u=(3,4), a=S ) = 0.7
P( v=(3,3) | u=(3,3) ) = 0.3