
Understanding The Value Iteration Algorithm of Markov Decision Processes

In learning about MDPs I am having trouble with value iteration. Conceptually this example is very simple and makes sense:

If you have a 6-sided die and you roll a 4, 5, or 6, you keep that amount in $, but if you roll a 1, 2, or 3, you lose your bankroll and end the game.

In the beginning you have $0 so the choice between rolling and not rolling is:

k = 1
If I roll : 1/6*0 + 1/6*0 + 1/6*0 + 1/6*4 + 1/6*5 + 1/6*6 = 2.5 
If I don't roll : 0
since 2.5 > 0 I should roll

k = 2:
If I roll and get a 4:
    If I roll again: 4 + 1/6*(-4) + 1/6*(-4) + 1/6*(-4) + 1/6*4 + 1/6*5 + 1/6*6 = 4.5
    If I don't roll: 4
    since 4.5 is greater than 4 I should roll

If I roll and get a 5:
    If I roll again: 5 + 1/6*(-5) + 1/6*(-5) + 1/6*(-5) + 1/6*4 + 1/6*5 + 1/6*6 = 5
    If I don't roll: 5
    Since the difference is 0 I should not roll

If I roll and get a 6:
    If I roll again: 6 + 1/6*(-6) + 1/6*(-6) + 1/6*(-6) + 1/6*4 + 1/6*5 + 1/6*6 = 5.5
    If I don't roll: 6
    Since the difference is -0.5 I should not roll
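
For concreteness, here is the same one-step lookahead as a few lines of Python (the helper name expected_roll is mine, just to reproduce the numbers above):

def expected_roll(b):
    # Expected bankroll after one more roll from balance b:
    # rolling 1-3 loses the whole balance, rolling 4-6 adds that amount.
    lose = 3 * (-b) / 6.0
    win = (4 + 5 + 6) / 6.0
    return b + lose + win

for b in (0, 4, 5, 6):
    action = "roll" if expected_roll(b) > b else "don't roll"
    print(b, expected_roll(b), action)
# 0 2.5 roll
# 4 4.5 roll
# 5 5.0 don't roll
# 6 5.5 don't roll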

What I am having trouble with is converting that into Python code, not because I am bad at Python, but because my understanding of the pseudocode may be wrong, even though the Bellman equation does make sense to me.

I borrowed the Berkeley code for value iteration and modified it to:

isBadSide = [1, 1, 1, 0, 0, 0]  # 1 marks a losing side of the die

def R(s):
    # reward for side s: lose s on a bad side, keep s otherwise
    if isBadSide[s-1]:
        return -s
    return s

def T(s, a, N):
    # transition model as a list of (probability, next-state) pairs
    return [(1./N, s)]

def value_iteration(N, epsilon=0.001):
    "Solving an MDP by value iteration. [Fig. 17.4]"
    U1 = dict([(s, 0) for s in range(1, N+1)])
    while True:
        U = U1.copy()
        delta = 0
        for s in range(1, N+1):
            U1[s] = R(s) + max([sum([p * U[s1] for (p, s1) in T(s, a, N)])
                                        for a in ('s', 'g',)])

            delta = max(delta, abs(U1[s] - U[s]))

        if delta < epsilon:
            return U

print(value_iteration(6))
# {1: -1.1998456790123457, 2: -2.3996913580246915, 3: -3.599537037037037, 4: 4.799382716049383, 5: 5.999228395061729, 6: 7.199074074074074}

This is the wrong answer. Where is the bug in this code? Or is it an issue with my understanding of the algorithm?

asked Oct 17 '22 by Sam Hammamy


1 Answer

Let B be your current balance.

If you choose to roll, the expected reward is 2.5 - B * 0.5: you lose B with probability 3/6 and gain 4, 5, or 6 with probability 1/6 each.

If you choose not to roll, the expected reward is 0.

So, the policy is this: If B < 5, roll. Otherwise, don't.

And the expected reward on each step when following that policy is V = max(0, 2.5 - B * 0.5).
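
A quick sanity check of that closed form (the function name is mine):

def roll_reward(B):
    # lose B with probability 3/6, gain 4, 5, or 6 with probability 1/6 each
    return 0.5 * (-B) + (4 + 5 + 6) / 6.0

for B in range(8):
    print(B, "roll" if roll_reward(B) > 0 else "stop")
# B = 0..4 -> roll, B = 5..7 -> stop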


Now, if you want to express it in terms of the Bellman equation, you need to incorporate the balance into the state.

Let the state <Balance, GameIsOver> consist of the current balance and a flag indicating whether the game is over.

  • Action stop:
    • turns the state <B, false> into <B, true>
  • Action roll:
    • turns <B, false> into <0, true> with probability 1/2
    • turns <B, false> into <B + 4, false> with probability 1/6
    • turns <B, false> into <B + 5, false> with probability 1/6
    • turns <B, false> into <B + 6, false> with probability 1/6
  • No action can turn <B1, true> into <B2, false>

Using the usual notation for the optimal policy π and value function V:

π(<B, false>) = "roll", if B < 5

π(<B, false>) = "stop", if B >= 5

V(<B, false>) = 2.5 - B * 0.5, if 0 < B < 5

V(<B, false>) = 0, if B >= 5

(For B = 0 the recursion goes one level deeper, because a roll of 4 leaves you below the threshold: V(<0, false>) = (1/6)*(4 + V(<4, false>)) + (1/6)*5 + (1/6)*6 ≈ 2.58.)
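
Putting this together, a minimal value-iteration sketch over the enlarged state space might look like this (B_MAX and the truncation near the cap are my additions, not part of the formulation above; they are harmless because stopping is optimal for every B >= 5):

B_MAX = 20            # assumption: cap the balance; values for small B are unaffected
GOOD = (4, 5, 6)      # winning faces, probability 1/6 each; faces 1-3 lose the balance

def value_iteration(epsilon=0.001):
    # V[b] = expected future gain from the non-terminal state <b, false>;
    # terminal states <b, true> contribute 0 and are left implicit.
    V = {b: 0.0 for b in range(B_MAX + 1)}
    while True:
        delta = 0.0
        for b in range(B_MAX + 1):
            stop = 0.0  # stopping ends the game with no further reward
            if b + max(GOOD) <= B_MAX:
                # roll: lose b with probability 1/2, else gain d and keep playing
                roll = 0.5 * (-b) + sum(d + V[b + d] for d in GOOD) / 6.0
            else:
                roll = float("-inf")  # truncation: force "stop" near the cap
            new_v = max(stop, roll)
            delta = max(delta, abs(new_v - V[b]))
            V[b] = new_v
        if delta < epsilon:
            return V

V = value_iteration()
print(V[0])        # ~2.5833: roll (and roll again on a 4)
print(V[4])        # 0.5: one more roll is worth it
print(V[5], V[6])  # 0.0 0.0: stop

Note that V[0] comes out near 2.58 rather than 2.5, because a roll of 4 leaves you below the threshold and the optimal policy rolls again.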

answered Oct 21 '22 by Anton