I am working on an Agent class in Python 2.7.11 that uses a Markov Decision Process (MDP) to search for an optimal policy π in a GridWorld. I am implementing basic value iteration for 100 iterations over all GridWorld states, using the following Bellman equation:

    V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

This equation is derived from taking the maximum of a Q-value function, which is what I am using within my program:

    Q_{k+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ],  with  V_{k+1}(s) = max_a Q_{k+1}(s, a)
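To make the backup concrete, here is a tiny worked example of a single Q-value computation for one (state, action) pair; the states, transition probabilities, and rewards are made up purely for illustration.

# One Bellman backup for the hypothetical pair (state 'A', action 'north')
gamma = 0.9
values = {'A': 0.0, 'B': 1.0}            # V_k from the previous sweep
transitions = [('B', 0.8), ('A', 0.2)]   # (successor s', T(s, a, s'))
rewards = {('A', 'north', 'B'): 0.0, ('A', 'north', 'A'): 0.0}

q = sum(prob * (rewards[('A', 'north', nxt)] + gamma * values[nxt])
        for nxt, prob in transitions)
print q  # 0.8 * (0.0 + 0.9 * 1.0) + 0.2 * (0.0 + 0.9 * 0.0) = 0.72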
My Agent is constructed with an MDP, which is an abstract class exposing the following methods:
# Returns all states in the GridWorld
def getStates()
# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)
# Returns all possible successor states to transition to from the current state
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)
# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
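For reference, a minimal stand-in MDP satisfying this interface might look like the sketch below; the two states, their transitions, and the rewards are entirely hypothetical and only illustrate how the four methods are consumed.

class TwoStateMDP:
    """A hypothetical two-state MDP implementing the interface above."""

    def getStates(self):
        return ['A', 'B']

    def getPossibleActions(self, state):
        # Only one legal action from A; B is treated as terminal
        return ['go'] if state == 'A' else []

    def getTransitionStatesAndProbs(self, state, action):
        # From A, 'go' reaches B with probability 0.9 and stays in A with 0.1
        return [('B', 0.9), ('A', 0.1)]

    def getReward(self, state, action, nextState):
        # Reward of 1 for arriving in B, 0 otherwise
        return 1.0 if nextState == 'B' else 0.0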
My Agent is also passed a discount factor and a number of iterations. I am also making use of a dictionary to keep track of my values. Here is my code:
import util  # project module that provides Counter, a dict with default value 0

class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter()  # A Counter is a dictionary with default 0

        for transition in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value
                # Write the best Q value for this state into the copy,
                # leaving self.values untouched for the rest of the sweep
                valuesCopy.update({state: convergedValue})
            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        qValue = 0
        for successor, probability in successors:
            # The reward depends on which successor state is actually reached
            reward = self.mdp.getReward(state, action, successor)
            # The Q value equation: Q(s,a) = Σ_s' T(s,a,s')[R(s,a,s') + γ V(s')]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue
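For context, the agent would be constructed and queried roughly like this, using the hypothetical TwoStateMDP sketched earlier as a stand-in for the real GridWorld MDP:

mdp = TwoStateMDP()
agent = IterationAgent(mdp, discount = 0.9, iterations = 100)
for s in mdp.getStates():
    print s, agent.values[s]  # the estimated V(s) after 100 sweeps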
This implementation is correct, though I am unsure why I need valuesCopy to update my self.values dictionary successfully. I have tried the following to omit the copying, but it does not work, since it returns slightly incorrect values:
for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value
        self.values.update({state: convergedValue})
My question is: why is keeping a copy of my self.values dictionary necessary to update my values correctly, when valuesCopy = self.values.copy() makes a fresh copy of the dictionary every iteration anyway? Shouldn't updating the values in the original result in the same update?
There's an algorithmic difference between having and not having the copy:

# You update your copy here, so the original stays unchanged for the rest of
# the sweep, which is not the case if you don't make the copy
valuesCopy.update({state: convergedValue})

# With the copy, this line reads the old value stored in self.values,
# not the one already updated during the current sweep
qValue += probability * (reward + (self.discount * self.values[successor]))
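You can see the difference in a tiny standalone example (the numbers here are made up): state B's backup depends on state A's value. A synchronous sweep (with the copy) reads A's old value when updating B; an in-place sweep (no copy) reads the value of A that was just overwritten, so the two variants disagree after a single sweep.

gamma = 0.9
old = {'A': 0.0, 'B': 0.0}

# Synchronous sweep (with the copy): every backup reads the old values
new = dict(old)
new['A'] = 1.0 + gamma * old['B']    # suppose A's best backup is 1 + gamma * V(B)
new['B'] = 0.0 + gamma * old['A']    # B's backup reads A's OLD value
print new                            # A: 1.0, B: 0.0

# In-place sweep (no copy): later backups see values already updated this sweep
vals = dict(old)
vals['A'] = 1.0 + gamma * vals['B']
vals['B'] = 0.0 + gamma * vals['A']  # B's backup reads A's NEW value
print vals                           # A: 1.0, B: 0.9

Both variants eventually converge to the same fixed point, but after a fixed number of sweeps (100 in your case) the in-place version yields different intermediate values than the synchronous value iteration your equation defines, which is why you see slightly incorrect results without the copy.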