 

What is the difference between value iteration and policy iteration? [closed]



Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.

[Figure: policy iteration vs. value iteration pseudocode, from Sutton and Barto]

Key points:

  1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
  2. Value iteration includes: finding the optimal value function + one policy extraction. There is no repetition of the two, because once the value function is optimal, the policy derived from it should also be optimal (i.e. converged).
  3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and for finding the optimal value function are highly similar except for a max operation (as highlighted; both update rules are written out right after this list).
  5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.
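
For reference, here are the two update rules that point 4 compares, in the book's notation (transition model p(s', r | s, a), discount γ); only the max distinguishes them:

```latex
% Iterative policy evaluation, for a fixed policy \pi:
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_k(s') \bigr]

% Value iteration: the same backup, but with a max over actions:
v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_k(s') \bigr]
```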

In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.

I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.


In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.
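
A minimal sketch of that loop in Python, assuming a small tabular MDP described by a hypothetical `P[s][a]` list of `(prob, next_state, reward)` transitions, a discount `gamma`, and a tolerance `theta` (all names invented for illustration):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Tabular policy iteration: alternate policy evaluation and greedy improvement."""
    policy = np.zeros(n_states, dtype=int)    # start from an arbitrary (here: all-zeros) policy
    V = np.zeros(n_states)

    def q_value(s, a):
        # One-step lookahead: expected return of taking action a in state s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: iterate the Bellman operator of the current policy to convergence.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q_value(s, a))
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a
        if stable:                             # no state changed its action => policy is optimal
            return policy, V
```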

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the Bellman optimality operator.
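
And a matching sketch of value iteration over the same hypothetical `P` / `gamma` / `theta` representation, with the greedy policy extracted once at the end:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Tabular value iteration: sweep the Bellman optimality backup, then extract the greedy policy."""
    V = np.zeros(n_states)                     # arbitrary initial value function

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions instead of following a fixed policy.
            v_new = max(q_value(s, a) for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # One-time policy extraction from the (near-)optimal value function.
    policy = np.array([max(range(n_actions), key=lambda a: q_value(s, a))
                       for s in range(n_states)])
    return policy, V
```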

In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the Bellman optimality operator contains a max, which is nonlinear and therefore has different properties. In addition, it is possible to use hybrid methods between pure value iteration and pure policy iteration; one such scheme is sketched below.
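
One common hybrid of this kind is modified policy iteration, where the evaluation step is truncated to a fixed number of sweeps instead of being run to convergence. A rough sketch, using the same hypothetical MDP representation as above:

```python
import numpy as np

def modified_policy_iteration(P, n_states, n_actions, gamma=0.9, eval_sweeps=5, theta=1e-8):
    """Hybrid of the two: truncate policy evaluation to `eval_sweeps` sweeps.
    eval_sweeps=1 behaves essentially like value iteration; a very large value
    approaches full policy iteration."""
    V = np.zeros(n_states)

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Greedy (improvement) step, plus a Bellman-residual check for convergence.
        policy = np.array([max(range(n_actions), key=lambda a: q_value(s, a))
                           for s in range(n_states)])
        residual = max(abs(q_value(s, policy[s]) - V[s]) for s in range(n_states))
        if residual < theta:
            return policy, V
        # Truncated evaluation of the greedy policy: only a few sweeps, not to convergence.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = q_value(s, policy[s])
```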


The basic difference is:

In Policy Iteration, you randomly select a policy and find the value function corresponding to it, then find a new (improved) policy based on the previous value function, and so on; this leads to the optimal policy.

In Value Iteration, you randomly select a value function, then find a new (improved) value function in an iterative process until reaching the optimal value function, and then derive the optimal policy from that optimal value function.

Policy iteration works on the principle "policy evaluation → policy improvement".

Value iteration works on the principle "optimal value function → optimal policy".
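
Written out, that final "optimal value function → optimal policy" step is just a greedy one-step lookahead (same notation as the update rules earlier on this page):

```latex
\pi_*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr]
```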


The main difference in speed is due to the max operation in every iteration of value iteration (VI).

In VI, each state uses just one action (the one with the maximum utility value) to compute its updated utility, but it first has to evaluate all possible actions via the Bellman equation in order to find that action.

In policy iteration (PI), this max operation is omitted in step 1 (policy evaluation) by just following the intermediate policy to choose the action.

If there are N possible actions, VI has to evaluate the Bellman equation N times for each state and then take the max, whereas PI evaluates it only once per state (for the action prescribed by the current policy).
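
Concretely, the per-state backup differs only in how many one-step lookaheads it needs. A tiny self-contained illustration (hypothetical names, same `(prob, next_state, reward)` transition lists as in the sketches above):

```python
def backup(s, V, P, gamma, n_actions, policy=None):
    """One state's backup: max over all N actions (value iteration) versus a single
    lookahead for the policy's action (policy evaluation inside policy iteration)."""
    def q(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    if policy is None:
        return max(q(a) for a in range(n_actions))   # N Bellman lookaheads + a max
    return q(policy[s])                              # a single Bellman lookahead
```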

However, in PI there is a policy improvement step that still uses the max operator and is as slow as a step in VI, but since PI converges in fewer iterations, this step does not happen as often as in VI.