In reinforcement learning, I'm trying to understand the difference between policy iteration and value iteration. There are some general answers to this out there, but I have two specific queries which I cannot find an answer to.
1) I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean? I thought that both methods just take each state, then look at all the other states it can reach, and compute the value from this -- either by marginalising over the policy's action distribution (policy iteration) or by taking the argmax with respect to the action values (value iteration). So why is there any notion of the "direction" in which each method "moves"?
2) Policy iteration requires an iterative process during policy evaluation to find the value function. However, value iteration just requires one step. Why is this different? Why does value iteration converge in just one step?
Thank you!
The answer provided by @Nick Walker is right and quite complete; however, I would like to add a graphical explanation of the difference between value iteration and policy iteration, which may help to answer the second part of your question.
Both methods, PI and VI, follow the same working principle based on Generalized Policy Iteration. This basically means that they alternate between improving the policy (which requires knowing its value function) and computing the value function of the new, improved policy.
At the end of this iterative process, both the value function and the policy converge to the optimum.
However, it turns out that it is not necessary to compute the full value function exactly; a single evaluation sweep is enough for convergence. In the following figure, (b) represents the operations performed by policy iteration, where the full value function is computed, while (d) shows how value iteration works.
Obviously this representation of both methods is simplistic, but it highlights the difference between the key ideas behind each algorithm.
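In case the figure does not come through, here is a minimal Python sketch of the same contrast. It assumes a tabular MDP described by `P[s][a]`, a list of `(prob, next_state, reward)` tuples, and a discount factor `gamma`; these names are purely illustrative, not part of any particular library.

```python
import numpy as np

def evaluate_policy_fully(P, policy, gamma, theta=1e-8):
    """(b) Policy iteration's inner loop: sweep until V converges for the current policy."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            a = policy[s]
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def one_value_iteration_sweep(P, V, gamma):
    """(d) Value iteration: a single sweep that maximizes over actions directly."""
    V_new = np.copy(V)
    for s in range(len(P)):
        V_new[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
    return V_new
```

The point of the sketch is only that (b) runs its sweep loop to convergence before the policy is touched, whereas (d) does one sweep and folds the improvement into the `max`.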
I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean?
I can't find anything online that describes policy iteration and value iteration in terms of direction, and to my knowledge this is not a common way to explain the difference between them.
One possibility is that someone was referring to the visual impression of values propagating in value iteration. After the first sweep, values are correct on a 1-timestep horizon. Each value correctly tells you what to do to maximize your cumulative reward if you have 1 timestep to live. This means that states that transition to the terminal state and receive a reward have positive values, while almost everything else is 0. With each sweep, the values become correct for a horizon that is one timestep longer. So the values creep backwards from the terminal state towards the start state as the horizon expands. In policy iteration, instead of just propagating values back one step, you calculate the complete value function for the current policy. Then you improve the policy and repeat. I can't say that this has a forward connotation to it, but it certainly lacks the backwards appearance. You may want to see Pablo's answer to a similar question for another explanation of the differences that may help you contextualize what you have heard.
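To make the "creeping backwards" concrete, here is a tiny hand-made example: a five-state deterministic chain with a single action per state (the setup and names are mine, purely for illustration). With synchronous sweeps, each sweep pushes a nonzero value exactly one state further back from the terminal state.

```python
import numpy as np

# Hypothetical 5-state deterministic chain, single action per state:
# every state steps one place to the right, and stepping out of the
# last state into the terminal state pays a reward of 1.
n_states, gamma = 5, 0.9
V = np.zeros(n_states)

def reward(s):
    return 1.0 if s == n_states - 1 else 0.0

def next_value(s, V):
    # The last state transitions to the terminal state, whose value is 0.
    return 0.0 if s == n_states - 1 else V[s + 1]

for sweep in range(1, n_states + 1):
    # Synchronous sweep: every new value is computed from the previous sweep's values.
    V = np.array([reward(s) + gamma * next_value(s, V) for s in range(n_states)])
    print(f"sweep {sweep}: {np.round(V, 3)}")

# sweep 1: only the state adjacent to the terminal state is nonzero (1.0);
# each later sweep pushes a nonzero value one state further back,
# until sweep 5 finally assigns a value to the start state.
```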
It's also possible that you heard about this forwards-backwards contrast in regards to something related but different: implementations of temporal difference learning algorithms. In this case, the direction refers to the direction in which you look when making an update to state-action values: forwards means you need information about the results of future actions, while backwards means you only need information about things that happened previously. You can read about this in chapter 12 of Reinforcement Learning: An Introduction, 2nd edition.
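For completeness, a rough sketch of that backward view, using the state-value version of tabular TD(λ) with accumulating eligibility traces (variable names are mine): the update only touches states visited in the past, which is what "backwards" refers to here.

```python
import numpy as np

def td_lambda_backward_update(V, E, s, r, s_next, alpha, gamma, lam):
    """One backward-view TD(lambda) step with accumulating eligibility traces.

    The TD error computed from the CURRENT transition is credited to states
    visited in the PAST via their traces, so no information about future
    actions is needed -- that is the 'backwards' direction.
    """
    E[s] += 1.0                             # bump the trace of the state just visited
    delta = r + gamma * V[s_next] - V[s]    # one-step TD error
    V += alpha * delta * E                  # update every eligible (recently visited) state
    E *= gamma * lam                        # traces decay as time moves on
    return V, E
```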
Why does policy iteration have to do a bunch of value function calculations when value iteration seemingly just does one that ends up being optimal? Why does value iteration converge in just one step?
In policy evaluation, we already have a policy and we're just calculating the value of taking actions as it dictates. It repeatedly looks at each state and moves the state's value towards the values of the states that the policy's action will transition to (until the values stop changing and we consider it converged). This value function is not optimal. It's only useful because we can use it in combination with the policy improvement theorem to improve the policy. The expensive process of extracting a new policy, which requires us to maximize over actions in a state, happens infrequently, and policies seem to converge pretty quickly. So even though the policy evaluation step looks like it would be time consuming, PI is actually pretty fast.
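A minimal sketch of this evaluate-then-improve loop might look as follows, assuming the MDP is given as `P[s][a]` = list of `(prob, next_state, reward)` tuples with discount `gamma` (all names are illustrative, not from any specific library). Note that the maximization over actions only happens in step 2, which runs far less often than the evaluation sweeps in step 1.

```python
import numpy as np

def q_value(P, V, s, a, gamma):
    # Expected return of taking action a in state s and then following V.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, n_actions, gamma, theta=1e-8):
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # 1) Policy evaluation: many cheap sweeps, no maximization over actions.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_value(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2) Policy improvement: the (much rarer) maximization over actions.
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q_value(P, V, s, a, gamma))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V
```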
Value iteration is just policy iteration where you do exactly one iteration of policy evaluation and extract a new policy at the same time (maximizing over actions is the implicit policy extraction). Then you repeat this iterate-extract procedure over and over until the values stop changing. The fact that these steps are merged together makes it look more straightforward on paper, but maximizing at each sweep is expensive and means that value iteration is often slower than policy iteration.
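Under the same hypothetical `P[s][a]` representation, the merged update can be sketched like this; the `max` over actions inside the sweep is the implicit policy extraction, and the explicit policy is only read off once at the very end.

```python
import numpy as np

def value_iteration(P, n_actions, gamma, theta=1e-8):
    """One-sweep evaluation and implicit greedy improvement merged into one update."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # The max over actions IS the implicit policy extraction,
            # and it is paid for on every single sweep.
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # The greedy policy is only made explicit once, at the end.
    policy = np.array([
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ])
    return policy, V
```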