I have a concern in understanding why a target network is necessary in DQN? I’m reading paper on “human-level control through deep reinforcement learning”
I understand Q-learning. Q-learning is value-based reinforcement learning algorithm that learns “optimal” probability distribution between state-action that will maximize it’s long term discounted reward over a sequence of timesteps.
The Q-learning is updated using the bellman equation, and a single step of the q-learning update is given by
Q(S, A) = Q(S, A) + $\alpha$[R_(t+1) + $\gamma$ (Q(s’,a;’) - Q(s,a)]
Where alpha and gamma are learning and discount factors. I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate datasets provided to learn the probability distribution.
This is where I fail.
r + $gamma$ (argmax(Q(s’,a’))
— This says that the reward + gamma * my prediction of the return given that I take what I think is the best action in the current state and follow my policy from then on.So, in summary a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
But I do not understand how it is going to solve it?
Using a separate target network, updated every so many steps with a copy of the latest learned parameters, helps keep runaway bias from bootstrapping from dominating the system numerically, causing the estimated Q values to diverge.
It's simplicity, robustness, speed and the achievement of higher scores in standard RL tasks made policy gradients and DQN obsolete.
Experience replay is a crucial component of off-policy deep reinforcement learning algorithms, improving the sample efficiency and stability of training by storing the previous environment interactions experienced by an agent [1].
A key reason for using replay memory is to break the correlation between consecutive samples. If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient learning.
So, in summary a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (bunch of layers of the same size fully connected). The effect you typically see with this is referred to as "catastrophic forgetting" and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out starts making awful decisions again even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With