 

Why is a target network required?


I'm having trouble understanding why a target network is necessary in DQN. I'm reading the paper "Human-level control through deep reinforcement learning".

I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an "optimal" action-value function over state-action pairs, i.e. the long-term discounted reward it can expect over a sequence of timesteps.

Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$

where $\alpha$ and $\gamma$ are the learning rate and discount factor. I can accept that the reinforcement learning algorithm may become unstable and diverge when a neural network is used as the function approximator.
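For concreteness, here is a minimal sketch of that tabular update in Python; the table size, learning rate, and discount factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the tabular Q-learning update above.
# The table size, learning rate, and discount factor are illustrative assumptions.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One update step: move Q(S, A) toward R + gamma * max_a' Q(S', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```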

  • The experience replay buffer is used so that we do not forget past experiences and to de-correlate the samples the network learns from (a minimal replay-buffer sketch follows this list).

  • This is where I get stuck.

  • Let me break the paragraph from the paper down here for discussion
    • "The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution": I understand this part. Periodically changing the Q-network can lead to instability and to shifts in the data distribution. For example, the updated policy might suddenly start always taking left turns, which changes the data we collect.
    • "and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$": this target is the reward plus $\gamma$ times my prediction of the return, assuming I take what I currently think is the best action in the next state and follow my policy from then on.
    • We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
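As a concrete illustration of the replay-buffer bullet above, here is a minimal sketch; the capacity and the exact transition format are assumptions, not the paper's implementation details.

```python
import random
from collections import deque

# Minimal sketch of an experience replay buffer: store transitions as they
# happen, then learn from random minibatches so consecutive, highly
# correlated samples are not fed to the network back-to-back.
class ReplayBuffer:
    def __init__(self, capacity=100_000):   # capacity chosen arbitrarily here
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```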

So, in summary, a target network is required because the network keeps changing at each timestep and the "target values" are therefore being updated at each timestep?

But I do not understand how a target network actually solves this.

asked Jan 17 '19 by tandem




1 Answer

"So, in summary, a target network is required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?"

The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you update exactly one state/action value at each timestep, whereas with DQN you update many, which you understand. The problem this causes is that an update can also change the action values for the very next state you will be in, instead of those values being guaranteed to be stable as they are in tabular Q-learning.

This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like lunar lander with this sort of network (the simple state-vector version, not the pixel one) and you track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, and then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle continues endlessly regardless of how long you let it run.

Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well; I'm going to try it out for a bit until I find something better," as opposed to "I'm going to retrain myself on how to play this entire game after every move." By giving your network more time to consider many recent actions instead of updating all the time, it hopefully finds a more robust model before you start using it to choose actions.
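To make the "periodically updated target" idea concrete, here is a hedged PyTorch-style sketch; the network shape, copy interval, and loss are placeholder assumptions, not the answerer's or the paper's exact setup.

```python
import copy
import torch
import torch.nn as nn

# Online network and a frozen copy used only to compute TD targets.
# Sizes, learning rate, and the copy interval are illustrative assumptions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, copy_every = 0.99, 1_000

def train_step(step, s, a, r, s_next, done):
    """One gradient step on a minibatch of transitions (as tensors)."""
    with torch.no_grad():
        # Targets come from the frozen network, so they do not shift every
        # time q_net's weights move.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % copy_every == 0:
        # "Try the current idea for a while": adopt the latest weights as
        # the new target only every copy_every steps.
        target_net.load_state_dict(q_net.state_dict())
```

Setting copy_every to 1 would collapse this back to using the constantly moving network as its own target, which is exactly the instability described above.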


On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

answered Oct 06 '22 by Nick Larsen