I'm using the DQN algorithm to train an agent in my environment.
I have already adjusted some hyperparameters (network architecture, exploration, learning rate), which gave me some decent results, but they are still not as good as they should/could be. The rewards per episode are increasing during training. The Q-values are converging, too (see figure 1). However, for all the hyperparameter settings I tried, the Q-loss is not converging (see figure 2). I assume that this lack of convergence of the Q-loss might be the limiting factor for better results.
Figure 1: Q-value of one discrete action during training
Figure 2: Q-loss during training
I'm using a target network which is updated every 20k timesteps. The Q-loss is calculated as MSE.
Do you have any ideas why the Q-loss is not converging? Does the Q-loss have to converge for the DQN algorithm? I'm also wondering why the Q-loss is not discussed in most papers.
Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and it does not guarantee convergence.
Error, or loss, is measured as the difference between the predicted result and the actual result. In a DQN, we can represent the loss function as the squared error between the target Q-value and the predicted Q-value [1].
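As a rough illustration, here is a minimal sketch of that squared-error loss in PyTorch; the names (q_net, target_net, gamma and the batch tensors) are illustrative placeholders, not part of the original answer:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) predicted by the online network for the actions actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target r + gamma * max_a' Q'(s', a'), computed with the frozen target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    # Squared error (MSE) between target and prediction
    return F.mse_loss(q_pred, q_target)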
DQN is a general-purpose, model-free algorithm and has been shown to perform well in a variety of tasks, including Atari 2600 games, since it was first proposed by Mnih et al. However, like many other reinforcement learning (RL) algorithms, DQN suffers from poor sample efficiency when rewards in an environment are sparse.
Deep Q-learning is a value-based method, while Policy Gradient is a policy-based method. A policy-gradient method can learn a stochastic policy (it outputs a probability for every action), which is useful for handling the exploration/exploitation trade-off. Often π is simpler than V or Q.
Yes, the loss must converge, because the loss value measures the difference between the expected (target) Q-value and the current Q-value. Only when the loss converges does the current estimate approach the optimal Q-value. If the loss diverges, it means your approximation is becoming less and less accurate.
Maybe you can try adjusting the update frequency of the target network, or check the gradients of each update (add gradient clipping). The addition of the target network increases the stability of Q-learning.
In DeepMind's 2015 Nature paper, the authors state:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the target y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q' and use Q' for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using the older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
Human-level control through deep reinforcement learning, Mnih et al., 2015
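To make the every-C-updates rule concrete, here is a minimal sketch of a hard target-network update in PyTorch; the tiny network, C = 100 and the commented-out training step are illustrative placeholders, not values from the paper:

import copy
import torch
import torch.nn as nn

# Illustrative online Q-network for a 4-dimensional state and 2 discrete actions
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # clone Q to obtain the target network Q'
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
C = 100  # target update period ("every C updates" in the quote)

for step in range(1, 10001):
    # ... sample a minibatch, compute the Q-learning loss against target_net,
    # then optimizer.zero_grad(); loss.backward(); optimizer.step() ...

    # every C updates, copy the online weights into the frozen target network
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())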
I ran an experiment for another person who asked a similar question in the CartPole environment, and a target update frequency of 100 solves the problem (the agent achieves the maximum of 200 steps).
Plots of the average loss during training for target update frequencies C = 2, C = 10, C = 100, C = 1000, and C = 10000.
If the divergence of the loss is caused by exploding gradients, you can clip the gradients. In DeepMind's 2015 DQN, the authors clipped the gradient by limiting each value to the range [-1, 1]. In the other case, the authors of Prioritized Experience Replay clip the gradient by limiting its norm to 10. Here are the examples:
DQN gradient clipping:
optimizer.zero_grad()
loss.backward()
# clip each gradient value to the range [-1, 1]
for param in model.parameters():
    param.grad.data.clamp_(-1, 1)
optimizer.step()
PER gradient clipping:
optimizer.zero_grad()
loss.backward()
# clip the global gradient norm to 10
if self.grad_norm_clipping:
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 10)
optimizer.step()
I think it's normal for the Q-loss not to converge, since your data keeps changing as your policy updates. This is not like supervised learning, where the data never changes and you can make multiple passes over it to make sure your weights fit it well.
Another thing I found is that slightly updating the target network at every timestep (a soft update) worked better for me than fully updating it every X timesteps (a hard update); see the sketch below.
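For reference, here is a minimal sketch of such a soft (Polyak) update in PyTorch; the networks and the interpolation coefficient tau = 0.005 are illustrative assumptions, not values from the answer above:

import copy
import torch
import torch.nn as nn

# Illustrative online and target Q-networks with identical architectures
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)

def soft_update(online, target, tau=0.005):
    # Move each target parameter a small step toward the online parameter:
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    with torch.no_grad():
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

# Called once per training step, instead of a full copy every X steps
soft_update(q_net, target_net)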