
Deep Q Network is not learning

I tried to code a Deep Q Network to play Atari games using Tensorflow and OpenAI's Gym. Here's my code:

import tensorflow as tf
import gym
import numpy as np
import os

env_name = 'Breakout-v0'
env = gym.make(env_name)
num_episodes = 100
input_data = tf.placeholder(tf.float32,(None,)+env.observation_space.shape)
output_labels = tf.placeholder(tf.float32,(None,env.action_space.n))

def convnet(data):
    layer1 = tf.layers.conv2d(data,32,5,activation=tf.nn.relu)
    layer1_dropout = tf.nn.dropout(layer1,0.8)
    layer2 = tf.layers.conv2d(layer1_dropout,64,5,activation=tf.nn.relu)
    layer2_dropout = tf.nn.dropout(layer2,0.8)
    layer3 = tf.layers.conv2d(layer2_dropout,128,5,activation=tf.nn.relu)
    layer3_dropout = tf.nn.dropout(layer3,0.8)
    layer4 = tf.layers.dense(layer3_dropout,units=128,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    layer5 = tf.layers.flatten(layer4)
    layer5_dropout = tf.nn.dropout(layer5,0.8)
    layer6 = tf.layers.dense(layer5_dropout,units=env.action_space.n,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    return layer6

logits = convnet(input_data)
loss = tf.losses.sigmoid_cross_entropy(output_labels,logits)
train = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
discount_factor = 0.5

with tf.Session() as sess:
    sess.run(init)
    for episode in range(num_episodes):
        x = []
        y = []
        state = env.reset()
        feed = {input_data:np.array([state])}
        print('episode:', episode+1)
        while True:
            x.append(state)
            if (episode+1)/num_episodes > np.random.uniform():
                Q = sess.run(logits,feed_dict=feed)[0]
                action = np.argmax(Q)
            else:
                action = env.action_space.sample()
            state,reward,done,info = env.step(action)
            Q = sess.run(logits,feed_dict=feed)[0]
            new_Q = np.zeros(Q.shape)
            new_Q[action] = reward+np.amax(Q)*discount_factor
            y.append(new_Q)
            if done:
                break

        for sample in range(len(x)):
            _,l = sess.run([train,loss],feed_dict={input_data:[x[sample]],output_labels:[y[sample]]})
            print('training loss on sample '+str(sample+1)+': '+str(l))
    saver.save(sess,os.getcwd()+'/'+env_name+'-DQN.ckpt')

The problem is that:

  1. The loss isn't decreasing during training and always stays somewhere around 0.7 or 0.8.
  2. When I test the network on the Breakout environment, even after training it for 1000 episodes, the actions still seem kind of random and it rarely hits the ball.

I already tried different loss functions (softmax cross-entropy and mean squared error), another optimizer (Adam), and increasing the learning rate, but nothing changed.

Can someone tell me how to fix this?

asked Apr 15 '18 by Kay Jersch



2 Answers

Here are some things that stand out that you could look into (in cases like this it's always difficult to tell for sure, without trying, which issue or issues are the most important ones):

  • 100 episodes does not seem like a lot. In the image below, you see learning curves of some variants of Double DQN (slightly more advanced than DQN) on Breakout (source). Training time on the x-axis is measured in millions of frames there, not in episodes. I don't know exactly where 100 episodes would be on that x-axis, but I don't think it would be far in. It may simply not be reasonable to expect any kind of decent performance yet after 100 episodes.

OpenAI Baselines DQN learning curves on Breakout (source: openai.com)

  • It looks like you're using dropout in your networks. I'd recommend getting rid of the dropout. I don't know 100% for sure that it's bad to use dropout in Deep Reinforcement Learning, but 1) it's certainly not common, and 2) intuitively it doesn't seem necessary. Dropout is used to combat overfitting in supervised learning, but overfitting is not really much of a risk in Reinforcement Learning (at least, not if you're just trying to train for a single game at a time like you are here).

  • discount_factor = 0.5 seems extremely low; this is going to make it impossible to propagate long-term rewards back over more than a handful of actions. Something along the lines of discount_factor = 0.99 would be much more common.

  • if (episode+1)/num_episodes > np.random.uniform():, this code looks like it's essentially decaying epsilon from 1.0 - 1 / num_episodes in the first episode to 1.0 - num_episodes / num_episodes = 0.0 in the last episode. With your current num_episodes = 100, this means it's decaying from 0.99 to 0.0 over 100 episodes. That seems to me like it's decaying way too quickly. For reference, in the original DQN paper, epsilon is slowly decayed linearly from 1.0 to 0.1 over 1 million frames, and kept fixed forever after.

  • You're not using Experience Replay, and not using a separate Target Network, as described in the original DQN paper (a minimal sketch of an epsilon schedule and a replay buffer follows after this list). All of the points above are significantly easier to look into and fix, so I'd recommend doing that first. That might already be enough to start seeing some better-than-random performance after learning, but it will likely still perform worse than it would with these two additions.
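
To make the epsilon-decay and experience-replay points above concrete, here is a minimal sketch in plain Python/NumPy. It is only an illustration, not code from the DQN paper or from any particular library: the class names (LinearEpsilonSchedule, ReplayBuffer), the default capacity and decay values, and the usage comments that refer to the question's own sess, logits, input_data and env are all assumptions chosen for this example.

import random
from collections import deque

import numpy as np


class LinearEpsilonSchedule:
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`
    environment steps, then keep it fixed at `end` (similar in spirit to
    the schedule used in the original DQN paper)."""

    def __init__(self, start=1.0, end=0.1, decay_steps=1_000_000):
        self.start = start
        self.end = end
        self.decay_steps = decay_steps

    def value(self, step):
        fraction = min(step / self.decay_steps, 1.0)
        return self.start + fraction * (self.end - self.start)


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)
    transitions, sampled uniformly at random for training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage inside the environment loop; `sess`, `logits`,
# `input_data` and `env` refer to the question's own graph and environment:
#
# schedule, buffer, total_steps = LinearEpsilonSchedule(), ReplayBuffer(), 0
# if np.random.uniform() < schedule.value(total_steps):
#     action = env.action_space.sample()
# else:
#     action = np.argmax(sess.run(logits, feed_dict={input_data: [state]})[0])
# ...
# buffer.add(state, action, reward, next_state, done)
# if len(buffer) >= batch_size:
#     states, actions, rewards, next_states, dones = buffer.sample(batch_size)
#     # compute Q-targets here with a separate, periodically-updated target network

Training on mini-batches sampled from such a buffer, with targets computed by a separate and only periodically updated target network, is what breaks the correlation between consecutive frames that otherwise destabilizes learning.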

answered Oct 07 '22 by Dennis Soemers


First, let us describe the phenomenon in detail. The error function of the neural network can take values between 1.0 (maximum error) and 0.0 (the goal). The idea behind a learning algorithm is to drive the error down to zero, which would mean the agent plays the game perfectly. At the beginning, learning works well and the error decreases, but then the curve flattens out at a certain level. That means the CPU is crunching huge amounts of data and consuming energy, but the error is not decreasing anymore.

The good news is that it has nothing to do with your source code. Your implementation of the Deep Q network is fine; I would even assume your source code looks better than that of the average programmer. The problem has to do with the difficulty of the environment in OpenAI Gym. On easy games like "bring the player to a goal position" the network learns well, while on difficult problems like Montezuma's Revenge the constant-error behavior described above occurs. Overcoming the problem is not as easy as it looks. It is not a matter of fine-tuning a neural network, but of inventing a new way of handling complex games. In the literature, strategies such as hierarchical problem solving, natural language grounding, and domain-specific ontologies are used to overcome this problem.

answered Oct 07 '22 by Manuel Rodriguez