
Deep Q Network is not learning

I tried to code a Deep Q Network to play Atari games using Tensorflow and OpenAI's Gym. Here's my code:

import tensorflow as tf
import gym
import numpy as np
import os

env_name = 'Breakout-v0'
env = gym.make(env_name)
num_episodes = 100
input_data = tf.placeholder(tf.float32,(None,)+env.observation_space.shape)
output_labels = tf.placeholder(tf.float32,(None,env.action_space.n))

def convnet(data):
    layer1 = tf.layers.conv2d(data,32,5,activation=tf.nn.relu)
    layer1_dropout = tf.nn.dropout(layer1,0.8)
    layer2 = tf.layers.conv2d(layer1_dropout,64,5,activation=tf.nn.relu)
    layer2_dropout = tf.nn.dropout(layer2,0.8)
    layer3 = tf.layers.conv2d(layer2_dropout,128,5,activation=tf.nn.relu)
    layer3_dropout = tf.nn.dropout(layer3,0.8)
    layer4 = tf.layers.dense(layer3_dropout,units=128,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    layer5 = tf.layers.flatten(layer4)
    layer5_dropout = tf.nn.dropout(layer5,0.8)
    layer6 = tf.layers.dense(layer5_dropout,units=env.action_space.n,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    return layer6

logits = convnet(input_data)
loss = tf.losses.sigmoid_cross_entropy(output_labels,logits)
train = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
discount_factor = 0.5

with tf.Session() as sess:
    sess.run(init)
    for episode in range(num_episodes):
        x = []
        y = []
        state = env.reset()
        feed = {input_data:np.array([state])}
        print('episode:', episode+1)
        while True:
            x.append(state)
            if (episode+1)/num_episodes > np.random.uniform():
                Q = sess.run(logits,feed_dict=feed)[0]
                action = np.argmax(Q)
            else:
                action = env.action_space.sample()
            state,reward,done,info = env.step(action)
            Q = sess.run(logits,feed_dict=feed)[0]
            new_Q = np.zeros(Q.shape)
            new_Q[action] = reward+np.amax(Q)*discount_factor
            y.append(new_Q)
            if done:
                break

        for sample in range(len(x)):
            _,l = sess.run([train,loss],feed_dict={input_data:[x[sample]],output_labels:[y[sample]]})
            print('training loss on sample '+str(sample+1)+': '+str(l))
    saver.save(sess,os.getcwd()+'/'+env_name+'-DQN.ckpt')

The problem is that:

  1. The loss isn't decreasing during training and always stays somewhere around 0.7 or 0.8.
  2. When I test the network on the Breakout environment, even after training it for 1000 episodes, the actions still seem kind of random and it rarely hits the ball.

I already tried different loss functions (softmax cross-entropy and mean squared error), another optimizer (Adam), and increasing the learning rate, but nothing changed.

Can someone tell me how to fix this?

asked Apr 15 '18 by Kay Jersch



2 Answers

Here are some things that stand out that you could look into (in cases like this it's always difficult to tell for sure, without trying, which issue or issues are the most important ones):

  • 100 episodes does not seem like a lot. In the image below, you see learning curves of some variants of Double DQN (slightly more advanced than DQN) on Breakout (source). Training time on the x-axis is measured in millions of frames there, not in episodes. I don't know exactly where 100 episodes would be on that x-axis, but I don't think it would be far in. It may simply not be reasonable to expect any kind of decent performance yet after 100 episodes.

OpenAI Baselines DQN learning curves on Breakout (source: openai.com)

  • It looks like you're using dropout in your networks. I'd recommend getting rid of the dropout. I don't know 100% for sure that it's bad to use dropout in Deep Reinforcement Learning, but 1) it's certainly not common, and 2) intuitively it doesn't seem necessary. Dropout is used to combat overfitting in supervised learning, but overfitting is not really much of a risk in Reinforcement Learning (at least, not if you're just trying to train for a single game at a time like you are here).

  • discount_factor = 0.5 seems extremely low; this is going to make it impossible to propagate long-term rewards back over more than a handful of actions. Something along the lines of discount_factor = 0.99 would be much more common.

  • if (episode+1)/num_episodes > np.random.uniform():, this code looks like it's essentially decaying epsilon from 1.0 - 1 / num_episodes in the first episode to 1.0 - num_episodes / num_episodes = 0.0 in the last episode. With your current num_episodes = 100, this means it's decaying from 0.99 to 0.0 over 100 episodes. That seems to me like it's decaying way too quickly. For reference, in the original DQN paper, epsilon is slowly decayed linearly from 1.0 to 0.1 over 1 million frames, and kept fixed forever after.

  • You're not using Experience Replay, and not using a separate Target Network, as described in the original DQN paper (a minimal sketch of an epsilon schedule and a replay buffer follows after this list). All of the points above are significantly easier to look into and fix, so I'd recommend doing that first. That might already be enough to start seeing some better-than-random performance after learning, but it will likely still perform worse than it would with these two additions.
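
To make the epsilon-decay and experience-replay points above concrete, here is a minimal sketch in plain Python/NumPy. It is only an illustration, not code from the DQN paper or from any particular library: the class names (LinearEpsilonSchedule, ReplayBuffer), the default capacity and decay values, and the usage comments that refer to the question's own sess, logits, input_data and env are all assumptions chosen for this example.

import random
from collections import deque

import numpy as np


class LinearEpsilonSchedule:
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`
    environment steps, then keep it fixed at `end` (similar in spirit to
    the schedule used in the original DQN paper)."""

    def __init__(self, start=1.0, end=0.1, decay_steps=1_000_000):
        self.start = start
        self.end = end
        self.decay_steps = decay_steps

    def value(self, step):
        fraction = min(step / self.decay_steps, 1.0)
        return self.start + fraction * (self.end - self.start)


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)
    transitions, sampled uniformly at random for training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage inside the environment loop; `sess`, `logits`,
# `input_data` and `env` refer to the question's own graph and environment:
#
# schedule, buffer, total_steps = LinearEpsilonSchedule(), ReplayBuffer(), 0
# if np.random.uniform() < schedule.value(total_steps):
#     action = env.action_space.sample()
# else:
#     action = np.argmax(sess.run(logits, feed_dict={input_data: [state]})[0])
# ...
# buffer.add(state, action, reward, next_state, done)
# if len(buffer) >= batch_size:
#     states, actions, rewards, next_states, dones = buffer.sample(batch_size)
#     # compute Q-targets here with a separate, periodically-updated target network

Training on mini-batches sampled from such a buffer, with targets computed by a separate and only periodically updated target network, is what breaks the correlation between consecutive frames that otherwise destabilizes learning.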

answered Oct 07 '22 by Dennis Soemers


First, let us describe the phenomenon in detail. The error function of the neural network can take values between 1.0 (maximum error) and 0.0 (the goal). The idea behind a learning algorithm is to drive the error down to zero, which would mean the agent plays the game perfectly. At the beginning, learning works well and the error decreases, but then the curve flattens out at a certain level. That means the CPU is crunching huge amounts of data and consuming energy, but the error is not decreasing anymore.

The good news is that it has nothing to do with your source code. Your implementation of the Deep Q network is fine; I would even assume your source code looks better than that of the average programmer. The problem has to do with the difficulty of the environment in OpenAI Gym. On easy games like "bring the player to a goal position" the network learns well, while on difficult problems like Montezuma's Revenge the constant-error behavior described above occurs. Overcoming the problem is not as easy as it looks. It is not a matter of fine-tuning a neural network, but of inventing a new way of handling complex games. In the literature, strategies such as hierarchical problem solving, natural language grounding, and domain-specific ontologies are used to overcome this problem.

answered Oct 07 '22 by Manuel Rodriguez