I tried to code a Deep Q Network to play Atari games using TensorFlow and OpenAI's Gym. Here's my code:
import tensorflow as tf
import gym
import numpy as np
import os

env_name = 'Breakout-v0'
env = gym.make(env_name)
num_episodes = 100

input_data = tf.placeholder(tf.float32, (None,) + env.observation_space.shape)
output_labels = tf.placeholder(tf.float32, (None, env.action_space.n))

def convnet(data):
    layer1 = tf.layers.conv2d(data, 32, 5, activation=tf.nn.relu)
    layer1_dropout = tf.nn.dropout(layer1, 0.8)
    layer2 = tf.layers.conv2d(layer1_dropout, 64, 5, activation=tf.nn.relu)
    layer2_dropout = tf.nn.dropout(layer2, 0.8)
    layer3 = tf.layers.conv2d(layer2_dropout, 128, 5, activation=tf.nn.relu)
    layer3_dropout = tf.nn.dropout(layer3, 0.8)
    layer4 = tf.layers.dense(layer3_dropout, units=128, activation=tf.nn.softmax, kernel_initializer=tf.zeros_initializer)
    layer5 = tf.layers.flatten(layer4)
    layer5_dropout = tf.nn.dropout(layer5, 0.8)
    layer6 = tf.layers.dense(layer5_dropout, units=env.action_space.n, activation=tf.nn.softmax, kernel_initializer=tf.zeros_initializer)
    return layer6

logits = convnet(input_data)
loss = tf.losses.sigmoid_cross_entropy(output_labels, logits)
train = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
saver = tf.train.Saver()
init = tf.global_variables_initializer()

discount_factor = 0.5

with tf.Session() as sess:
    sess.run(init)
    for episode in range(num_episodes):
        x = []
        y = []
        state = env.reset()
        feed = {input_data: np.array([state])}
        print('episode:', episode + 1)
        while True:
            x.append(state)
            if (episode + 1) / num_episodes > np.random.uniform():
                Q = sess.run(logits, feed_dict=feed)[0]
                action = np.argmax(Q)
            else:
                action = env.action_space.sample()
            state, reward, done, info = env.step(action)
            Q = sess.run(logits, feed_dict=feed)[0]
            new_Q = np.zeros(Q.shape)
            new_Q[action] = reward + np.amax(Q) * discount_factor
            y.append(new_Q)
            if done:
                break
        for sample in range(len(x)):
            _, l = sess.run([train, loss], feed_dict={input_data: [x[sample]], output_labels: [y[sample]]})
            print('training loss on sample ' + str(sample + 1) + ': ' + str(l))
        saver.save(sess, os.getcwd() + '/' + env_name + '-DQN.ckpt')
The problem is that the network doesn't seem to learn: the loss stops going down and the agent never plays noticeably better than random.
I already tried using different loss functions (softmax cross-entropy and mean squared error), using another optimizer (Adam), and increasing the learning rate, but nothing changed.
Can someone tell me how to fix this?
A typical example of model-free reinforcement learning is the Deep Q Network. A core difference between Deep Q-Learning and vanilla Q-Learning is the implementation of the Q-table: Deep Q-Learning replaces the explicit Q-table with a neural network. Rather than mapping a state-action pair to a Q-value, the network maps an input state to one Q-value per action.
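To make that concrete, here is a minimal sketch (not taken from the question's code) of a network that maps a state to one Q-value per action, written in the same TF1-style API the question uses; the state size, action count and layer width are just illustrative placeholders.

import tensorflow as tf

# Minimal sketch of a Q-network: a state goes in, one Q-value per action comes out.
state_size = 4    # placeholder, e.g. the length of a CartPole observation
num_actions = 2   # placeholder, e.g. the number of CartPole actions

state_in = tf.placeholder(tf.float32, (None, state_size))
hidden = tf.layers.dense(state_in, 64, activation=tf.nn.relu)
q_values = tf.layers.dense(hidden, num_actions)  # linear outputs: one Q-value per action

# The greedy policy simply picks the action with the highest predicted Q-value.
greedy_action = tf.argmax(q_values, axis=1)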
Here are some things that stand out that you could look into (in cases like this it's always difficult to tell for sure, without actually trying, which issue or issues matter most):
num_episodes = 100 is very little training. For reference, in the DQN learning curves that OpenAI has published for Atari games (source: openai.com), the x-axis is measured in millions of frames, not in episodes. I don't know exactly where 100 episodes would be on that x-axis, but I don't think it would be far in. It may simply not be reasonable to expect any kind of decent performance yet after 100 episodes.
It looks like you're using dropout in your networks. I'd recommend getting rid of the dropout. I don't know 100% for sure that it's bad to use dropout in Deep Reinforcement Learning, but 1) it's certainly not common, and 2) intuitively it doesn't seem necessary. Dropout is used to combat overfitting in supervised learning, but overfitting is not really much of a risk in Reinforcement Learning (at least, not if you're just trying to train for a single game at a time like you are here).
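For reference, this is what the question's convnet looks like with only the dropout calls removed (everything else deliberately left exactly as it was):

def convnet(data):
    # Same layers as in the question, just without the tf.nn.dropout calls.
    layer1 = tf.layers.conv2d(data, 32, 5, activation=tf.nn.relu)
    layer2 = tf.layers.conv2d(layer1, 64, 5, activation=tf.nn.relu)
    layer3 = tf.layers.conv2d(layer2, 128, 5, activation=tf.nn.relu)
    layer4 = tf.layers.dense(layer3, units=128, activation=tf.nn.softmax, kernel_initializer=tf.zeros_initializer)
    layer5 = tf.layers.flatten(layer4)
    layer6 = tf.layers.dense(layer5, units=env.action_space.n, activation=tf.nn.softmax, kernel_initializer=tf.zeros_initializer)
    return layer6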
discount_factor = 0.5 seems extremely low; this is going to make it impossible to propagate long-term rewards back through more than a handful of actions. Something along the lines of discount_factor = 0.99 would be much more common.
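A quick back-of-the-envelope check shows why: with a discount factor of 0.5, a reward that arrives 10 steps in the future is scaled down by 0.5**10 ≈ 0.001 by the time it is propagated back, so it is essentially invisible, whereas with 0.99 it still retains roughly 90% of its value:

# Effect of the discount factor on a reward that is 10 steps away.
for gamma in (0.5, 0.99):
    print(gamma, gamma ** 10)   # 0.5 -> ~0.001, 0.99 -> ~0.904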
Regarding if (episode+1)/num_episodes > np.random.uniform(): this code looks like it's essentially decaying epsilon from 1.0 - 1/num_episodes in the first episode down to 1.0 - num_episodes/num_episodes = 0.0 in the last episode. With your current num_episodes = 100, this means it's decaying from 0.99 to 0.0 over 100 episodes. That seems to me like it's decaying way too quickly. For reference, in the original DQN paper, epsilon is slowly decayed linearly from 1.0 to 0.1 over 1 million frames, and kept fixed forever after.
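One simple way to get closer to that kind of schedule (a sketch, not the question's code and not the paper's exact implementation) is to decay epsilon linearly as a function of the total number of environment steps taken, rather than as a function of episodes:

import numpy as np

# Sketch of a linear epsilon schedule: decay from 1.0 to 0.1 over the first
# decay_steps environment steps, then keep epsilon fixed at 0.1 afterwards.
eps_start, eps_end, decay_steps = 1.0, 0.1, 1000000

def epsilon(step):
    fraction = min(float(step) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

# Usage inside the interaction loop (total_steps would count every env.step call):
# if np.random.uniform() < epsilon(total_steps):
#     action = env.action_space.sample()  # explore
# else:
#     action = np.argmax(sess.run(logits, feed_dict={input_data: [state]})[0])  # exploit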
You're not using Experience Replay, and not using a separate Target Network, as described in the original DQN paper. All of the points above are significantly easier to look into and fix, so I'd recommend addressing those first. That might already be enough to start seeing some better-than-random performance after learning, but the agent will likely still perform worse than it would with these two additions.
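For completeness, here is a minimal sketch of what an experience-replay buffer could look like (the capacity and batch size are arbitrary placeholders, and this is not the original paper's code): store every transition as the agent plays, then train on small random batches drawn from the buffer instead of on the most recent episode only.

import random
from collections import deque

# Minimal experience-replay sketch: a bounded buffer of (s, a, r, s', done) transitions.
replay_buffer = deque(maxlen=100000)

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    # Sampling random transitions breaks the strong correlation between consecutive frames.
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones

The target network is then simply a second copy of the Q-network that is used to compute the bootstrap targets, with its weights only copied over from the trained network every fixed number of steps.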
First, let us describe the phenomenon in detail. The error function of the neural network can take a value between 1.0 (maximum error) and 0.0 (the goal). The idea behind a learning algorithm is to bring the error down to zero, which would mean the agent plays the game perfectly. At the beginning, learning works well and the error value decreases, but then the curve flattens out at a certain level. That means the CPU keeps crunching through huge amounts of data and consuming energy, but the error value is not decreasing any further.
The good news is that it has nothing to do with your source code. Your implementation of the Deep Q Network is fine; I would even say your source code looks better than that of the average programmer. The problem has to do with the difficulty of the environment in OpenAI Gym: on easy games like “bring the player to a goal position” the network learns well, while on difficult problems like Montezuma's Revenge the constant-error-value problem described above occurs. Overcoming it is not as easy as it looks. It is not a matter of fine-tuning a neural network, but of inventing a new way of handling complex games. In the literature, strategies like hierarchical problem solving, natural-language grounding and domain-specific ontologies are used to overcome the problem.