
Reinforcement learning: why did the performance collapse?

I am trying to train an agent on the ViZDoom platform, in the deadly_corridor scenario, using the A3C algorithm and TensorFlow on a TITAN X GPU server. However, the performance collapsed after a little more than 2 days of training, as you can see in the following picture.

[Image: plot of the agent's performance during training]

There are 6 demons in the corridor, and the agent must kill at least 5 of them to reach the destination and get the vest.

Here is the code of the network:

with tf.variable_scope(scope):
    self.inputs = tf.placeholder(shape=[None, *shape, 1], dtype=tf.float32)
    self.conv_1 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.inputs, num_outputs=32,
                              kernel_size=[8, 8], stride=4, padding='SAME')
    self.conv_2 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.conv_1, num_outputs=64,
                              kernel_size=[4, 4], stride=2, padding='SAME')
    self.conv_3 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.conv_2, num_outputs=64,
                              kernel_size=[3, 3], stride=1, padding='SAME')
    self.fc = slim.fully_connected(slim.flatten(self.conv_3), 512, activation_fn=tf.nn.elu)

    # LSTM
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(cfg.RNN_DIM, state_is_tuple=True)
    c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
    h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
    self.state_init = [c_init, h_init]
    c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
    h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
    self.state_in = (c_in, h_in)
    rnn_in = tf.expand_dims(self.fc, [0])
    step_size = tf.shape(self.inputs)[:1]
    state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
    lstm_outputs, lstm_state = tf.nn.dynamic_rnn(lstm_cell,
                                                 rnn_in,
                                                 initial_state=state_in,
                                                 sequence_length=step_size,
                                                 time_major=False)
    lstm_c, lstm_h = lstm_state
    self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
    rnn_out = tf.reshape(lstm_outputs, [-1, cfg.RNN_DIM])  # [steps, RNN_DIM], follows the LSTM width

    # Output layers for policy and value estimations
    self.policy = slim.fully_connected(rnn_out,
                                       cfg.ACTION_DIM,
                                       activation_fn=tf.nn.softmax,
                                       biases_initializer=None)
    self.value = slim.fully_connected(rnn_out,
                                      1,
                                      activation_fn=None,
                                      biases_initializer=None)
    if scope != 'global' and not play:
        self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
        self.actions_onehot = tf.one_hot(self.actions, cfg.ACTION_DIM, dtype=tf.float32)
        self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
        self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

        self.responsible_outputs = tf.reduce_sum(self.policy * self.actions_onehot, axis=1)

        # Loss functions
        self.policy_loss = -tf.reduce_sum(self.advantages * tf.log(self.responsible_outputs+1e-10))
        self.value_loss = tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value, [-1])))
        self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy+1e-10))

        # Get gradients from local network using local losses
        local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
        value_var, policy_var = local_vars[:-2] + [local_vars[-1]], local_vars[:-2] + [local_vars[-2]]
        self.var_norms = tf.global_norm(local_vars)

        self.value_gradients = tf.gradients(self.value_loss, value_var)
        value_grads, self.grad_norms_value = tf.clip_by_global_norm(self.value_gradients, 40.0)

        self.policy_gradients = tf.gradients(self.policy_loss, policy_var)
        policy_grads, self.grad_norms_policy = tf.clip_by_global_norm(self.policy_gradients, 40.0)

        # Apply local gradients to global network
        global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
        global_vars_value, global_vars_policy = \
            global_vars[:-2] + [global_vars[-1]], global_vars[:-2] + [global_vars[-2]]

        self.apply_grads_value = optimizer.apply_gradients(zip(value_grads, global_vars_value))
        self.apply_grads_policy = optimizer.apply_gradients(zip(policy_grads, global_vars_policy))
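As a sanity check, the three loss terms defined in the graph above can be reproduced on a toy batch with NumPy. This is only a minimal sketch: the function name `a3c_losses` and the toy numbers are assumptions made for illustration, not part of the original training code. It follows the posted formulas term by term, including the `1e-10` inside the logarithms.

```python
import numpy as np


def a3c_losses(policy, values, actions, advantages, target_v, eps=1e-10):
    """Hypothetical helper mirroring policy_loss, value_loss and entropy above."""
    policy = np.asarray(policy, dtype=np.float64)
    n_actions = policy.shape[1]
    # Same role as tf.one_hot(self.actions, cfg.ACTION_DIM)
    actions_onehot = np.eye(n_actions)[np.asarray(actions)]
    # Probability the policy assigned to the action that was actually taken
    responsible_outputs = np.sum(policy * actions_onehot, axis=1)
    policy_loss = -np.sum(np.asarray(advantages) * np.log(responsible_outputs + eps))
    value_loss = np.sum(np.square(np.asarray(target_v) - np.reshape(values, [-1])))
    entropy = -np.sum(policy * np.log(policy + eps))
    return policy_loss, value_loss, entropy


# Toy batch of two steps with two actions (made-up numbers)
policy = np.array([[0.5, 0.5], [0.9, 0.1]])
values = np.array([[1.0], [0.0]])
actions = np.array([0, 1])
advantages = np.array([1.0, -1.0])
target_v = np.array([2.0, 0.0])

policy_loss, value_loss, entropy = a3c_losses(policy, values, actions, advantages, target_v)
print(policy_loss, value_loss, entropy)
```

Note that in the snippet as posted, `self.entropy` is computed but does not appear in either `self.policy_loss` or `self.value_loss`.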

And the optimizer is

optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-5)

And here are some summaries of the gradients and norms

[Image: summaries of the gradients and norms]
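For reading those norm summaries, it may help to recall what `tf.clip_by_global_norm(grads, 40.0)` does: it returns the rescaled gradients together with the global norm measured before clipping, so `grad_norms_value` and `grad_norms_policy` can exceed 40. The NumPy function below is a hypothetical re-implementation for illustration only, with made-up gradient values.

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Illustrative NumPy version of the clipping rule used by tf.clip_by_global_norm."""
    # Global norm: L2 norm over all gradient tensors taken together
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # Gradients are left unchanged when global_norm <= clip_norm
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Two made-up gradient tensors whose joint norm is 50
grads = [np.array([30.0, 40.0]), np.array([0.0])]
clipped, norm_before = clip_by_global_norm(grads, 40.0)
print(norm_before, clipped[0])
```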

I hope someone can help me tackle this problem.

GoingMyWay asked Jun 24 '26 22:06



1 Answer

Personally, I now think the agent's performance may have collapsed because of overestimation of the values. The Double DQN paper discusses this problem; see "Deep Reinforcement Learning with Double Q-Learning".
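To make the idea from that paper concrete, here is a minimal NumPy sketch comparing the standard Q-learning target with the Double Q-learning target. The function names and the toy Q-values are assumptions for illustration. Double Q-learning is formulated for Q-value methods such as DQN, so whether and how it carries over to the A3C critic in the question is left open here.

```python
import numpy as np


def q_learning_target(reward, gamma, q_target_next, done):
    """Standard target: the same network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * np.max(q_target_next, axis=1)


def double_q_target(reward, gamma, q_online_next, q_target_next, done):
    """Double Q-learning target: online net selects the action, target net evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    return reward + gamma * (1.0 - done) * evaluated


# One made-up transition with two actions
reward = np.array([0.0])
done = np.array([0.0])
gamma = 0.9
q_online_next = np.array([[1.0, 2.0]])  # online network prefers action 1
q_target_next = np.array([[3.0, 0.5]])  # target network has a large value on action 0

standard = q_learning_target(reward, gamma, q_target_next, done)
double = double_q_target(reward, gamma, q_online_next, q_target_next, done)
print(standard, double)
```

In this toy case the max-based target picks up the large value on action 0, while the double target evaluates only the action chosen by the online network, so it comes out smaller.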

GoingMyWay answered Jun 29 '26 11:06



