 

Questions about Q-Learning using Neural Networks

I have implemented Q-Learning as described in,

http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf

To approximate Q(S, A) I use a neural network with the following structure (a sketch appears after the list):

  • Activation: sigmoid
  • Inputs: the state features plus one extra neuron for the action (all inputs scaled to 0-1)
  • Output: a single neuron, the Q-value
  • Hidden layers: N layers of M neurons each
  • Exploration: take a random action when rand() < propExplore
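To make that concrete, here is a minimal sketch of such a network in Python/numpy. The `QNetwork` name, layer sizes, and weight initialisation are placeholders of mine, not anything from the paper; it only illustrates the state-plus-action-in, single-Q-value-out layout described above.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class QNetwork:
        """State features plus one action neuron in, a single Q-value out."""
        def __init__(self, n_state_inputs, hidden_sizes=(8, 8), seed=0):
            rng = np.random.default_rng(seed)
            sizes = [n_state_inputs + 1] + list(hidden_sizes) + [1]  # +1 input for the action
            self.weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
            self.biases = [np.zeros(n) for n in sizes[1:]]

        def forward(self, state, action):
            # state: features scaled to 0-1; action: a scalar also scaled to 0-1
            a = np.append(np.asarray(state, dtype=float), action)
            for w, b in zip(self.weights, self.biases):
                a = sigmoid(a @ w + b)  # sigmoid activation on every layer, as above
            return a[0]  # approximated Q(s, a)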

At each learning iteration I calculate a Q-Target value using the following formula,

QTarget = reward + gamma * max_a' Q(s', a')

then calculate an error using,

error = QTarget - LastQValueReturnedFromNN

and back propagate the error through the neural network.
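As a sketch of what one learning iteration then looks like, reusing the hypothetical `QNetwork` from above (the function name, discount factor, and action list are my own placeholders):

    def learning_step_error(net, state, action, reward, next_state, actions, gamma=0.9):
        """Compute the Q-target and the error that gets backpropagated."""
        # Q-target: immediate reward plus the discounted best Q-value in the next state
        best_next_q = max(net.forward(next_state, a) for a in actions)
        q_target = reward + gamma * best_next_q
        # Error between the target and the network's last estimate for (state, action)
        last_q = net.forward(state, action)
        error = q_target - last_q  # backpropagate this through the network
        return q_target, error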

Q1, Am I on the right track? I have seen some papers that implement an NN with one output neuron for each action.

Q2, My reward function returns a number between -1 and 1. Is it OK to return a value in that range when the activation function is a sigmoid, whose output range is (0, 1)?

Q3, From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training it on XOR it sometimes learns after 2k iterations, and sometimes it won't learn even after 40k-50k iterations.

asked Dec 07 '14 by Hamza Yerlikaya

People also ask

Why do we use a neural network for Q-learning?

A neural network is helpful for approximating the Q-value function in deep Q-learning: the state is taken as the input, and Q-values for all possible actions are generated as the output. This is the core idea behind deep Q-learning networks (DQNs).

What are Q-learning limitations?

One of the main limitations of Q-learning is that it can require a large amount of training data in order to converge to the optimal policy. Additionally, Q-learning can be very slow when learning from scratch in large or continuous action spaces.

How does learning rate affect Q-learning?

One of the parameters used in the Q-value update is the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, so nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.
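A minimal sketch of the tabular update this describes, with `alpha` as the learning rate (the dict-of-dicts table layout and the default values are assumptions for illustration):

    def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
        """Tabular Q-learning update: move Q[s][a] toward the target at rate alpha."""
        best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
        # alpha = 0 leaves Q[s][a] unchanged (nothing is learned);
        # alpha close to 1 moves it almost all the way to the new target.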


1 Answer

Q1. It is more efficient if you put one output neuron per action in the output layer. A single forward pass will then give you all the Q-values for that state. In addition, the neural network will be able to generalize much better.
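For example, a minimal sketch of that layout (again in numpy, with placeholder names and sizes): the state is the only input and the output layer has one neuron per action, so one forward pass returns every Q-value.

    class QNetworkAllActions:
        """State in, one Q-value per action out."""
        def __init__(self, n_state_inputs, n_actions, hidden_sizes=(8, 8), seed=0):
            rng = np.random.default_rng(seed)
            sizes = [n_state_inputs] + list(hidden_sizes) + [n_actions]
            self.weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
            self.biases = [np.zeros(n) for n in sizes[1:]]

        def forward(self, state):
            a = np.asarray(state, dtype=float)
            for w, b in zip(self.weights[:-1], self.biases[:-1]):
                a = 1.0 / (1.0 + np.exp(-(a @ w + b)))   # sigmoid hidden layers
            return a @ self.weights[-1] + self.biases[-1]  # linear outputs, one per action

Picking a greedy action is then just `np.argmax(net.forward(state))`, and only the output for the action actually taken receives an error signal.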

Q2. Sigmoid is typically used for classification. While you can use sigmoid in the hidden layers, I would not use it in the last one. Since your rewards (and therefore your Q-targets) can be negative, a sigmoid output, which lies in (0, 1), can never match them; use a linear output neuron instead.
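A quick illustration of the range mismatch:

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # A sigmoid output lies strictly inside (0, 1), so a target such as -0.5
    # can never be matched; a linear output neuron has no such restriction.
    print(sigmoid(-10.0))  # ~0.0000454
    print(sigmoid(10.0))   # ~0.9999546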

Q3. Well... Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). What they do is solve two important issues. First, they decorrelate the training data by using memory replay; stochastic gradient descent doesn't like it when training data is given in order. Second, they bootstrap using old weights, which reduces non-stationarity.
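A minimal sketch of those two ingredients, assuming Python and the toy `QNetwork` structure used earlier (the names and buffer capacity are placeholders, not DeepMind's implementation):

    import random
    from collections import deque

    class ReplayBuffer:
        """Experience replay: store transitions and sample them out of order,
        which decorrelates the data fed to stochastic gradient descent."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

    def sync_target(online_net, target_net):
        """Copy the online network's weights into a frozen 'target' network.
        Q-targets are computed with these older weights, which keeps the
        regression targets more stationary between syncs."""
        target_net.weights = [w.copy() for w in online_net.weights]
        target_net.biases = [b.copy() for b in online_net.biases]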

answered Oct 17 '22 by Juan Leni