I have implemented Q-Learning as described in
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S,A) I use a neural network structured like the following,
At each learning iteration I calculate a Q-target value using
QTarget = reward + gamma * max_a' Q(nextState, a')
and then calculate an error using
error = QTarget - LastQValueReturnedFromNN
and backpropagate the error through the neural network.
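For concreteness, the update step looks roughly like this sketch (the q_network here is just a toy single-output placeholder, not my actual network, and the constants are made up):

    import numpy as np

    N_ACTIONS = 4   # assumed number of discrete actions
    GAMMA = 0.9     # discount factor

    # Placeholder single-output network: Q(state, action) -> scalar.
    # A real implementation would be the trained network described above.
    def q_network(state, action, weights):
        x = np.concatenate([state, np.eye(N_ACTIONS)[action]])  # state + one-hot action
        return float(np.tanh(x @ weights))

    def td_error(weights, state, action, reward, next_state, done):
        # QTarget = reward + gamma * max_a' Q(s', a'), or just the reward if s' is terminal
        if done:
            q_target = reward
        else:
            q_target = reward + GAMMA * max(
                q_network(next_state, a, weights) for a in range(N_ACTIONS))
        # error = QTarget - LastQValueReturnedFromNN, which is then backpropagated
        return q_target - q_network(state, action, weights)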
Q1. Am I on the right track? I have seen some papers that implement an NN with one output neuron for each action.
Q2. My reward function returns a number between -1 and 1. Is it OK to return rewards in that range when the output activation function is a sigmoid, whose range is (0, 1)?
Q3. From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training it on XOR, sometimes it learns after 2k iterations, and sometimes it won't learn even after 40k-50k iterations.
A neural network is helpful for approximating the Q-value function in deep Q-learning: the state is taken as the input, and Q-values for all possible actions are produced as the output. The basic loop used by deep Q-learning networks (DQNs) is sketched in the example below.
One of the main limitations of Q-learning is that it can require a large amount of training data in order to converge to the optimal policy. Additionally, Q-learning can be very slow when learning from scratch in large or continuous action spaces.
The main parameter in the Q-value update is the learning rate, set between 0 and 1. Setting it to 0 means the Q-values are never updated, so nothing is learned; setting a high value such as 0.9 means learning can happen quickly.
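For illustration, here is a minimal Q-learning loop showing where the learning rate enters. It uses a table instead of a network to stay short, and the env object with reset()/step() is a hypothetical gym-style environment; a DQN replaces the table lookup with the network's prediction:

    import numpy as np

    # env is a hypothetical environment with reset() -> state and
    # step(action) -> (next_state, reward, done), states being integers here.
    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                if np.random.rand() < epsilon:
                    action = np.random.randint(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)
                # alpha = 0 -> never update; alpha near 1 -> fast but noisy learning
                target = reward + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q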
Q1. It is more efficient to put all action neurons in the output layer: a single forward pass gives you the Q-values of every action for that state. In addition, the neural network will be able to generalize much better.
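Roughly like this sketch (NumPy, with random placeholder weights standing in for a trained network):

    import numpy as np

    # Toy Q-network with one output neuron per action.
    rng = np.random.default_rng(0)
    STATE_DIM, HIDDEN, N_ACTIONS = 8, 16, 4
    W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
    W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

    def q_values(state):
        h = 1.0 / (1.0 + np.exp(-(state @ W1)))   # sigmoid hidden layer
        return h @ W2                              # linear output: one Q per action

    state = rng.normal(size=STATE_DIM)
    q = q_values(state)                  # one forward pass -> Q for every action
    greedy_action = int(np.argmax(q))    # act greedily with respect to the Q-values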
Q2. Sigmoid is typically used for classification. You can use sigmoid in the hidden layers, but I would not use it in the output layer: its range is (0, 1), so it cannot represent the negative targets that your rewards in [-1, 1] will produce. A linear output unit is the usual choice for this kind of regression.
Q3. Well... Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). They solve two important issues. First, they decorrelate the training data by using experience replay, because stochastic gradient descent does not cope well with training data that arrives in correlated order. Second, they bootstrap against a periodically updated copy of the old weights (a target network), which reduces the non-stationarity of the targets.
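A rough sketch of those two ingredients, with a tiny linear stand-in for the real network (all names and constants here are placeholders):

    import copy
    import random
    from collections import deque
    import numpy as np

    STATE_DIM, N_ACTIONS = 4, 2
    GAMMA, LR, BATCH, SYNC_EVERY = 0.9, 0.01, 32, 500

    class TinyQNet:
        # Linear stand-in for a real Q-network (one output per action).
        def __init__(self):
            self.W = np.zeros((STATE_DIM, N_ACTIONS))
        def predict(self, s):
            return s @ self.W
        def train_step(self, s, a, target):
            error = target - self.predict(s)[a]      # TD error for the taken action
            self.W[:, a] += LR * error * s           # gradient step on squared error

    online_net = TinyQNet()
    target_net = copy.deepcopy(online_net)           # frozen copy of the "old" weights
    replay = deque(maxlen=10_000)                    # experience replay memory

    def learn(step, s, a, r, s_next, done):
        replay.append((s, a, r, s_next, done))
        if len(replay) < BATCH:
            return
        # a random minibatch breaks the correlation of consecutive transitions
        for (bs, ba, br, bs_next, bdone) in random.sample(replay, BATCH):
            target = br if bdone else br + GAMMA * np.max(target_net.predict(bs_next))
            online_net.train_step(bs, ba, target)    # targets come from the frozen weights
        if step % SYNC_EVERY == 0:
            target_net.W = online_net.W.copy()       # periodically refresh the copy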