 

Neural Network and Temporal Difference Learning

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.

-Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?

-What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD?

asked Apr 23 '14 by ethnhll




1 Answer

The backward and forward views can be confusing, but when you are dealing with something like a game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview.

Suppose I have a function approximator like a neural network, and that it has two functions, train and predict, for training on a particular output and for predicting the outcome of a state (or the outcome of taking an action in a given state).
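To make this concrete, here is a minimal sketch of what such an approximator might look like in Python. The train/predict interface is the one described above; the particular model (a linear value function updated by one gradient step on squared error, using NumPy) is just an assumption for illustration. It also shows in miniature what gets backpropagated: the error between the current prediction and the training target.

    import numpy as np

    class Approximator:
        """Hypothetical value-function approximator with train/predict methods."""

        def __init__(self, n_features, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(scale=0.1, size=n_features)  # weights
            self.b = 0.0                                      # bias
            self.lr = lr                                      # learning rate

        def predict(self, state):
            # state: a numeric feature vector describing a position
            return float(np.dot(self.w, state) + self.b)

        def train(self, state, target):
            # One gradient step of squared error toward the target.
            # The (prediction - target) error here plays the role of the
            # error a real neural net would backpropagate through its layers.
            state = np.asarray(state, dtype=float)
            error = self.predict(state) - target
            self.w -= self.lr * error * state
            self.b -= self.lr * error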

Suppose I have a trace of play from a game, where I used the predict method to tell me what move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose my states are s_1, s_2, s_3, ..., s_n.

The Monte Carlo approach says that I train my function approximator (e.g. my neural network) on each of the states in the trace using the final score. So, given this trace, you would make calls like:

train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0)

That is, I'm asking every state to predict the final outcome of the trace.
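As a rough sketch (not from the original answer), that Monte Carlo loop might look like this, reusing the hypothetical Approximator interface from above:

    def train_monte_carlo(net, trace, final_outcome):
        # Every state in the trace is trained toward the final outcome
        # of the game (0 for the loss in the example above).
        for state in reversed(trace):
            net.train(state, final_outcome)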

The dynamic programming approach says that I train based on the prediction for the next state. So my training would be something like:

train(s_n, 0)
train(s_n-1, predict(s_n))
...
train(s_1, predict(s_2))

That is, I'm asking the function approximator to predict what the next state predicts, where the last state is trained on the final outcome from the trace.
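A corresponding sketch for the dynamic programming (bootstrapping) version, again assuming the same hypothetical interface: each state is trained toward the prediction for the state that followed it.

    def train_bootstrap(net, trace, final_outcome):
        # Walk the trace backwards: the last state trains toward the real
        # outcome, every earlier state trains toward the prediction for
        # the state that followed it.
        target = final_outcome
        for state in reversed(trace):
            next_target = net.predict(state)  # taken before training this state
            net.train(state, target)
            target = next_target              # target for the previous state in the trace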

TD learning mixes between these two, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:

train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*predict(s_n))
train(s_n-2, 0.25*0 + 0.25*predict(s_n) + 0.5*predict(s_n-1))
...

Now, what I've written here isn't completely correct, because you don't actually re-query the approximator for every earlier state at each step. Instead you just start with a target value (V = 0, the final outcome in our example) and then, after training each state, you update that target for the next (earlier) state using the prediction you just made: V = λ·V + (1-λ)·predict(s_i).
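Put as code, the backward pass with that running update might look like this (again just a sketch on top of the hypothetical Approximator; lam is the λ above):

    def train_td_lambda(net, trace, final_outcome, lam=0.5):
        # Start from the final outcome and blend in each new prediction
        # while walking the trace backwards: V = lam*V + (1 - lam)*predict(s_i).
        V = final_outcome
        for state in reversed(trace):
            prediction = net.predict(state)  # taken before training this state
            net.train(state, V)
            V = lam * V + (1 - lam) * prediction

With lam=1 this reduces to the Monte Carlo loop above, and with lam=0 to the bootstrapping loop.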

This learns much faster than Monte Carlo and dynamic programming methods, because you aren't asking the algorithm to learn such extreme values (ones that ignore the current prediction entirely, or ignore the final outcome entirely).

answered Oct 10 '22 by Nathan S.