 

Neural Network and Temporal Difference Learning

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.

-Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?

-What exactly is getting backpropagated when TD is used with a neural net? That is, where does the error that gets backpropagated come from when using TD?

asked Apr 23 '14 by ethnhll




1 Answer

The backward and forward views can be confusing, but when you are dealing with something like a game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview.

Suppose I have a function approximator like a neural network, and that it has two functions, train and predict, for training on a particular output and for predicting the outcome of a state (or the outcome of taking an action in a given state).
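To make this concrete, here is a minimal sketch of what such an approximator might look like in Python. The train/predict interface is the one described above; the particular model (a linear value function updated by one gradient step on squared error, using NumPy) is just an assumption for illustration. It also shows in miniature what gets backpropagated: the error between the current prediction and the training target.

    import numpy as np

    class Approximator:
        """Hypothetical value-function approximator with train/predict methods."""

        def __init__(self, n_features, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(scale=0.1, size=n_features)  # weights
            self.b = 0.0                                      # bias
            self.lr = lr                                      # learning rate

        def predict(self, state):
            # state: a numeric feature vector describing a position
            return float(np.dot(self.w, state) + self.b)

        def train(self, state, target):
            # One gradient step of squared error toward the target.
            # The (prediction - target) error here plays the role of the
            # error a real neural net would backpropagate through its layers.
            state = np.asarray(state, dtype=float)
            error = self.predict(state) - target
            self.w -= self.lr * error * state
            self.b -= self.lr * error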

Suppose I have a trace of play from a game, where I used the predict method to tell me what move to make at each point, and suppose that I lose at the end of the game (V = 0). Suppose my states are s_1, s_2, s_3, ..., s_n.

The Monte Carlo approach says that I train my function approximator (e.g. my neural network) on each of the states in the trace using the final score. So, given this trace, you would make calls like:

train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0)

That is, I'm asking every state to predict the final outcome of the trace.
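As a rough sketch (not from the original answer), that Monte Carlo loop might look like this, reusing the hypothetical Approximator interface from above:

    def train_monte_carlo(net, trace, final_outcome):
        # Every state in the trace is trained toward the final outcome
        # of the game (0 for the loss in the example above).
        for state in reversed(trace):
            net.train(state, final_outcome)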

The dynamic programming approach says that I train based on the prediction for the next state. So my training would be something like:

train(s_n, 0)
train(s_n-1, predict(s_n))
...
train(s_1, predict(s_2))

That is, I'm asking the function approximator to predict what the next state predicts, where the last state is trained on the final outcome from the trace.
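A corresponding sketch for the dynamic programming (bootstrapping) version, again assuming the same hypothetical interface: each state is trained toward the prediction for the state that followed it.

    def train_bootstrap(net, trace, final_outcome):
        # Walk the trace backwards: the last state trains toward the real
        # outcome, every earlier state trains toward the prediction for
        # the state that followed it.
        target = final_outcome
        for state in reversed(trace):
            next_target = net.predict(state)  # taken before training this state
            net.train(state, target)
            target = next_target              # target for the previous state in the trace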

TD learning mixes between these two, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:

train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*predict(s_n))
train(s_n-2, 0.25*0 + 0.25*predict(s_n) + 0.5*predict(s_n-1))
...

Now, what I've written here isn't completely correct, because you don't actually re-query the approximator for every earlier state at each step. Instead you just start with a target value (V = 0, the final outcome in our example) and then, after training each state, you update that target for the next (earlier) state using the prediction you just made: V = λ·V + (1-λ)·predict(s_i).
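Put as code, the backward pass with that running update might look like this (again just a sketch on top of the hypothetical Approximator; lam is the λ above):

    def train_td_lambda(net, trace, final_outcome, lam=0.5):
        # Start from the final outcome and blend in each new prediction
        # while walking the trace backwards: V = lam*V + (1 - lam)*predict(s_i).
        V = final_outcome
        for state in reversed(trace):
            prediction = net.predict(state)  # taken before training this state
            net.train(state, V)
            V = lam * V + (1 - lam) * prediction

With lam=1 this reduces to the Monte Carlo loop above, and with lam=0 to the bootstrapping loop.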

This learns much faster than Monte Carlo and dynamic programming methods, because you aren't asking the algorithm to learn such extreme values (ones that ignore the current prediction entirely, or ignore the final outcome entirely).

answered Oct 10 '22 by Nathan S.