I'm studying temporal difference learning from this post. The update rule of TD(0) is clear to me, but in TD(λ) I don't understand how the utility values of all the previous states are updated in a single update.
Here is the diagram given for comparison of both updates:
The above diagram is explained as follows:
In TD(λ) the result is propagated back to all the previous states thanks to the eligibility traces.
My question is: how is the information propagated to all the previous states in a single update if we are using the following update rule with eligibility traces?
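(For reference, I take this to be the standard tabular TD(λ) update with accumulating eligibility traces, which is presumably the same form shown in the post, where 1[s = s_t] is 1 for the state just visited and 0 otherwise:)

delta_t = r_{t+1} + gamma * U(s_{t+1}) - U(s_t)
e_t(s)  = gamma * lambda * e_{t-1}(s) + 1[s = s_t]
U(s)    = U(s) + alpha * delta_t * e_t(s)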
Here, in a single update, we're only updating the utility of a single state, U_t(s), so how do the utilities of all the previous states get updated?
Edit
As per the answer, it is clear that this update is applied at every single step, and that's the reason the information is propagated. If this is the case, then it confuses me again, because the only difference between the update rules is the eligibility trace.
So even if the value of the eligibility trace is non-zero for the previous states, the value of delta will be zero in the above case (because initially the rewards and the utility function are initialized to 0). Then how is it possible for the previous states to get utility values other than zero in the first update?
Also, in the given Python implementation, the following output is given after a single iteration:
[[ 0. 0.04595 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Here only 2 values are updated, instead of all 5 previous states as shown in the figure. What am I missing here?
Monte Carlo methods update the value of a state only after the return is known (G_t is only known at the end of the episode), whereas TD methods only need to wait until the next time step to create an update, using the observed reward R_{t+1} and the estimate V(S_{t+1}).
TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.
The most basic method for TD learning is the TD(0) method. Temporal-difference TD(0) learning updates the estimated value V of a state under a policy based on the reward the agent received and the value of the state it transitioned to.
Note: TD here will be denoted TD(0), which means it looks ahead one step; TD(0) is a particular case of TD(n). Recall that in MC we play a whole episode until the end and then compute the discounted return for each state that appeared in the episode.
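In symbols (standard notation, with step size alpha and discount factor gamma), the TD(0) update is:

V(S_t)  <-  V(S_t) + alpha * [ R_{t+1} + gamma * V(S_{t+1}) - V(S_t) ]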
You are missing a small but important detail: the update rule is applied to all states, not only to the current state. So, in practice, you are updating all the states whose eligibility trace e_t(s) is different from zero.
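Here is a minimal sketch (my own code, not the post's implementation; assuming a flat NumPy array of utilities and accumulating traces) of what a single TD(λ) step looks like. Note that the utility update touches every state at once, each scaled by its trace:

import numpy as np

def td_lambda_step(utility, traces, s, s_next, reward, alpha, gamma, lambda_):
    # TD error, computed from the current transition only
    delta = reward + gamma * utility[s_next] - utility[s]
    # Accumulating trace: bump the trace of the state we just left
    traces[s] += 1.0
    # Update ALL states at once, each scaled by its eligibility trace
    utility += alpha * delta * traces
    # Decay every trace for the next time step
    traces *= gamma * lambda_
    return utility, traces

# Example: 12 states (a 3x4 grid flattened), one hypothetical transition
utility = np.zeros(12)
traces = np.zeros(12)
utility, traces = td_lambda_step(utility, traces, s=0, s_next=1, reward=0.0,
                                 alpha=0.1, gamma=0.999, lambda_=0.5)

Calling this once per transition is what propagates a single non-zero delta back to every previously visited state that still has a non-zero trace.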
Edit
The delta is not zero because it is computed for the current state, when the episode ends and the agent receives a reward of +1. Therefore, after computing a delta different from zero, you update all the states using that delta and the current eligibility traces.
I don't know why the Python implementation (I haven't checked it carefully) updates only 2 values in the output, but please verify that the eligibility traces for all 5 previous states are different from 0, and if that's not the case, try to understand why. Sometimes you are not interested in maintaining traces below a very small threshold (e.g., 10e-5), because they have a very small effect on the learning process and maintaining them wastes computational resources.
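For example, such a trick could look like this (a hypothetical snippet, not something the post's code necessarily does):

import numpy as np

trace_matrix = np.array([[0.0, 0.5, 3e-7],
                         [0.2, 0.0, 1e-9]])
# Zero-out traces below a small threshold so they stop generating updates
trace_matrix[trace_matrix < 1e-5] = 0.0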
So even if the value of the eligibility trace is non-zero for the previous states, the value of delta will be zero in the above case (because initially the rewards and the utility function are initialized to 0). Then how is it possible for the previous states to get utility values other than zero in the first update?
You are right that, in the first update, all rewards and updates will still be 0 (unless we already manage to reach the goal in a single step, in which case the reward won't be 0).
However, the eligibility traces e_t will continue to "remember" or "memorize" all the states that we have previously visited. So, as soon as we do manage to reach the goal state and get a non-zero reward, the eligibility traces will still remember all the states that we went through. Those states will still have non-zero entries in the table of eligibility traces, and will therefore all at once get a non-zero update as soon as you observe your first reward.
The table of eligibility traces does decay every time step (it is multiplied by gamma * lambda_), so the magnitude of the updates to states that were visited a long time ago will be smaller than the magnitude of the updates to states that were visited very recently, but we continue to remember all those states; they will have non-zero entries (under the assumption that gamma > 0 and lambda_ > 0). This allows the values of all visited states to be updated, not as soon as we reach those states, but as soon as we observe a non-zero reward (or, in epochs after the first one, as soon as we reach a state for which we already have a non-zero predicted value) after having visited them at some earlier point in time.
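As a quick illustration (a hypothetical 4-state chain with assumed parameters, not the post's grid world), here is what happens when the only non-zero reward arrives on the last transition:

import numpy as np

alpha, gamma, lambda_ = 0.1, 0.999, 0.5   # assumed values, for illustration only
utility = np.zeros(4)                     # states 0..3, state 3 is the goal
traces = np.zeros(4)

# Visit 0 -> 1 -> 2 -> 3; only the final transition yields reward +1
for s, s_next, reward in [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 1.0)]:
    delta = reward + gamma * utility[s_next] - utility[s]
    traces[s] += 1.0
    utility += alpha * delta * traces
    traces *= gamma * lambda_
    print(utility)

The first two prints are all zeros (delta is 0), but on the last step the non-zero delta combines with the decayed traces, and states 0, 1 and 2 are all updated at once, with states visited longer ago receiving smaller updates.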
Also, in the given Python implementation, the following output is given after a single iteration:
[[ 0. 0.04595 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
Here only 2 values are updated, instead of all 5 previous states as shown in the figure. What am I missing here?
The first part of their code looks as follows:
for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
So, at the start of every new epoch, they reset the environment using the exploring_starts flag. If we look at the implementation of their environment, we see that using this flag means we always start out from a random initial position.
So, I suspect that, when the code was run to generate that output, the initial position was simply randomly selected to be the position two steps to the left of the goal, rather than the position in the bottom-left corner. If the initial position is randomly selected to already be closer to the goal, the agent only ever visits those two states for which you see non-zero updates, so those will also be the only states with non-zero entries in the table of eligibility traces and therefore be the only states with non-zero updates.
If the initial position is indeed the position in the bottom-left corner, a correct implementation of the algorithm would indeed update the values for all the states along that path (assuming no extra tricks are added, like setting entries to 0 if they happen to get "close enough" to 0 due to decaying).
I'd also like to note that there is in fact a mistake in the code on that page: they do not reset all the entries of the table of eligibility traces to 0 when resetting the environment / starting a new epoch. This should be done. If it is not done, the eligibility traces will still memorize states that were visited during previous epochs, and will also still update all of those, even if they were not visited again in the new epoch. This is incorrect. A correct version of their code should start out like this:
for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    trace_matrix = trace_matrix * 0.0  # IMPORTANT, added this
    for step in range(1000):
        ...