Understanding the total_timesteps parameter in stable-baselines' models

I'm reading through the original PPO paper and trying to match this up to the input parameters of the stable-baselines PPO2 model.

One thing I do not understand is the total_timesteps parameter in the learn method.
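
For reference, here is roughly how I'm calling it (the environment and the hyperparameter values below are just placeholders):

    import gym
    from stable_baselines import PPO2
    from stable_baselines.common.vec_env import DummyVecEnv

    # Wrap a single Gym environment so PPO2 can treat it as a vectorised env
    env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

    model = PPO2("MlpPolicy", env, n_steps=128, verbose=1)
    model.learn(total_timesteps=25000)  # <-- the parameter I'm asking about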

The paper mentions

One style of policy gradient implementation... runs the policy for T timesteps (where T is much less than the episode length)

While the stable-baselines documentation describes the total_timesteps parameter as

(int) The total number of samples to train on

Therefore I would think that T in the paper and total_timesteps in the documentation are the same parameter.

What I do not understand is the following:

  • Does total_timesteps always need to be less than or equal to the total number of available "frames" (samples) in an environment (say if I had a finite number of frames like 1,000,000). If so, why?

  • By setting total_timesteps to a number less than the number of available frames, what portion of the training data does the agent see? For example, if total_timesteps=1000, does the agent only ever see the first 1000 frames?

  • Is an episode defined as the total number of available frames, or is it defined as when the agent first "loses" / "dies"? If the latter, how can you know in advance when the agent will die, so as to be able to set total_timesteps to a lesser value?

I'm still learning the terminology behind RL, so I hope I've been able to explain my question clearly above. Any help / tips would be very much welcomed.

asked Jun 21 '19 by PyRsquared


1 Answer

According to the stable-baselines source code

  • total_timesteps is the total number of steps the agent will take in the environment, across however many episodes that requires; the value is not bounded by the episode length.
  • Say each episode of your environment lasts more than 1000 timesteps and you call learn once with total_timesteps=1000: you would only experience the first 1000 frames, and the rest of the episode is never seen. In many experiments you know how long an episode should last (e.g. CartPole), but for environments of unknown length this becomes harder to reason about. However, if you call learn twice (2000 timesteps in total) and the episode turns out to be 1500 frames long, you would see one full episode plus the first 500 frames of the second.
  • An episode ends when the environment's terminal flag is set to true (in Gym this is often also triggered after a maximum number of timesteps). Many other RL implementations use total_episodes instead, so that you do not have to think about timestep budgets, but the downside is that you could end up running only a single episode if you hit an absorbing state. A short sketch after this list shows how total_timesteps spans several episodes.
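
To make the first two points concrete, here is a minimal sketch (assuming a standard Gym CartPole-v1 environment, where untrained episodes typically last only a few dozen steps): a single learn call with total_timesteps=1000 simply rolls over episode boundaries, and the Monitor wrapper lets you see how many episodes were completed inside that budget.

    import gym
    from stable_baselines import PPO2
    from stable_baselines.bench import Monitor
    from stable_baselines.common.vec_env import DummyVecEnv

    # Monitor records the length of every completed episode
    env = Monitor(gym.make("CartPole-v1"), filename=None)
    vec_env = DummyVecEnv([lambda: env])

    model = PPO2("MlpPolicy", vec_env, n_steps=128)
    model.learn(total_timesteps=1000)

    # Several short episodes fit inside the 1000-step budget
    print(env.get_episode_lengths())        # e.g. [14, 23, 17, ...]
    print(sum(env.get_episode_lengths()))   # at most 1000; the unfinished last episode is not counted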

The total_timesteps argument also interacts with n_steps: the number of policy updates is computed as

n_updates = total_timesteps // self.n_batch

where n_batch is n_steps times the number of vectorised environments.

This means that if you had 1 environment running with n_steps set to 32 and total_timesteps=25000, you would do 781 updates to your policy during the learn call (not counting epochs, since PPO can perform several gradient updates on a single batch).
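
To make that arithmetic concrete (the values are the ones from the example above):

    n_envs = 1                      # one vectorised environment
    n_steps = 32                    # steps collected per environment per update
    total_timesteps = 25000

    n_batch = n_steps * n_envs      # 32
    n_updates = total_timesteps // n_batch
    print(n_updates)                # 781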

The lesson is:

  • For environments of unknown length, you have to play with this value. One option is to keep a running average of the episode length and use that (see the sketch after this list).
  • Where the episode length is known, set total_timesteps to that length times the number of episodes you would like to train for. In practice the episodes may be shorter, because the agent might not (and probably won't) reach the maximum number of steps every time.
  • TL;DR: play with the value (treat it as a hyperparameter).
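
A rough way to realise the "running average episode length" idea from the first point, purely as a sketch (the rollouts here just take random actions to estimate episode length; any Gym environment with a terminal flag would do):

    import gym
    import numpy as np

    env = gym.make("CartPole-v1")

    # Estimate the average episode length with a few random rollouts
    episode_lengths = []
    for _ in range(20):
        env.reset()
        done, steps = False, 0
        while not done:
            _, _, done, _ = env.step(env.action_space.sample())
            steps += 1
        episode_lengths.append(steps)

    avg_len = int(np.mean(episode_lengths))

    # Train for roughly the number of episodes we want
    desired_episodes = 500
    total_timesteps = avg_len * desired_episodes
    print(avg_len, total_timesteps)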

Hope this helps!

answered Oct 14 '22 by Per Arne Andersen