
How to make an openai-gym environment start from a specific state, not `env.reset()`?

Today, while trying to implement an RL agent in an openai-gym environment, I noticed that every agent seems to be trained from the very initial state returned by env.reset(), i.e.

import gym

env = gym.make("CartPole-v0")
initial_observation = env.reset()  # <-- Note
done = False

while not done:
    action = env.action_space.sample()  
    next_observation, reward, done, info = env.step(action)

env.close()  # close the environment

So naturally the agent can follow the route env.reset() -(action)-> next_state -(action)-> next_state -(action)-> ... -(action)-> done; this is one episode. But how can an agent start from a specific state, like a middle state, and then take an action from that state? For example, suppose I sample an experience from the replay buffer, i.e. (s, a, r, ns, done): what if I want to train the agent starting directly from the state ns, get an action from a Q-Network, and then roll forward for n steps? Something like this:

import gym

env = gym.make("CartPole-v0")
observation = ns  # start from ns, not env.reset()
done = False

while not done:
    action = DQN(observation)  # pick an action with the Q-Network
    observation, reward, done, info = env.step(action)
    # break after n steps or once done is True

env.close()  # close the environment

But even if I set a variable observation to ns, the env will not be aware of it at all. How can I tell the gym env that I want its initial observation to be ns, so that the agent knows the specific start state and can continue training directly from that observation (i.e. start with the environment in that state)?

asked Sep 08 '19 by Hu Xixi


3 Answers

AFAIK, the current implementations of most OpenAI gym envs (including the CartPole-v0 you used in your question) don't provide any mechanism to initialize the environment in a given state.

However, it shouldn't be too complex to modify the CartPoleEnv.reset() method so that it accepts an optional parameter acting as the initial state.
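
For instance, here is a minimal sketch of that idea (the subclass name and the init_state parameter are my own, not part of gym):

import numpy as np
from gym.envs.classic_control import CartPoleEnv

class SettableCartPoleEnv(CartPoleEnv):
    def reset(self, init_state=None):
        if init_state is None:
            return super().reset()  # default: random state near the origin
        self.state = np.asarray(init_state, dtype=np.float64)
        self.steps_beyond_done = None  # same bookkeeping as the original reset()
        return np.array(self.state)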

answered Sep 27 '22 by Pablo EM


I recommend adapting the following code to your needs; it works well, and I used it in my AlphaZero implementation.

This example is for CartPole but you should be able to adapt it easily to other envs.

from copy import deepcopy

import gym
import numpy as np
from gym.spaces import Discrete


class CartPole:
    def __init__(self, config=None):
        self.env = gym.make("CartPole-v0")
        self.action_space = Discrete(2)
        self.observation_space = self.env.observation_space

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        return obs, rew, done, info

    def set_state(self, state):
        # restore a previously saved snapshot of the whole env
        self.env = deepcopy(state)
        obs = np.array(list(self.env.unwrapped.state))
        return obs

    def get_state(self):
        # snapshot the whole env, not just the observation
        return deepcopy(self.env)

    def render(self):
        self.env.render()

    def close(self):
        self.env.close()
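
For example, a minimal usage sketch of the snapshot/restore pattern above (the random rollout is just an illustration):

env = CartPole()
obs = env.reset()
snapshot = env.get_state()  # deep copy of the entire inner env

# roll forward a few random steps
for _ in range(5):
    obs, rew, done, info = env.step(env.action_space.sample())
    if done:
        break

obs = env.set_state(snapshot)  # rewind to the saved point and continue from there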
answered Sep 27 '22 by Valentin Macé


The reason a direct assignment to env.state does not work is that the environment returned by gym.make is actually a gym.wrappers.TimeLimit object wrapping the underlying env.

To achieve what you intended, you also have to assign the ns value to the unwrapped environment. So, something like this should do the trick:

env.reset()
env.state = env.unwrapped.state = ns
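
For example (the ns values below are made up for illustration):

import gym
import numpy as np

env = gym.make("CartPole-v0")
env.reset()

ns = np.array([0.1, 0.0, -0.05, 0.0])  # [x, x_dot, theta, theta_dot]
env.state = env.unwrapped.state = ns

obs, reward, done, info = env.step(env.action_space.sample())  # steps from ns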

I would suggest you extend the CartPole environment so that the reset method does what you need, and then wrap the environment yourself, e.g.:

import numpy as np
from gym.wrappers import TimeLimit
from gym.envs.classic_control import CartPoleEnv

class ExtendedCartPoleEnv(CartPoleEnv):
    def reset(self):
        # your_very_special_method() stands in for whatever produces your start state
        self.state = your_very_special_method()

        self.steps_beyond_done = None
        return np.array(self.state, dtype=np.float32)

max_episode_steps = 200
env = ExtendedCartPoleEnv()
env = TimeLimit(env, max_episode_steps)

I've just tweaked the original method found here.

You can also extend the original environment so that self.reset takes an argument, but this is not standard. The wrapped environment wouldn't pass the argument through, so you would need to call env.unwrapped.reset directly. This gets ugly, because env.step will then complain that env.reset has not been called, and so on. There are ways to make it work, but again, this diverges from what a regular gym environment is supposed to look like.
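
For completeness, here is a sketch of that non-standard variant (reusing the imports from the snippet above; the class name and state values are mine, and again, this is not recommended):

class ArgCartPoleEnv(CartPoleEnv):
    def reset(self, state=None):
        if state is None:
            return super().reset()
        self.state = np.array(state, dtype=np.float32)
        self.steps_beyond_done = None
        return np.array(self.state, dtype=np.float32)

env = TimeLimit(ArgCartPoleEnv(), max_episode_steps=200)
obs = env.unwrapped.reset(state=[0.1, 0.0, -0.05, 0.0])  # bypasses the wrapper
# env.step() will now complain that env.reset() has not been called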

answered Sep 27 '22 by Arge