Edit: The following also seems to be the case for FrozenLake-v0. Please note that I'm not interested in simple Q-learning, since I want to see solutions that work with continuous observation spaces.
I recently created the banana_gym OpenAI Gym environment. The scenario is the following:
You have a banana. It has to be sold within 2 days, because it will be bad on the 3rd day. You may choose the price x, but the banana will only be sold with a probability of p(x) = (1 + exp(1)) / (1 + exp(x + 1)).
If the banana is sold, the reward is x - 1; if it is not sold within the 2 days, the reward is -1. (Intuition: you paid 1 Euro for the banana.) Hence the environment is non-deterministic (stochastic).
Actions: You may set the price to anything in {0.00, 0.10, 0.20, ..., 2.00}
Observations: The remaining time (source)
I calculated the optimal policy:
Opt at step 1: price 1.50 has value -0.26 (chance: 0.28)
Opt at step 2: price 1.10 has value -0.55 (chance: 0.41)
which also matches my intuition: First try to sell the banana at a higher price because you know you have another try if you don't sell it. Then reduce the price to something above 0.00.
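Those two values can be checked by hand with one backward-induction step each, using the sell probability p(x) = (1 + exp(1)) / (1 + exp(x + 1)) from get_chance below. A quick sanity-check sketch:

import math

def p(x):
    return (1 + math.exp(1)) / (1 + math.exp(x + 1))

v_not_sold = -1.0  # banana goes bad, we lose the 1 Euro we paid
v_step2 = p(1.10) * (1.10 - 1) + (1 - p(1.10)) * v_not_sold  # ~ -0.55
v_step1 = p(1.50) * (1.50 - 1) + (1 - p(1.50)) * v_step2     # ~ -0.26
print(v_step1, v_step2)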
I'm pretty sure this one is correct, but for the sake of completeness, here is the script that computes it:
#!/usr/bin/env python

"""Calculate the optimal banana pricing policy."""

import math

import numpy as np


def main(total_time_steps, price_not_sold, chance_to_sell):
    """
    Compare the optimal policy to a given policy.

    Parameters
    ----------
    total_time_steps : int
        How often the agent may offer the banana
    price_not_sold : float
        How much do we have to pay if we don't sell until
        total_time_steps is over?
    chance_to_sell : function
        A function that takes the price as an input and outputs the
        probability that a banana will be sold.
    """
    r = get_optimal_policy(total_time_steps,
                           price_not_sold,
                           chance_to_sell)
    enum_obj = enumerate(zip(r['optimal_prices'], r['values']), start=1)
    for i, (price, value) in enum_obj:
        print("Opt at step {:>2}: price {:>4.2f} has value {:>4.2f} "
              "(chance: {:>4.2f})"
              .format(i, price, value, chance_to_sell(price)))


def get_optimal_policy(total_time_steps,
                       price_not_sold,
                       chance_to_sell=None):
    """
    Get the optimal policy for the Banana environment.

    This means for each time step, calculate what is the smartest price
    to set.

    Parameters
    ----------
    total_time_steps : int
    price_not_sold : float
    chance_to_sell : function, optional

    Returns
    -------
    results : dict
        'optimal_prices' : List of best prices to set at a given time
        'values' : values of the value function at a given step with the
                   optimal policy
    """
    if chance_to_sell is None:
        chance_to_sell = get_chance
    values = [None for i in range(total_time_steps + 1)]
    optimal_prices = [None for i in range(total_time_steps)]

    # punishment if a banana is not sold
    values[total_time_steps] = (price_not_sold - 1)

    for i in range(total_time_steps - 1, -1, -1):
        opt_price = None
        opt_price_value = None
        for price in np.arange(0.0, 2.01, 0.10):
            p_t = chance_to_sell(price)
            reward_sold = (price - 1)
            value = p_t * reward_sold + (1 - p_t) * values[i + 1]
            if (opt_price_value is None) or (opt_price_value < value):
                opt_price_value = value
                opt_price = price
        values[i] = opt_price_value
        optimal_prices[i] = opt_price
    return {'optimal_prices': optimal_prices,
            'values': values}


def get_chance(x):
    """
    Get probability that a banana will be sold at a given price x.

    Parameters
    ----------
    x : float

    Returns
    -------
    chance_to_sell : float
    """
    return (1 + math.exp(1)) / (1. + math.exp(x + 1))


if __name__ == '__main__':
    total_time_steps = 2
    main(total_time_steps=total_time_steps,
         price_not_sold=0.0,
         chance_to_sell=get_chance)
The following DQN agent (implemented with Keras-RL) works for the CartPole-v0
environment, but learns the policy
1: Take action 19 (price= 1.90)
0: Take action 14 (price= 1.40)
for the Banana environment. It goes in the right direction, but it consistently converges to that strategy rather than the optimal one.
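For reference, plugging those learned prices into the same expected-return formula (a rough check, assuming remaining time 1 corresponds to the first offer and 0 to the last) gives roughly -0.28, compared to -0.26 for the optimal policy:

import math

def p(x):
    return (1 + math.exp(1)) / (1 + math.exp(x + 1))

v_last = p(1.40) * (1.40 - 1) + (1 - p(1.40)) * (-1.0)   # second offer at 1.40
v_first = p(1.90) * (1.90 - 1) + (1 - p(1.90)) * v_last  # first offer at 1.90
print(v_first)  # ~ -0.28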
Why does the DQN agent not learn the optimal strategy?
Execute with:
$ python dqn.py --env Banana-v0 --steps 50000
Code for dqn.py:
#!/usr/bin/env python

import numpy as np
import gym
import gym_banana

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import EpisodeParameterMemory


def main(env_name, nb_steps):
    # Get the environment and extract the number of actions.
    env = gym.make(env_name)
    np.random.seed(123)
    env.seed(123)
    nb_actions = env.action_space.n
    input_shape = (1,) + env.observation_space.shape
    model = create_nn_model(input_shape, nb_actions)

    # Finally, we configure and compile our agent.
    memory = EpisodeParameterMemory(limit=2000, window_length=1)

    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1.,
                                  value_min=.1, value_test=.05,
                                  nb_steps=1000000)
    agent = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
                     memory=memory, nb_steps_warmup=50000,
                     gamma=.99, target_model_update=10000,
                     train_interval=4, delta_clip=1.)
    agent.compile(Adam(lr=.00025), metrics=['mae'])
    agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=1)

    # Get the learned policy and print it
    policy = get_policy(agent, env)
    for remaining_time, action in sorted(policy.items(), reverse=True):
        print("{:>2}: Take action {:>2} (price={:>5.2f})"
              .format(remaining_time, action, 2 / 20. * action))


def create_nn_model(input_shape, nb_actions):
    """
    Create a neural network model which maps the input to actions.

    Parameters
    ----------
    input_shape : tuple of int
    nb_actions : int

    Returns
    -------
    model : keras Model object
    """
    model = Sequential()
    model.add(Flatten(input_shape=input_shape))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(nb_actions, activation='linear'))  # important to be linear
    print(model.summary())
    return model


def get_policy(agent, env):
    policy = {}
    for x_in in range(env.TOTAL_TIME_STEPS):
        action = agent.forward(np.array([x_in]))
        policy[x_in] = action
    return policy


def get_parser():
    """Get the argument parser for dqn.py."""
    from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
    parser = ArgumentParser(description=__doc__,
                            formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument("--env",
                        dest="environment",
                        help="OpenAI Gym environment",
                        metavar="ENVIRONMENT",
                        default="CartPole-v0")
    parser.add_argument("--steps",
                        dest="steps",
                        default=10000,
                        type=int,
                        help="number of training steps")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args()
    main(args.environment, args.steps)
If I interpret your code correctly, it looks to me like you're using 50K training steps:
$ python dqn.py --env Banana-v0 --steps 50000
But you also have a warmup period of 50K steps, set via the following argument in the DQNAgent constructor:
nb_steps_warmup=50000
I believe this means you're not actually doing any training at all, since the warmup period is only used to collect experience in the replay buffer. Is that correct? If so, the solution would probably be as simple as reducing the number of warmup steps or increasing the number of training steps.
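For example, something like this (just a sketch; the value 1000 is an arbitrary starting point, not a tuned setting):

agent = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
                 memory=memory, nb_steps_warmup=1000,  # was 50000
                 gamma=.99, target_model_update=10000,
                 train_interval=4, delta_clip=1.)

With --steps 50000 the agent would then actually train for roughly 49K steps instead of zero.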
For future reference (or in case I'm mistaken in my interpretation of your code above), I'd recommend always creating a learning-curve plot (episode rewards on the y-axis, training steps on the x-axis). It's always useful for getting an idea of what's happening and helps you focus your debugging on the important parts of your code. If the rewards aren't increasing at all, you know the agent isn't learning for whatever reason. If they increase for a while but then plateau, you can for example try reducing the learning rate. If they keep increasing all the way to the end, it probably simply hasn't converged yet, and you can try increasing the number of training steps or the learning rate.
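With keras-rl such a plot only takes a few lines. As far as I remember, fit() returns a Keras History object with one 'episode_reward' and 'nb_steps' entry per episode; if your version doesn't log 'nb_steps', just plot the rewards against the episode index instead:

import matplotlib.pyplot as plt

# Capture the return value of the fit() call you already have.
history = agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=1)

plt.plot(history.history['nb_steps'], history.history['episode_reward'])
plt.xlabel('training steps')
plt.ylabel('episode reward')
plt.savefig('learning_curve.png')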