Edit: The following also seems to be the case for FrozenLake-v0. Please note that I'm not interested in simple Q-learning, since I want to see solutions that work with continuous observation spaces.
I recently created the banana_gym OpenAI Gym environment. The scenario is the following:
You have a banana. It has to be sold within 2 days, because it will be bad on the 3rd day. You may choose the price x, but the banana will only be sold with a probability of p(x) = (1 + exp(1)) / (1 + exp(x + 1)).
If the banana is sold, the reward is x - 1; if it is not sold within the 2 days, the reward is -1. (Intuition: you paid 1 Euro for the banana.) Hence the environment is non-deterministic (stochastic).
Actions: You may set the price to anything in {0.00, 0.10, 0.20, ..., 2.00}
Observations: The remaining time (source)
I calculated the optimal policy:
Opt at step 1: price 1.50 has value -0.26 (chance: 0.28)
Opt at step 2: price 1.10 has value -0.55 (chance: 0.41)
which also matches my intuition: First try to sell the banana at a higher price because you know you have another try if you don't sell it. Then reduce the price to something above 0.00.
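Those two values can be checked by hand with one backward-induction step each, using the sell probability p(x) = (1 + exp(1)) / (1 + exp(x + 1)) from get_chance below. A quick sanity-check sketch:

import math

def p(x):
    return (1 + math.exp(1)) / (1 + math.exp(x + 1))

v_not_sold = -1.0  # banana goes bad, we lose the 1 Euro we paid
v_step2 = p(1.10) * (1.10 - 1) + (1 - p(1.10)) * v_not_sold  # ~ -0.55
v_step1 = p(1.50) * (1.50 - 1) + (1 - p(1.50)) * v_step2     # ~ -0.26
print(v_step1, v_step2)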
I'm pretty sure this one is correct, but for the sake of completeness, here is the script that computes it:
#!/usr/bin/env python

"""Calculate the optimal banana pricing policy."""

import math

import numpy as np


def main(total_time_steps, price_not_sold, chance_to_sell):
    """
    Compare the optimal policy to a given policy.

    Parameters
    ----------
    total_time_steps : int
        How often the agent may offer the banana
    price_not_sold : float
        How much do we have to pay if we don't sell until
        total_time_steps is over?
    chance_to_sell : function
        A function that takes the price as an input and outputs the
        probability that a banana will be sold.
    """
    r = get_optimal_policy(total_time_steps,
                           price_not_sold,
                           chance_to_sell)
    enum_obj = enumerate(zip(r['optimal_prices'], r['values']), start=1)
    for i, (price, value) in enum_obj:
        print("Opt at step {:>2}: price {:>4.2f} has value {:>4.2f} "
              "(chance: {:>4.2f})"
              .format(i, price, value, chance_to_sell(price)))


def get_optimal_policy(total_time_steps,
                       price_not_sold,
                       chance_to_sell=None):
    """
    Get the optimal policy for the Banana environment.

    This means for each time step, calculate what is the smartest price
    to set.

    Parameters
    ----------
    total_time_steps : int
    price_not_sold : float
    chance_to_sell : function, optional

    Returns
    -------
    results : dict
        'optimal_prices' : List of best prices to set at a given time
        'values' : values of the value function at a given step with the
                   optimal policy
    """
    if chance_to_sell is None:
        chance_to_sell = get_chance
    values = [None for i in range(total_time_steps + 1)]
    optimal_prices = [None for i in range(total_time_steps)]

    # punishment if a banana is not sold
    values[total_time_steps] = (price_not_sold - 1)

    for i in range(total_time_steps - 1, -1, -1):
        opt_price = None
        opt_price_value = None
        for price in np.arange(0.0, 2.01, 0.10):
            p_t = chance_to_sell(price)
            reward_sold = (price - 1)
            value = p_t * reward_sold + (1 - p_t) * values[i + 1]
            if (opt_price_value is None) or (opt_price_value < value):
                opt_price_value = value
                opt_price = price
        values[i] = opt_price_value
        optimal_prices[i] = opt_price
    return {'optimal_prices': optimal_prices,
            'values': values}


def get_chance(x):
    """
    Get probability that a banana will be sold at a given price x.

    Parameters
    ----------
    x : float

    Returns
    -------
    chance_to_sell : float
    """
    return (1 + math.exp(1)) / (1. + math.exp(x + 1))


if __name__ == '__main__':
    total_time_steps = 2
    main(total_time_steps=total_time_steps,
         price_not_sold=0.0,
         chance_to_sell=get_chance)
The following DQN agent (implemented with Keras-RL) works for the CartPole-v0
environment, but learns the policy
1: Take action 19 (price= 1.90)
0: Take action 14 (price= 1.40)
for the Banana environment. It goes in the right direction, but it consistently converges to that strategy rather than the optimal one.
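For reference, plugging those learned prices into the same expected-return formula (a rough check, assuming remaining time 1 corresponds to the first offer and 0 to the last) gives roughly -0.28, compared to -0.26 for the optimal policy:

import math

def p(x):
    return (1 + math.exp(1)) / (1 + math.exp(x + 1))

v_last = p(1.40) * (1.40 - 1) + (1 - p(1.40)) * (-1.0)   # second offer at 1.40
v_first = p(1.90) * (1.90 - 1) + (1 - p(1.90)) * v_last  # first offer at 1.90
print(v_first)  # ~ -0.28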
Why does the DQN agent not learn the optimal strategy?
Execute with:
$ python dqn.py --env Banana-v0 --steps 50000
Code for dqn.py:
#!/usr/bin/env python

import numpy as np
import gym
import gym_banana

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import EpisodeParameterMemory


def main(env_name, nb_steps):
    # Get the environment and extract the number of actions.
    env = gym.make(env_name)
    np.random.seed(123)
    env.seed(123)
    nb_actions = env.action_space.n
    input_shape = (1,) + env.observation_space.shape
    model = create_nn_model(input_shape, nb_actions)

    # Finally, we configure and compile our agent.
    memory = EpisodeParameterMemory(limit=2000, window_length=1)

    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1.,
                                  value_min=.1, value_test=.05,
                                  nb_steps=1000000)
    agent = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
                     memory=memory, nb_steps_warmup=50000,
                     gamma=.99, target_model_update=10000,
                     train_interval=4, delta_clip=1.)
    agent.compile(Adam(lr=.00025), metrics=['mae'])
    agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=1)

    # Get the learned policy and print it
    policy = get_policy(agent, env)
    for remaining_time, action in sorted(policy.items(), reverse=True):
        print("{:>2}: Take action {:>2} (price={:>5.2f})"
              .format(remaining_time, action, 2 / 20. * action))


def create_nn_model(input_shape, nb_actions):
    """
    Create a neural network model which maps the input to actions.

    Parameters
    ----------
    input_shape : tuple of int
    nb_actions : int

    Returns
    -------
    model : keras Model object
    """
    model = Sequential()
    model.add(Flatten(input_shape=input_shape))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(nb_actions, activation='linear'))  # important to be linear
    print(model.summary())
    return model


def get_policy(agent, env):
    policy = {}
    for x_in in range(env.TOTAL_TIME_STEPS):
        action = agent.forward(np.array([x_in]))
        policy[x_in] = action
    return policy


def get_parser():
    """Get the argument parser for dqn.py."""
    from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
    parser = ArgumentParser(description=__doc__,
                            formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument("--env",
                        dest="environment",
                        help="OpenAI Gym environment",
                        metavar="ENVIRONMENT",
                        default="CartPole-v0")
    parser.add_argument("--steps",
                        dest="steps",
                        default=10000,
                        type=int,
                        help="number of training steps")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args()
    main(args.environment, args.steps)
If I interpret your code correctly, it looks to me like you're using 50K training steps:
$ python dqn.py --env Banana-v0 --steps 50000
But you also have a warmup period of 50K steps, set via the following argument in the DQNAgent constructor:
nb_steps_warmup=50000
I believe this means you're not actually doing any training at all, since the warmup period is only used to collect experience in the replay buffer. Is that correct? If so, the solution would probably be as simple as reducing the number of warmup steps or increasing the number of training steps.
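For example, something like this (just a sketch; the value 1000 is an arbitrary starting point, not a tuned setting):

agent = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
                 memory=memory, nb_steps_warmup=1000,  # was 50000
                 gamma=.99, target_model_update=10000,
                 train_interval=4, delta_clip=1.)

With --steps 50000 the agent would then actually train for roughly 49K steps instead of zero.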
For future reference (or in case I'm mistaken in my interpretation of your code above), I'd recommend always creating a learning-curve plot (episode rewards on the y-axis, training steps on the x-axis). It's always useful for getting an idea of what's happening and helps you focus your debugging on the important parts of your code. If the rewards aren't increasing at all, you know the agent isn't learning for whatever reason. If they increase for a while but then plateau, you can for example try reducing the learning rate. If they keep increasing all the way to the end, it probably simply hasn't converged yet, and you can try increasing the number of training steps or the learning rate.
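With keras-rl such a plot only takes a few lines. As far as I remember, fit() returns a Keras History object with one 'episode_reward' and 'nb_steps' entry per episode; if your version doesn't log 'nb_steps', just plot the rewards against the episode index instead:

import matplotlib.pyplot as plt

# Capture the return value of the fit() call you already have.
history = agent.fit(env, nb_steps=nb_steps, visualize=False, verbose=1)

plt.plot(history.history['nb_steps'], history.history['episode_reward'])
plt.xlabel('training steps')
plt.ylabel('episode reward')
plt.savefig('learning_curve.png')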