I've been trying to build a model using 'Deep Q-Learning' where I have a large number of actions (2908). After some limited success with standard DQN (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), I decided to do some more research, because I figured the action space was too large for effective exploration.
I then discovered this paper: https://arxiv.org/pdf/1512.07679.pdf, where they use an actor-critic model and policy gradients, which then led me to https://arxiv.org/pdf/1602.01783.pdf, where they use policy gradients to get much better results than DQN overall.
I've found a few sites where policy gradients have been implemented in Keras, https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html and https://oshearesearch.com/index.php/2016/06/14/kerlym-a-deep-reinforcement-learning-toolbox-in-keras/, but I'm confused about how they are implemented. In the former (and in the papers I read), it seems that instead of providing an input-output pair for the actor network, you provide the gradients for all the weights and then use those to update the network, whereas in the latter they just calculate an input-output pair.
Have I just confused myself? Am I supposed to train the network by providing an input-output pair and using the standard 'fit', or do I have to do something special? If it's the latter, how do I do it with the Theano backend? (The examples above use TensorFlow.)
Policy gradient methods are a class of reinforcement learning techniques that optimize a parametrized policy directly with respect to the expected return (the long-term cumulative reward), by gradient ascent on that objective.
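For reference, the gradient these methods ascend is usually written, for a parametrized policy $\pi_\theta$ and an advantage estimate $A(s,a)$, as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(a \mid s)\, A(s,a) \,\right]$$

which is exactly the 'log(action_prob) * advantage' quantity mentioned further down.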
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning technique that combines Q-learning and policy gradients. Being an actor-critic technique, DDPG consists of two models: an actor and a critic.
Deep Q-learning is a value-based method, while Policy Gradient is a policy-based method. A policy-gradient method can learn a stochastic policy (one that outputs a probability for every action), which is useful for handling the exploration/exploitation trade-off. Often π is simpler to represent than V or Q.
Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
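To make the two models concrete, here is a minimal sketch of an actor and a critic in Keras (assuming the Keras 2 functional API; the layer sizes and state/action dimensions are just illustrative, and this is not a complete DDPG setup: no target networks, replay buffer or training loop):

from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state_dim, action_dim = 8, 2   # hypothetical sizes, substitute your own

# Actor: maps a state directly to an action (a deterministic policy).
state_in = Input(shape=(state_dim,))
x = Dense(64, activation='relu')(state_in)
action_out = Dense(action_dim, activation='tanh')(x)
actor = Model(state_in, action_out)

# Critic: maps a (state, action) pair to a scalar Q-value estimate.
c_state_in = Input(shape=(state_dim,))
c_action_in = Input(shape=(action_dim,))
x = Concatenate()([c_state_in, c_action_in])
x = Dense(64, activation='relu')(x)
q_out = Dense(1, activation='linear')(x)
critic = Model([c_state_in, c_action_in], q_out)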
The agent needs a policy, which is basically a function that maps a state to a probability distribution over the actions. The agent then chooses an action according to its policy.
i.e., policy = f(state)
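For example, with the 2908 discrete actions from the question and a made-up state size, the policy could be a simple softmax network (layer sizes here are just illustrative):

from keras.layers import Input, Dense
from keras.models import Model

state_dim, n_actions = 100, 2908   # state_dim is a placeholder

state_in = Input(shape=(state_dim,))
h = Dense(256, activation='relu')(state_in)
action_probs = Dense(n_actions, activation='softmax')(h)
policy = Model(state_in, action_probs)

# policy.predict(state) returns one probability per action;
# sample from that distribution to pick the action to take.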
Policy Gradient does not have a loss function in the usual supervised sense. Instead, it tries to maximize the expected return of rewards. To do that, we need to compute the gradients of log(action_prob) * advantage, where I'm assuming the advantage is something like the observed (discounted) return, possibly minus a baseline.
We need two functions: one that chooses an action from the current policy, and one that updates the policy weights using those gradients (see the sketch at the end of this answer).
You already know it's not easy to implement this like a typical classification problem, where you can just model.compile(...) -> model.fit(X, y). However, in order to fully utilize Keras, you should be comfortable with defining custom loss functions and gradients. This is basically the approach the author of the former post took.
You should read more of the documentation for the Keras functional API and keras.backend.
Plus, there are many kinds of policy gradient methods.
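As a rough sketch of those two functions, built only from keras.backend so it works with either the Theano or the TensorFlow backend (the network, layer sizes and hyperparameters here are just illustrative, and the exact get_updates signature depends on your Keras version):

import numpy as np
import keras.backend as K
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

state_dim, n_actions = 100, 2908   # placeholder sizes, as in the sketch above

# The same kind of softmax policy network as above.
state_in = Input(shape=(state_dim,))
h = Dense(256, activation='relu')(state_in)
probs_out = Dense(n_actions, activation='softmax')(h)
policy = Model(state_in, probs_out)

# Placeholders for the one-hot actions actually taken and their returns/advantages.
action_ph = K.placeholder(shape=(None, n_actions))
advantage_ph = K.placeholder(shape=(None,))

# loss = -E[ log pi(a|s) * advantage ]; minimizing this follows the policy gradient.
log_prob = K.log(K.sum(policy.output * action_ph, axis=1) + 1e-8)
loss = -K.mean(log_prob * advantage_ph)

optimizer = Adam(lr=1e-3)
# Note: get_updates' argument names/order differ across Keras versions.
updates = optimizer.get_updates(params=policy.trainable_weights, loss=loss)

# Function 1: one policy-gradient training step.
train_step = K.function([policy.input, action_ph, advantage_ph], [loss], updates=updates)

# Function 2: choose an action by sampling from the current policy.
def choose_action(state):
    p = policy.predict(state[None, :])[0]
    return np.random.choice(len(p), p=p)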
The seemingly conflicting implementations you encountered are both valid. They are two equivalent ways to implement policy gradients.
In the vanilla implementation, you calculate the gradients of the policy network's log-probability output, weight them by the observed rewards, and directly update the network weights in the direction of those gradients. This would require you to do the steps described by Mo K.
The second option is arguably a more convenient implementation for autodiff frameworks like Keras/TensorFlow. The idea is to set up an input-output (state-action) mapping as in supervised learning, but with a loss function whose gradient is identical to the policy gradient. For a softmax policy, this simply means predicting the 'true' action (the one actually taken) and multiplying the (cross-entropy) loss by the observed returns/advantage. Aleksis Pirinen has some useful notes about this [1].
The modified loss function for option 2 in Keras looks like this:
import keras.backend as K

def policy_gradient_loss(Returns):
    def modified_crossentropy(action, action_probs):
        # Per-sample cross-entropy, scaled by the observed returns/advantage.
        cost = K.categorical_crossentropy(action, action_probs, from_logits=False, axis=1) * Returns
        return K.mean(cost)
    return modified_crossentropy
where 'action' is the true action taken in the episode (y_true) and 'action_probs' is the predicted probability distribution (y_pred). This is based on another Stack Overflow question [2].
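A rough usage sketch for that loss (the network and sizes here are just illustrative): the common trick is to feed the returns in as an extra model input, so the value the loss closes over can change every batch.

import keras.backend as K
from keras.layers import Input, Dense
from keras.models import Model

state_dim, n_actions = 100, 2908          # placeholder sizes

states_in = Input(shape=(state_dim,))
returns_in = Input(shape=(1,))            # observed returns/advantages per sample
h = Dense(256, activation='relu')(states_in)
action_probs = Dense(n_actions, activation='softmax')(h)

# returns_in is only used inside the loss; some Keras versions warn about the unused input.
model = Model([states_in, returns_in], action_probs)
# Flatten Returns so it multiplies the per-sample cross-entropy elementwise.
model.compile(optimizer='adam', loss=policy_gradient_loss(K.flatten(returns_in)))

# After an episode: states (N, state_dim), one-hot actions (N, n_actions), returns (N, 1)
# model.fit([states, returns], actions_onehot, epochs=1, verbose=0)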
References