I'm trying to find an optimal policy in an environment with continuous states (dim. = 20) and discrete actions (3 possible actions). There is one particular quirk: under the optimal policy, one action (call it "action 0") should be chosen much more frequently than the other two (~100 times more often; those two actions are riskier).
I've tried Q-learning with NN value-function approximation. The results were rather bad: the NN learns to always choose "action 0". I think that policy gradient methods (on the NN weights) may help, but I don't understand how to use them with discrete actions.
Could you give some advice on what to try (maybe algorithms, papers to read)? What are the state-of-the-art RL algorithms when the state space is continuous and the action space is discrete?
Thanks.
A standard example of this setting is the CartPole problem, also known as the "Inverted Pendulum": a pole is attached to a cart, and since the pole's center of mass is above its pivot point, the system is unstable. Its state space is obviously continuous, since the pole can lean to any degree, while the actions (push the cart left or right) are discrete. The official full description can be found on the OpenAI website. In the continuous-state setting, the value function is represented as a parameterized function of the state, V = V(S, w), and most machine learning algorithms can be applied here.
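As a rough illustration (my own sketch, not part of the original discussion), a parameterized state-value function with a TD(0) update could look like the following; the linear feature map, learning rate, and discount factor are arbitrary placeholder choices:

```python
import numpy as np

# Minimal sketch of V(S, w): a linear value function over hand-chosen
# features, updated with TD(0). Everything here is illustrative.

def features(state):
    # Hypothetical feature map: the raw state plus a bias term.
    return np.append(state, 1.0)

def td0_update(w, state, reward, next_state, alpha=0.01, gamma=0.99):
    phi, phi_next = features(state), features(next_state)
    td_error = reward + gamma * (w @ phi_next) - (w @ phi)
    return w + alpha * td_error * phi   # gradient of V(S, w) w.r.t. w is phi

state_dim = 4                  # e.g. CartPole's 4-dimensional state
w = np.zeros(state_dim + 1)    # weights of V(S, w)
```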
Applying Q-learning in continuous (state and/or action) spaces is not a trivial task. This is especially true when trying to combine Q-learning with a global function approximator such as a NN (I understand you mean the common multilayer perceptron trained with backpropagation). You can read more about this on Rich Sutton's page. A better (or at least easier) solution is to use local approximators, such as Radial Basis Function networks (there is a good explanation of why in Section 4.1 of this paper).
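To make the idea of a local approximator concrete, here is a rough sketch (my own illustration, not taken from the linked paper) of Gaussian RBF features over the state with a separate linear Q head per discrete action; the number of centers, their placement, and the kernel width are arbitrary assumptions:

```python
import numpy as np

# Sketch of an RBF-network-style Q approximator for continuous states
# and discrete actions. Centers, width, and state range are assumed.

def rbf_features(state, centers, width=1.0):
    # One Gaussian bump per center; the activation decays with distance
    # from the center, which is what makes the approximator "local".
    dists = np.linalg.norm(centers - state, axis=1)
    return np.exp(-(dists ** 2) / (2.0 * width ** 2))

rng = np.random.default_rng(0)
state_dim, n_actions, n_centers = 20, 3, 200              # dims from the question
centers = rng.uniform(-1.0, 1.0, (n_centers, state_dim))  # assumed state range
W = np.zeros((n_actions, n_centers))                      # one weight vector per action

def q_values(state):
    return W @ rbf_features(state, centers)               # Q(s, a) for every action a
```

Note that with a 20-dimensional state you need many centers to cover the space, which is exactly the caveat raised in the next paragraph.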
On the other hand, the dimensionality of your state space may be too high to use local approximators. Thus, my recommendation is to use other algorithms instead of Q-learning. A very competitive algorithm for continuous states and discrete actions is Fitted Q Iteration, which is usually combined with tree-based methods to approximate the Q-function.
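As a hedged sketch of what such a loop could look like on a batch of transitions (s, a, r, s'), here is Fitted Q Iteration in the spirit of Ernst et al., using scikit-learn's extremely randomized trees as the regressor; the number of iterations, trees, and the discount factor are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Sketch of batch Fitted Q Iteration: repeatedly regress the Bellman
# target r + gamma * max_a' Q(s', a') onto (state, action) pairs.

def fitted_q_iteration(S, A, R, S_next, n_actions, n_iters=50, gamma=0.99):
    X = np.hstack([S, A.reshape(-1, 1)])    # regressor input: (state, action)
    y = R.copy()                            # first iteration: Q = immediate reward
    model = None
    for _ in range(n_iters):
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
        # Evaluate the current Q estimate at s' for every discrete action.
        q_next = np.column_stack([
            model.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
            for a in range(n_actions)
        ])
        y = R + gamma * q_next.max(axis=1)  # Bellman backup
    return model
```

The greedy policy is then the argmax over the three actions of the returned model's prediction for (state, action).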
Finally, a common practice when the number of actions is low, as in your case, is to use an independent approximator for each action: instead of a single approximator that takes the state-action pair as input and returns a Q-value, use three approximators, one per action, each taking only the state as input. You can find an example of this in Example 3.1 of the book Reinforcement Learning and Dynamic Programming Using Function Approximators.
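For completeness, here is a rough sketch of that per-action layout (my illustration, not the book's code): each regressor maps the state alone to the Q-value of its own action, and the greedy policy takes the argmax over the three models. The tree regressor is just an assumed placeholder; any supervised regressor would do:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# One independent approximator per discrete action, each taking only
# the state as input (placeholder regressor; any model would do).

n_actions = 3
models = [ExtraTreesRegressor(n_estimators=50) for _ in range(n_actions)]

def fit_per_action(S, A, targets):
    # Train each model only on the transitions where its action was taken.
    for a, model in enumerate(models):
        mask = (A == a)
        if mask.any():
            model.fit(S[mask], targets[mask])

def q_values(state):
    s = np.asarray(state).reshape(1, -1)
    return np.array([model.predict(s)[0] for model in models])

def greedy_action(state):
    return int(np.argmax(q_values(state)))
```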