I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no prior experience with this topic, and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source, describing the typical situations in which a certain method should be used? As far as I understand:
More precisely, I think that to choose a learning method a programmer should ask himself the following questions:
But I don't know how these details of the problem affect the choice of a learning method. I hope that some programmer has already had experience with RL methods and can help me better understand their applications.
RL is a perfect fit for problems that require sequential decision-making – that is, a series of decisions that all affect one another. If you are developing an AI program to win at a game, it is not enough for the algorithm to make one good decision; it must make a whole sequence of good decisions.
Reinforcement Learning approaches are used in game optimization and in simulating synthetic environments for game creation. Reinforcement Learning also finds application in self-driving cars, where an agent is trained to optimize trajectories and dynamically plan the most efficient path.
Three broad families of reinforcement learning methods are 1) value-based, 2) policy-based and 3) model-based learning. Agent, state, action, reward, environment, value function and model of the environment are some of the important terms used when describing RL methods.
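For concreteness, here is a minimal sketch of the agent-environment loop that ties these terms together. The toy corridor environment and the random agent below are hypothetical placeholders invented for illustration; they are not from the original answer or from any particular library.

```python
import random

# Minimal sketch of the agent-environment interaction loop:
# the agent observes a state, picks an action, and the environment
# returns the next state and a reward.

class ToyEnvironment:
    """A tiny corridor: states 0..4, reaching state 4 gives a reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    """Picks actions at random; a learning agent would use the reward to improve."""
    def act(self, state):
        return random.choice([-1, +1])

env = ToyEnvironment()
agent = RandomAgent()
state = env.reset()
done = False
while not done:
    action = agent.act(state)                # agent observes state, chooses action
    state, reward, done = env.step(action)   # environment returns next state and reward
```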
Briefly:
Does the agent learn online or offline? This helps you decide between on-line and off-line algorithms. A closely related question is whether it learns on-policy or off-policy (e.g. SARSA is on-policy, Q-learning is off-policy; the first sketch after this list shows the two update rules side by side). On-line methods have more limitations and need more care.
Can we separate the exploring and exploiting phases? These two phases are normally kept in balance. For example, with epsilon-greedy action selection you explore with probability epsilon (a random action) and exploit with probability 1-epsilon (the greedy action); see the second sketch after this list. You can separate the two and let the algorithm explore first (e.g. choosing random actions) and only then exploit, but this is usually possible only when you are learning off-line, probably using a model of the system dynamics, and it normally means collecting a lot of sample data in advance.
Can we perform enough exploration? The level of exploration you can afford depends on the definition of the problem. For example, if you have a simulation model of the problem in memory, then you can explore as much as you want, but real exploration is limited by the resources you have (e.g. energy, time, ...).
Are states and actions continuous? Considering this helps you choose the right approach (algorithm). There are both discrete and continuous algorithms developed for RL, and some of the "continuous" algorithms internally discretize the state or action spaces; the last sketch below shows a simple grid discretization.
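To make the on-policy vs. off-policy contrast in the first point concrete, here is a sketch of the two tabular update rules (SARSA and Q-learning). It assumes `Q` is a table, e.g. a `collections.defaultdict(float)` keyed by `(state, action)` pairs; the function names and the `alpha`/`gamma` defaults are illustrative assumptions, not from the original answer.

```python
# Tabular SARSA (on-policy) vs. Q-learning (off-policy) updates.
# Q is assumed to be a dict-like table mapping (state, action) -> value,
# e.g. collections.defaultdict(float); alpha is the step size, gamma the discount.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap on the action a_next actually chosen by the
    # current behaviour policy (e.g. epsilon-greedy).
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap on the greedy action in s_next, regardless of
    # which action the behaviour policy will actually take next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```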
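The epsilon-greedy balance between exploring and exploiting mentioned in the second point fits in a few lines. As above, `Q` and the list of actions are assumed to exist, and `epsilon` here is the exploration probability.

```python
import random

# Epsilon-greedy action selection: with probability epsilon explore
# (random action), otherwise exploit (greedy action w.r.t. Q).

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```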
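For the last point, one simple way to apply a discrete (tabular) algorithm to a continuous state space is to bin each state dimension onto a fixed grid. The bounds and bin counts below are illustrative assumptions; this is only a sketch of the idea, not a recommendation over proper function-approximation methods.

```python
import numpy as np

# Map a continuous state vector to a tuple of integer bin indices so it
# can index a tabular Q-function. Bounds and bin counts are assumptions.

def discretize(state, lows, highs, bins):
    """Return a tuple of bin indices for a continuous state vector."""
    state = np.clip(state, lows, highs)            # keep state inside the grid
    ratios = (state - lows) / (highs - lows)       # scale each dimension to [0, 1]
    idx = np.round(ratios * (bins - 1)).astype(int)
    return tuple(int(i) for i in idx)

# Example: a 2-D state in [-1, 1] x [0, 10], 10 bins per dimension.
lows, highs = np.array([-1.0, 0.0]), np.array([1.0, 10.0])
bins = np.array([10, 10])
print(discretize(np.array([0.3, 7.2]), lows, highs, bins))   # -> (6, 6)
```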