I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no prior experience with this topic, and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source, describing the typical situations in which a certain method should be used? As far as I understand:
More precisely, I think that to choose a learning method a programmer should ask himself the following questions:
But I don't know how these details of the problem affect the choice of a learning method. I hope that some programmer has already had experience with RL methods and can help me better understand their applications.
RL is a perfect fit for problems that require sequential decision-making – that is, a series of decisions that all affect one another. If you are developing an AI program to win at a game, it is not enough for the algorithm to make one good decision; it must make a whole sequence of good decisions.
Reinforcement Learning approaches are used in game optimization and in simulating synthetic environments for game creation. Reinforcement Learning also finds application in self-driving cars, where an agent is trained to optimize trajectories and dynamically plan the most efficient path.
Three broad families of reinforcement learning methods are 1) value-based, 2) policy-based and 3) model-based learning. Agent, state, action, reward, environment, value function and model of the environment are some of the important terms used when describing RL methods.
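For concreteness, here is a minimal sketch of the agent-environment loop that ties these terms together. The toy corridor environment and the random agent below are hypothetical placeholders invented for illustration; they are not from the original answer or from any particular library.

```python
import random

# Minimal sketch of the agent-environment interaction loop:
# the agent observes a state, picks an action, and the environment
# returns the next state and a reward.

class ToyEnvironment:
    """A tiny corridor: states 0..4, reaching state 4 gives a reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    """Picks actions at random; a learning agent would use the reward to improve."""
    def act(self, state):
        return random.choice([-1, +1])

env = ToyEnvironment()
agent = RandomAgent()
state = env.reset()
done = False
while not done:
    action = agent.act(state)                # agent observes state, chooses action
    state, reward, done = env.step(action)   # environment returns next state and reward
```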
Briefly:
Does the agent learn online or offline? This helps you decide between on-line and off-line algorithms. A closely related question is whether it learns on-policy or off-policy (e.g. SARSA is on-policy, Q-learning is off-policy; the first sketch after this list shows the two update rules side by side). On-line methods have more limitations and need more care.
Can we separate the exploring and exploiting phases? These two phases are normally kept in balance. For example, with epsilon-greedy action selection you explore with probability epsilon (a random action) and exploit with probability 1-epsilon (the greedy action); see the second sketch after this list. You can separate the two and let the algorithm explore first (e.g. choosing random actions) and only then exploit, but this is usually possible only when you are learning off-line, probably using a model of the system dynamics, and it normally means collecting a lot of sample data in advance.
Can we perform enough exploration? The level of exploration you can afford depends on the definition of the problem. For example, if you have a simulation model of the problem in memory, then you can explore as much as you want, but real exploration is limited by the resources you have (e.g. energy, time, ...).
Are states and actions continuous? Considering this helps you choose the right approach (algorithm). There are both discrete and continuous algorithms developed for RL, and some of the "continuous" algorithms internally discretize the state or action spaces; the last sketch below shows a simple grid discretization.
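To make the on-policy vs. off-policy contrast in the first point concrete, here is a sketch of the two tabular update rules (SARSA and Q-learning). It assumes `Q` is a table, e.g. a `collections.defaultdict(float)` keyed by `(state, action)` pairs; the function names and the `alpha`/`gamma` defaults are illustrative assumptions, not from the original answer.

```python
# Tabular SARSA (on-policy) vs. Q-learning (off-policy) updates.
# Q is assumed to be a dict-like table mapping (state, action) -> value,
# e.g. collections.defaultdict(float); alpha is the step size, gamma the discount.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap on the action a_next actually chosen by the
    # current behaviour policy (e.g. epsilon-greedy).
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap on the greedy action in s_next, regardless of
    # which action the behaviour policy will actually take next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```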
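The epsilon-greedy balance between exploring and exploiting mentioned in the second point fits in a few lines. As above, `Q` and the list of actions are assumed to exist, and `epsilon` here is the exploration probability.

```python
import random

# Epsilon-greedy action selection: with probability epsilon explore
# (random action), otherwise exploit (greedy action w.r.t. Q).

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```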
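For the last point, one simple way to apply a discrete (tabular) algorithm to a continuous state space is to bin each state dimension onto a fixed grid. The bounds and bin counts below are illustrative assumptions; this is only a sketch of the idea, not a recommendation over proper function-approximation methods.

```python
import numpy as np

# Map a continuous state vector to a tuple of integer bin indices so it
# can index a tabular Q-function. Bounds and bin counts are assumptions.

def discretize(state, lows, highs, bins):
    """Return a tuple of bin indices for a continuous state vector."""
    state = np.clip(state, lows, highs)            # keep state inside the grid
    ratios = (state - lows) / (highs - lows)       # scale each dimension to [0, 1]
    idx = np.round(ratios * (bins - 1)).astype(int)
    return tuple(int(i) for i in idx)

# Example: a 2-D state in [-1, 1] x [0, 10], 10 bins per dimension.
lows, highs = np.array([-1.0, 0.0]), np.array([1.0, 10.0])
bins = np.array([10, 10])
print(discretize(np.array([0.3, 7.2]), lows, highs, bins))   # -> (6, 6)
```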