I know that the Q-learning algorithm should balance exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.
My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding on the epsilon value. Should epsilon be determined by the number of times the algorithm has visited a given (state, action) pair, or by the number of iterations performed?
Any suggestions would be much appreciated!
With the ε-greedy action selection policy, the greedy action, i.e., the action with the highest Q-value, is usually selected (with probability 1 − ε); otherwise, the algorithm explores a random action (with probability ε). Because every action retains a nonzero probability of being tried, the policy is guaranteed to eventually discover the optimal actions. An epsilon-greedy algorithm is easy to understand and implement.
Epsilon-greedy is a simple method that balances exploration and exploitation by choosing between them randomly: with probability ε (epsilon) the agent explores, and with probability 1 − ε it exploits, so it exploits most of the time with a small chance of exploring. The epsilon-greedy strategy is an easy way to add exploration to the basic greedy algorithm, and because actions are occasionally sampled at random, the estimated reward values of all actions converge to their true values.
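To make that concrete, here is a minimal sketch of ε-greedy action selection in Python; the Q-value array, the fixed seed, and the four-action example are my own illustrative assumptions, not part of the answers above:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest estimated value

# Hypothetical Q-values for a single state with 4 actions.
q = np.array([0.1, 0.5, 0.2, 0.4])
action = epsilon_greedy(q, epsilon=0.1)           # greedy about 90% of the time
```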
Although in many simple cases εk is kept fixed at a value in the range (0, 1), you should know that exploration usually diminishes over time, so that the policy used asymptotically becomes greedy and therefore (as Qk → Q∗) optimal. This can be achieved by making εk approach 0 as k grows. For instance, an ε-greedy exploration schedule of the form εk = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all the state-action pairs (Singh et al., 2000).
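As one possible reading of that schedule, here is a tiny sketch; whether k counts episodes, steps, or per-pair visits is a design choice the excerpt leaves open:

```python
def epsilon_schedule(k):
    """Diminishing exploration rate eps_k = 1/k, which tends to 0 as k grows
    while still letting every (state, action) pair be visited infinitely often."""
    return 1.0 / max(k, 1)

for k in range(1, 6):
    print(k, epsilon_schedule(k))   # 1.0, 0.5, 0.333..., 0.25, 0.2
```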
What I usually do is this: set the initial alpha (the learning rate) to 1/k, with the initial k = 1 or 2. As you go trial by trial and k increases, alpha decreases, and this also keeps the convergence guaranteed.
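Applied to the Q-learning update, that 1/k decay might look like the following minimal sketch; the per-(state, action) visit counter and the table layout are my assumptions, not something the answer specifies:

```python
from collections import defaultdict

GAMMA = 0.99
Q = defaultdict(float)        # Q[(state, action)] -> estimated value
visits = defaultdict(int)     # k: how many times each (state, action) was updated

def q_update(state, action, reward, next_state, n_actions):
    """One Q-learning update with a decaying learning rate alpha_k = 1/k."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]          # decays per visit, as suggested
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Hypothetical transition: in state 0, action 1 yielded reward 1.0, next state 2.
q_update(state=0, action=1, reward=1.0, next_state=2, n_actions=4)
```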