
Optimal epsilon (ϵ-greedy) value

Tags:

ϵ-greedy policy

I know the Q-learning algorithm should try to balance between exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.

Optimal epsilon value

My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding the epsilon value. Should epsilon be decayed based on the number of times the algorithm has visited a given (state, action) pair, or based on the number of iterations performed?

My suggestions (a rough code sketch follows the list):
  1. Lower the epsilon value each time a given (state, action) pair has been encountered.
  2. Lower the epsilon value after each complete iteration.
  3. Lower the epsilon value each time a state s is encountered.
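In code, the three options could look roughly like this (a minimal Python sketch; the 1/(1+count) decay form and the counter names are only illustrative assumptions):

```python
from collections import defaultdict

# Illustrative counters; names are hypothetical.
sa_counts = defaultdict(int)      # visits per (state, action) pair
state_counts = defaultdict(int)   # visits per state

def epsilon_by_state_action(state, action, eps0=1.0):
    # Option 1: decay epsilon with the visit count of this (state, action) pair.
    return eps0 / (1 + sa_counts[(state, action)])

def epsilon_by_episode(episode, eps0=1.0):
    # Option 2: decay epsilon once per completed iteration/episode.
    return eps0 / (1 + episode)

def epsilon_by_state(state, eps0=1.0):
    # Option 3: decay epsilon with the visit count of the state alone.
    return eps0 / (1 + state_counts[state])
```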

Much appreciated!

asked Apr 02 '14 by OccamsMan


People also ask

Is Epsilon-greedy optimal?

In the limit, the epsilon-greedy action selection policy is guaranteed to discover the optimal actions. Usually the optimal action, i.e., the action with the highest Q-value, is selected; otherwise the algorithm explores by taking a random action. An epsilon-greedy algorithm is easy to understand and implement.

What value should Epsilon be?

The EPSILON property (JavaScript's Number.EPSILON, the machine epsilon) has a value of approximately 2.220446049250313E-16, or 2^-52. Note that this is unrelated to the epsilon in ϵ-greedy exploration, which is a probability and therefore lies between 0 and 1.

What is the ε-greedy policy?

Epsilon-greedy is a simple method to balance exploration and exploitation by choosing between them at random. Epsilon refers to the probability of choosing to explore: the policy exploits most of the time, with a small chance of exploring.
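As a minimal Python sketch of that rule (the Q-table keyed by (state, action) and the default epsilon of 0.1 are assumptions for illustration):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit: highest Q-value
```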

What is the advantage of an ε-greedy strategy?

The Epsilon-Greedy strategy is an easy way to add exploration to the basic Greedy algorithm. Due to the random sampling of actions, the estimated reward values of all actions will converge on their true values.


1 Answer

Although in many simple cases εk is kept as a fixed number between 0 and 1, you should know that: Usually, the exploration diminishes over time, so that the policy used asymptotically becomes greedy and therefore (as Qk → Q∗) optimal. This can be achieved by making εk approach 0 as k grows. For instance, an ε-greedy exploration schedule of the form εk = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all the state-action pairs (Singh et al., 2000).

What I usually do is this: set the initial alpha = 1/k (with the initial k = 1 or 2); as you proceed trial by trial and k increases, alpha decreases. This also keeps the convergence guaranteed.
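A minimal sketch of how such a 1/k schedule might be wired into tabular Q-learning, here applied to epsilon as in the quoted schedule (the env object with reset/step/actions is a hypothetical placeholder, and the same 1/k decay could equally be applied to alpha as described above):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular Q-learning with an epsilon_k = 1/k exploration schedule."""
    Q = defaultdict(float)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                      # diminishes to 0 as k grows
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:      # explore
                action = random.choice(env.actions)
            else:                              # exploit
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            # One-step Q-learning update.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```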

answered Sep 21 '22 by NKN