I know that the Q-learning algorithm should balance exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.
My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding on the epsilon value. Should epsilon be determined by the number of times the algorithm has visited a given (state, action) pair, or by the number of iterations performed?
Any suggestions would be much appreciated!
With the ε-greedy action selection policy, the greedy action, i.e., the action with the highest Q-value, is usually selected (with probability 1 − ε); otherwise, the algorithm explores a random action (with probability ε). Because every action retains a nonzero probability of being tried, the policy is guaranteed to eventually discover the optimal actions. An epsilon-greedy algorithm is easy to understand and implement.
Epsilon-greedy is a simple method that balances exploration and exploitation by choosing between them randomly: with probability ε (epsilon) the agent explores, and with probability 1 − ε it exploits, so it exploits most of the time with a small chance of exploring. The epsilon-greedy strategy is an easy way to add exploration to the basic greedy algorithm, and because actions are occasionally sampled at random, the estimated reward values of all actions converge to their true values.
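To make that concrete, here is a minimal sketch of ε-greedy action selection in Python; the Q-value array, the fixed seed, and the four-action example are my own illustrative assumptions, not part of the answers above:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest estimated value

# Hypothetical Q-values for a single state with 4 actions.
q = np.array([0.1, 0.5, 0.2, 0.4])
action = epsilon_greedy(q, epsilon=0.1)           # greedy about 90% of the time
```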
Although in many simple cases εk is kept fixed at a value in the range (0, 1), you should know that exploration usually diminishes over time, so that the policy used asymptotically becomes greedy and therefore (as Qk → Q∗) optimal. This can be achieved by making εk approach 0 as k grows. For instance, an ε-greedy exploration schedule of the form εk = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all the state-action pairs (Singh et al., 2000).
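As one possible reading of that schedule, here is a tiny sketch; whether k counts episodes, steps, or per-pair visits is a design choice the excerpt leaves open:

```python
def epsilon_schedule(k):
    """Diminishing exploration rate eps_k = 1/k, which tends to 0 as k grows
    while still letting every (state, action) pair be visited infinitely often."""
    return 1.0 / max(k, 1)

for k in range(1, 6):
    print(k, epsilon_schedule(k))   # 1.0, 0.5, 0.333..., 0.25, 0.2
```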
What I usually do is this: set the initial alpha (the learning rate) to 1/k, with the initial k = 1 or 2. As you go trial by trial and k increases, alpha decreases, and this also keeps the convergence guaranteed.
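Applied to the Q-learning update, that 1/k decay might look like the following minimal sketch; the per-(state, action) visit counter and the table layout are my assumptions, not something the answer specifies:

```python
from collections import defaultdict

GAMMA = 0.99
Q = defaultdict(float)        # Q[(state, action)] -> estimated value
visits = defaultdict(int)     # k: how many times each (state, action) was updated

def q_update(state, action, reward, next_state, n_actions):
    """One Q-learning update with a decaying learning rate alpha_k = 1/k."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]          # decays per visit, as suggested
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Hypothetical transition: in state 0, action 1 yielded reward 1.0, next state 2.
q_update(state=0, action=1, reward=1.0, next_state=2, n_actions=4)
```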