I am using the RL-Glue based python-rl framework for Q-learning. My understanding is that, over a number of episodes, the algorithm converges to an optimal policy (a mapping that says which action to take in each state).
Question 1: Does this mean that after a number of episodes (say 1000 or more) I should essentially get the same state-to-action mapping?
When I plot the rewards (or the rewards averaged over 100 episodes), I get a graph similar to Fig. 6.13 in this link.
Question 2: If the algorithm has converged to some policy, why do the rewards drop? Is it possible for the rewards to vary drastically?
Question 3: Is there a standard method I can use to compare the results of various RL algorithms?
Q1: It will converge to a single mapping, unless more than one mapping is optimal.
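One way to check this in practice is to extract the greedy policy from the Q-table at different checkpoints and see whether it has stopped changing. The sketch below assumes your agent stores its Q-values as a 2D array (rows = states, columns = actions); the function names are mine, not part of python-rl.

```python
import numpy as np

def greedy_policy(q_table):
    """Return the greedy state -> action mapping implied by a Q-table."""
    return np.argmax(q_table, axis=1)

def policy_converged(q_old, q_new):
    """True if two Q-table snapshots induce the same greedy policy."""
    return np.array_equal(greedy_policy(q_old), greedy_policy(q_new))
```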
Q2: Q-Learning has an exploration parameter that determines how often it takes random, potentially sub-optimal moves. Rewards will fluctuate as long as this parameter is non-zero.
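For example, with epsilon-greedy action selection (a common choice; I'm assuming that's what your agent uses), an epsilon of 0.1 means roughly one in ten actions is random, so some episodes will collect noticeably less reward even after the Q-values have settled. A minimal sketch:

```python
import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon, n_actions):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    As long as epsilon > 0 the agent keeps taking occasional sub-optimal moves,
    so per-episode reward keeps fluctuating even after convergence."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_table[state]))
```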
Q3: Reward graphs, as in the link you provided. Check http://rl-community.org.
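A common way to make such graphs comparable is to smooth each algorithm's per-episode rewards with the same moving-average window and plot them on the same axes. A small sketch, assuming you have collected per-episode reward lists for each agent yourself:

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(rewards, window=100):
    """Smooth a per-episode reward sequence with a sliding window."""
    rewards = np.asarray(rewards, dtype=float)
    return np.convolve(rewards, np.ones(window) / window, mode="valid")

# rewards_q and rewards_sarsa are hypothetical reward lists from two agents
# run on the same environment with the same number of episodes.
# plt.plot(moving_average(rewards_q), label="Q-learning")
# plt.plot(moving_average(rewards_sarsa), label="SARSA")
# plt.xlabel("Episode"); plt.ylabel("Average reward"); plt.legend(); plt.show()
```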