Alpha and Gamma parameters in QLearning

Tags: language-agnostic, artificial-intelligence, reinforcement-learning

What difference does a large or small gamma value make to the algorithm? As I see it, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q-values seem to get very close to zero really quickly (I'm seeing values on the order of 10^-300 in just a quick test). How do people usually plot Q-values for a problem like that (I'm plotting (x, y, best Q-value for that state))? I'm trying to get around it with logarithms, but even then it feels kind of awkward.

Also, I don't get the reason for having an alpha parameter in the Q-learning update function. It basically sets the magnitude of the update we make to the Q-value function. My understanding is that it is usually decreased over time. What is the point of decreasing it over time? Should an update at the beginning carry more weight than one made 1000 episodes later?

Also, I was thinking that a good way to explore the state space, whenever the agent doesn't take the greedy action, would be to move toward any state that still has a zero Q-value (which means, at least most of the time, a state that has never been visited), but I don't see that mentioned anywhere in the literature. Are there any downsides to this? I know it can't be used with (at least some) generalization functions.

Another idea would be to keep a table of visited state/action pairs and prefer the actions that have been tried the fewest times in that state. Of course this can only be done in relatively small state spaces (in my case it is definitely possible).

A third idea, for late in the exploration process, would be to pick an action by looking not only at the Q-values of the actions available in the current state, but also at the actions available in the states they lead to, then at the states after those, and so on.

I know these questions are kind of unrelated, but I'd like to hear the opinions of people who have worked with this before and (probably) struggled with some of these issues too.

asked Dec 06 '09 07:12

devoured elysium

2 Answers

From a reinforcement learning master's candidate:

Alpha is the learning rate. If the reward or transition function is stochastic (random), then alpha should change over time, approaching zero at infinity. This is because each update uses a single random sample, while the quantity being estimated is an expectation (the inner product of T (transition) and R (reward)); a shrinking alpha averages out the noise when one or both of them behave randomly.
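For reference, here is a minimal sketch of the tabular Q-learning update in which alpha and gamma appear. The dict-based table, the function name q_update, and the state and action labels are illustrative assumptions, not anything taken from the question.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value obtainable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Target: immediate reward plus the discounted estimate of future reward.
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)            # unseen (state, action) pairs start at 0
actions = ["left", "right"]       # hypothetical action set
q[("s1", "right")] = 2.0          # pretend s1 already has some value

new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
print(new_value)
```

Alpha only controls how far the old estimate moves toward the target; gamma only controls how much of the next state's value goes into that target.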

That fact is important to note.

Gamma is the value placed on future reward. It can affect learning quite a bit, and can be a dynamic or static value. If it is equal to one, the agent values future reward JUST AS MUCH as current reward. This means that doing something good ten actions from now is JUST AS VALUABLE as doing it right away. So learning tends not to work that well when gamma is at or very near one, since the agent has little incentive to reach rewards sooner.

Conversely, a gamma of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions.
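To make the two extremes concrete, here is a small sketch that weights the reward k steps ahead by gamma**k. The reward sequence is a made-up example (nothing for nine steps, then a reward of 1), and discounted_return is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical episode: nothing for nine steps, then a reward of 1.
rewards = [0.0] * 9 + [1.0]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma at 0 the delayed reward is invisible to the agent, with gamma at 1 it counts in full no matter how late it arrives, and values in between trade off the two.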

Also, as for exploration behavior: there is actually a TON of literature on this. All of your ideas have, 100%, been tried. I would recommend a more detailed search, and even googling Decision Theory and "Policy Improvement".
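As one illustration, the second idea from the question (keep visit counts and prefer the least-tried action) could be sketched roughly like this. It is only one possible reading of that idea; choose_action, the epsilon parameter, and the action names are assumptions made for the example.

```python
import random
from collections import defaultdict

def choose_action(q, counts, state, actions, epsilon, rng=random):
    """Epsilon-greedy, but exploratory steps pick the least-tried action."""
    if rng.random() < epsilon:
        # Explore: prefer the action tried least often in this state.
        action = min(actions, key=lambda a: counts[(state, a)])
    else:
        # Exploit: take the action with the highest current Q-value.
        action = max(actions, key=lambda a: q[(state, a)])
    counts[(state, action)] += 1
    return action

q = defaultdict(float)
counts = defaultdict(int)
actions = ["up", "down", "left", "right"]   # hypothetical action set

# With epsilon = 1.0 every step explores, so tries are spread evenly.
for _ in range(8):
    choose_action(q, counts, "s0", actions, epsilon=1.0)
print(dict(counts))
```

The table of counts grows with the number of state/action pairs, which is why, as the question notes, this only works for relatively small state spaces.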

Just adding a note on alpha: imagine you have a reward function that spits out 1 or 0 for a certain state-action combo SA. Every time you execute SA, you get 1 or 0. If you keep alpha at 1, your Q-value will simply be whatever came last, 1 or 0. If alpha is 0.5, the Q-value moves halfway toward each new reward, so it keeps oscillating forever and never settles. However, if you shrink alpha every time, for example using alpha = 1/n on the n-th visit, you get values like this (assuming the reward is received as 1, 0, 1, 0, ... and ignoring the next-state term): 1, 0.5, 0.67, 0.5, 0.6, .... These converge to 0.5, which is the expected reward in a probabilistic sense.
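The example above can be checked numerically. This sketch assumes a 1/n step size on the n-th visit (one common decaying schedule, chosen here for illustration) and ignores the next-state term, so it tracks a single state-action pair only. The names run, constant, and decaying are made up for the example.

```python
def run(rewards, alpha_schedule):
    """Track Q for one state-action pair; no next-state term for simplicity."""
    q, history = 0.0, []
    for n, r in enumerate(rewards, start=1):
        alpha = alpha_schedule(n)
        q += alpha * (r - q)      # move Q a fraction alpha toward the reward
        history.append(q)
    return history

rewards = [1.0, 0.0] * 500        # 1, 0, 1, 0, ... for 1000 visits

constant = run(rewards, lambda n: 0.5)       # fixed step size
decaying = run(rewards, lambda n: 1.0 / n)   # running average of rewards

print([round(v, 3) for v in decaying[:5]])   # first few decaying-alpha values
print(constant[-2:], decaying[-1])           # where each schedule ends up
```

With the fixed step size the last two values stay far apart however long you run it, while the decaying schedule ends up at the average reward.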

answered Sep 18 '22 23:09

user1949902

What difference to the algorithm does it make having a big or small gamma value?

Gamma should correspond to the size of the state space: as a rule of thumb, use larger gammas (i.e. closer to 1) for big state spaces, and smaller gammas for smaller spaces.

One way to think about gamma is that it represents the decay rate of a reward as you move back from the final, successful state.
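A rough sketch of that decay, assuming a single reward of 1 at the goal that is discounted once per step of distance (a simplification; the exact exponent depends on the convention used). It also suggests why a small gamma in a large space can produce values near 10^-300, as the question describes, and why the logarithm of the value is the natural thing to plot. Both helper names are hypothetical.

```python
import math

def value_k_steps_from_goal(gamma, k, goal_reward=1.0):
    """Value of a state whose only reward lies k discounted steps away."""
    return goal_reward * gamma ** k

def effective_horizon(gamma):
    """Rough number of steps over which rewards still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.5, 0.9, 0.99):
    v = value_k_steps_from_goal(gamma, 1000)
    # log10(v) equals 1000 * log10(gamma), so it is linear in the distance.
    print(gamma, v, math.log10(v), effective_horizon(gamma))
```

If the goal is typically much further away than the effective horizon, the values of distant states become vanishingly small, which fits the advice to use a gamma closer to 1 for bigger spaces.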

answered Sep 22 '22 23:09

mynameisvinn
