MCTS UCT with a scoring system

Tags:

montecarlo

I'm trying to solve a variant of 2048 using Monte Carlo Tree Search. I found that UCT could be a good way to trade off exploration against exploitation.

My only issue is that all the versions I've seen assume that the score is a win percentage. How can I adapt it to a game where the score is the value of the board in the final state, and thus ranges from 1 to MAX rather than being a win or a loss?

score formula

I could normalize the score relative to the constant c by dividing by MAX, but then it would overweight exploration in the early stage of the game (since you get bad average scores) and overweight exploitation in the late stage of the game.


asked Apr 16 '16 13:04

Atol

1 Answer

Indeed, most of the literature assumes your games are either lost or won and awards a score of 0 or 1, which turns into a win ratio when averaged over the number of games played. The exploration parameter C is then usually set to sqrt(2), which is the value the UCB1 analysis for bandit problems gives for rewards in the 0-1 range.
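For reference, the selection rule under discussion is commonly written as below. This is a sketch in my own notation, not a transcription of the formula image in the question: \bar{X}_j is the average score of child j, n is the number of visits to the parent, and n_j is the number of visits to child j.

```latex
\mathrm{UCT}(j) = \bar{X}_j + C \sqrt{\frac{\ln n}{n_j}}
```

With C = sqrt(2) the second term becomes sqrt(2 ln n / n_j), which is the UCB1 bonus for rewards between 0 and 1.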

To find out what a good C is in general, you have to step back a bit and see what UCT is really doing. If one node in your tree had an exceptionally bad score in the one rollout it had, then exploitation says you should never choose it again. But you've only played that node once, so it might have just been bad luck. To acknowledge this, you give that node a bonus. How much? Enough to make it a viable choice even if its average score is the lowest possible and some other node has the highest average score possible. With enough plays it might turn out that the one rollout your bad node had was indeed a fluke, and the node is actually pretty reliable with good scores. Of course, if you get more bad scores, then it was probably not bad luck, and the node won't deserve more rollouts.

So with scores ranging from 0 to 1, a C of sqrt(2) is a good value. If your game has a maximum achievable score, you can normalize your scores by dividing by the max, which forces them into the 0-1 range to suit a C of sqrt(2). Alternatively, you can leave the scores unnormalized and multiply C by your maximum score. The effect is the same: the UCT exploration bonus is large enough to give your underdog nodes some rollouts and a chance to prove themselves.
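A small sketch of why the two options are equivalent: scaling C by the maximum score multiplies every child's UCT value by the same factor, so the same child gets picked. The function name `uct`, the value of `MAX_SCORE`, and the child statistics are made up for illustration.

```python
import math

def uct(mean_score, child_visits, parent_visits, c):
    """Average score plus the UCT exploration bonus."""
    return mean_score + c * math.sqrt(math.log(parent_visits) / child_visits)

MAX_SCORE = 2048.0  # hypothetical maximum achievable score

# Hypothetical children: (mean raw score, number of visits)
children = [(900.0, 40), (300.0, 2), (1200.0, 58)]
parent_visits = sum(visits for _, visits in children)

# Option 1: normalize the scores into 0-1 and keep C = sqrt(2).
normalized = [uct(mean / MAX_SCORE, visits, parent_visits, math.sqrt(2))
              for mean, visits in children]

# Option 2: keep the raw scores and multiply C by the maximum score.
scaled = [uct(mean, visits, parent_visits, math.sqrt(2) * MAX_SCORE)
          for mean, visits in children]

best_normalized = max(range(len(children)), key=normalized.__getitem__)
best_scaled = max(range(len(children)), key=scaled.__getitem__)
print(best_normalized, best_scaled)
```

Both options select the second child here: it has the worst average but only two visits, so the bonus gives it another chance.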

There is an alternative way of setting C dynamically that has given me good results. As you play, you keep track of the highest and lowest scores you've ever seen in each node (and subtree). This is the range of possible scores, and it gives you a hint of how big C should be in order to give poorly explored underdog nodes a fair chance. Every time I descend into the tree and pick a new root, I adjust C to be sqrt(2) * score range for the new root. In addition, as rollouts complete and their scores turn out to be a new highest or lowest score, I adjust C in the same way. By continually adjusting C this way, both as you play and as you pick a new root, you keep C as large as it needs to be to converge but as small as it can be to converge fast. Note that the minimum score is as important as the max: if every rollout will yield at least a certain score, then C won't need to overcome it. Only the difference between max and min matters.
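A minimal sketch of that bookkeeping, to make the idea concrete. It is not my actual code: the names `Node`, `record`, `exploration_constant` and `select_child` are invented for this example, and the fallback to plain sqrt(2) when no range is known yet is an assumption.

```python
import math

class Node:
    def __init__(self):
        self.visits = 0
        self.total = 0.0
        self.lo = math.inf    # lowest score seen in this subtree
        self.hi = -math.inf   # highest score seen in this subtree
        self.children = []

    def record(self, score):
        """Backpropagation step: call on every node along the rollout's path."""
        self.visits += 1
        self.total += score
        self.lo = min(self.lo, score)
        self.hi = max(self.hi, score)

    def mean(self):
        return self.total / self.visits

def exploration_constant(root):
    """C = sqrt(2) * (max - min) of the scores seen under the current root."""
    if root.visits == 0 or root.hi <= root.lo:
        return math.sqrt(2)  # assumption: no usable range yet
    return math.sqrt(2) * (root.hi - root.lo)

def select_child(parent, c):
    """Pick an unvisited child first, otherwise the child with the best UCT value."""
    for child in parent.children:
        if child.visits == 0:
            return child
    log_n = math.log(parent.visits)
    return max(parent.children,
               key=lambda ch: ch.mean() + c * math.sqrt(log_n / ch.visits))

# Usage with made-up rollout scores: child a had one bad rollout,
# child b has six good ones.
root, a, b = Node(), Node(), Node()
root.children = [a, b]
for child, score in [(a, 512.0), (b, 1500.0), (b, 1400.0), (b, 1600.0),
                     (b, 1500.0), (b, 1450.0), (b, 1550.0)]:
    child.record(score)
    root.record(score)

c = exploration_constant(root)   # recompute after each rollout or root change
chosen = select_child(root, c)
```

In this example the score range under the root is 1600 - 512, and the resulting C is large enough that the underdog `a` is selected for another rollout despite its low average.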


answered Sep 30 '22 08:09

Tubeliar
