So I implemented a transposition table in a Monte Carlo Tree Search algorithm using UCT. This allows for keeping a cumulative reward value for a game state, no matter where, and how many times, it is encountered throughout the tree. This increases the quality of the information gathered on the particular game state.
Alas, I have noticed that this creates a certain problem with the exploitation vs. exploration selection phase of UCT. In short, the UCT score assigned to a state takes into account the ratio between the number of times the parent state was visited and the number of times the child (for whom we are calculating the UCT score) was visited. As you can see from this diagram, when pulling a state from the transposition table into a newly created branch of the tree, that ratio is all out of whack: the child state has been visited a great number of times (from elsewhere in the tree) while the parent state has been visited far fewer times, which should technically be impossible.
So, using a transposition table and persisting the cumulative reward value of a state helps the exploitation part of the algorithm make better choices, but at the same time it skews the exploration part of the algorithm in a potentially harmful way. Are you aware of any ways to counteract this unintended problem?
Intuitively, I expect you'll want to try the following.
For the exploitation part of UCT, you'll want to store and use the average scores W / V of child nodes. This average can be shared in its entirety through transpositions. So, in your example, you don't necessarily want to separately share W = 300 and V = 600, but instead just share the average score W / V = 300 / 600 = 0.5. This means that, thanks to transpositions, you can share more accurate estimates of scores (estimates based on larger sample sizes).
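A minimal sketch of sharing the average value per state through a transposition table; the class and method names here (`TranspositionTable`, `update`, `average`) are illustrative, not from the question:

```python
# Hypothetical sketch: a transposition table that exposes a shared *average*
# value estimate per state, rather than handing raw (W, V) totals to new
# branches of the tree.
class TranspositionTable:
    def __init__(self):
        self._stats = {}  # state_key -> (total_reward, visit_count)

    def update(self, state_key, reward):
        """Accumulate a backed-up reward for this state, wherever it occurs."""
        w, v = self._stats.get(state_key, (0.0, 0))
        self._stats[state_key] = (w + reward, v + 1)

    def average(self, state_key):
        """Shared exploitation estimate W / V for this state (0.0 if unseen)."""
        w, v = self._stats.get(state_key, (0.0, 0))
        return w / v if v > 0 else 0.0

table = TranspositionTable()
for r in [1.0, 0.0, 0.5, 0.5]:
    table.update("some_state", r)
print(table.average("some_state"))  # -> 0.5
```

The point is that only the ratio W / V crosses transposition boundaries, so the inflated visit count never leaks into the exploration term of the new parent.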
For the exploration part of UCT, I think you'll want to use statistics "from the perspective" of the parent node (where you don't have a transposition), rather than the child node (which is a transposition of a node elsewhere in the tree). Instead of using visit counts of child nodes when selecting which child node to go to, this means you'll use visit counts collected per (state + action) pair in the parent node.
For example, consider we are in the node where you wrote V: 2, W: 300 (refer to this node as P), and we have to select a child node. The more standard implementation would loop over the children and use the child nodes' visit counts. In your situation, I think it'd be better to instead loop over the actions in node P, and keep track of visit counts for those actions instead of for the child nodes. In P, you'll not yet have taken the action leading to the transposition child node, so the visit count for that (P + action) pair will still be 0.
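This per-edge selection could be sketched as follows, assuming the standard UCT formula; the names (`uct_select`, `edge_stats`, `value_of`) are mine, not from the answer:

```python
import math

def uct_select(parent_visits, edge_stats, value_of, c=1.4):
    """Pick an action using per-(state, action) edge visit counts for the
    exploration term, and the transposition-shared average child value for
    the exploitation term.

    edge_stats: dict mapping action -> visits of that (parent, action) edge
    value_of:   callable mapping action -> shared average value of the child
    """
    best_action, best_score = None, -math.inf
    for action, edge_visits in edge_stats.items():
        if edge_visits == 0:
            # Unvisited edge: explore it, even if the child state itself has
            # been visited many times elsewhere via a transposition.
            return action
        score = value_of(action) + c * math.sqrt(
            math.log(parent_visits) / edge_visits
        )
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

Because the exploration term reads only the edge counter in the parent, a transposition child with a huge global visit count still looks unexplored from P's perspective.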
I do not personally have experience with the combination of MCTS + transpositions though, so you may wish to do additional research into existing papers on the topic to see what others have come up with in the past.
I've implemented the following, which worked well with an AlphaZero-style algorithm for Chess:

(moveValueSum + transpositionValueSum) / (moveVisits + transpositionVisits)

This way we don't sum across all visits of an action; rather, we only average the action value with the single most visited instance of it.
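A small sketch of that blended value, assuming the answer's variable names; how the transposition stats are chosen (e.g. only the single most visited instance of the position) is left to the caller:

```python
# Hypothetical helper implementing the blended average from the answer above.
def blended_value(move_value_sum, move_visits,
                  transposition_value_sum, transposition_visits):
    """Average a move's local stats with stats borrowed from a transposition.

    The visit counts of both sources enter the denominator, so the borrowed
    estimate is weighted by how much evidence backs it.
    """
    total_visits = move_visits + transposition_visits
    if total_visits == 0:
        return 0.0
    return (move_value_sum + transposition_value_sum) / total_visits

# Example: a move visited twice locally (total reward 1.0), blended with a
# transposition instance seen 8 times elsewhere (total reward 5.0):
print(blended_value(1.0, 2, 5.0, 8))  # -> 0.6
```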