Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Monte Carlo Tree Search, Backpropagation (Backup) step: Why change perspective of reward value?

I've been reading through the Monte Carlo Tree Search survey paper by Browne et. al:

http://ccg.doc.gold.ac.uk/papers/browne_tciaig12_1.pdf

"A Survey of Monte Carlo Tree Search Methods"

I'm wrestling with just one piece of the pseudocode on p. 9. My question occurs in a similar form in both the Backup and BackupNegamax functions.

Suppose that I'm player 1 in a 2-player zero-sum game. (So, using the BackupNegamax function.) It's my turn to move, and I'm using MCTS to choose my move. In BackupNegamax, why is the delta value negated as you back up the tree? I understand that in a two-player zero-sum game, if the reward is delta for player 1 (me), then it's -delta for player 2. But shouldn't the entire tree be from player 1's perspective? (This would then be similar to how nodes are rated in a minimax tree, if I'm not mistaken.)

If the perspective of the Q value switches back and forth depending on what level of the tree you're on, wouldn't that mess up the calculation shown in the BestChild function? Specifically, suppose some node v has a very high Q value, because it has often led to high rewards for player 1. The given pseudocode seems to suggest that v's parent, which I'll call u, would likely have a very low (very negative) Q value (of course u's Q value would also account for its other children's Q values.)

So it doesn't make sense to me that u (the parent) would have a very low Q value while v (the child) has a very high one. I know v is from player 1's perspective in the pseudocode, and u is from player 2's perspective, but my question is why. Why aren't both node's Q values stored from player 1's perspective? That way both u and v would have high Q values, and thus high exploitation ratings, and they'd both be considered valuable for further exploitation according to the BestChild function.

(I'm coming at MCTS from experience with minimax, and in minimax the entire tree is from Max's perspective, so that's why I'm struggling with the different idea here.)

My question also applies to Backup - why is each Q value updated according to the perspective of the player at that level of the tree, instead of everything being updated from "my" perspective?

I hope I've been clear in my question. Thank you very much for your help!

like image 250
Bob Smith Avatar asked May 28 '15 14:05

Bob Smith


1 Answers

There are two ways to describe this mechanism:

  1. Globally: From the perspective of the root player, in which case the playout values at every second ply are negated, as the opponent is acting against the root player.

  2. Locally: From the perspective of the player who has just moved at each ply, in which case the playout value is not negated, as each player tries to maximise their own reward.

The standard formulation uses option 1, as it's easier to describe, and has its basis in two-player combinatorial games. However, I tend to use the second formulation in my actual implementations, as it is more flexible; it handles games with more than two players, less than two players, variable move order, multi-part moves, cooperative goals, etc.

This just confirms what is said in the other answers.

like image 189
Cameron Browne Avatar answered Nov 15 '22 07:11

Cameron Browne