 

Q Learning Algorithm for Tic Tac Toe

I cannot understand how to update the Q values for a tic-tac-toe game. I have read about it, but I cannot picture how to do it in practice. I read that the Q value is updated at the end of the game, but I don't understand how that works if there is a Q value for each action.

asked Jan 19 '15 by bzkrtmurat

2 Answers

You have a Q value for each state-action pair. You update one Q value after every action you perform. More precisely, if applying action a1 from state s1 gets you into state s2 and brings you some reward r, then you update Q(s1, a1) as follows:

Q(s1, a1) = Q(s1, a1) + learning_rate * (r + discount_factor * max Q(s2, _) - Q(s1, a1))

In many games, such as tic-tac-toe, you don't get any reward until the end of the game, which is why you have to run the algorithm over many episodes. That's how information about the utility of final states is propagated back to earlier states.
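As a concrete illustration of the update formula above, here is a minimal sketch of a tabular Q-learning step in Python. The names (`Q`, `update`, `legal_actions_in_s2`) are illustrative, not from the answer; states and actions can be any hashable representation of the board and move.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a), defaults to 0.0
learning_rate = 0.1
discount_factor = 0.9

def update(s1, a1, r, s2, legal_actions_in_s2):
    """One Q-learning step: action a1 taken in state s1 led to
    reward r and next state s2."""
    # max over Q(s2, _); if s2 is terminal there are no actions, so use 0.0
    best_next = max((Q[(s2, a)] for a in legal_actions_in_s2), default=0.0)
    Q[(s1, a1)] += learning_rate * (r + discount_factor * best_next - Q[(s1, a1)])
```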

answered Oct 16 '22 by Tudor Berariu


The problem with the standard Q Learning algorithm is that it takes too long to propagate the values from the final move back to the first move, because you only know the outcome of the game at the end.

Therefore the Q Learning algorithm should be modified. The following paper gives some details on possible modifications:

  1. A non-negative reward is given only after the game ends (except for a draw); the Q update is therefore not performed at every action step (where it would change nothing), but only once the game is over.
  2. The Q update is performed by propagating the new value from the last move backward to the first move (a minimal sketch of this backward sweep follows the list).
  3. An additional update formula is incorporated that also considers the opponent's point of view, because of the turn-taking nature of a two-player game.
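The following is only a rough sketch of points 1 and 2, not the paper's exact rule (in particular, it omits the opponent-aware formula from point 3). It assumes a tabular Q dictionary and an `episode` list holding the agent's (state, action) pairs in the order they were played; all names are hypothetical.

```python
def update_episode(Q, episode, final_reward,
                   learning_rate=0.1, discount_factor=0.9):
    """Update Q only when the game terminates, sweeping from the
    last move backward to the first.

    episode: list of (state, action) pairs played by the agent, in order
    final_reward: reward received at the end of the game
    """
    target = final_reward
    for state, action in reversed(episode):
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + learning_rate * (target - old)
        # the freshly updated value becomes the (discounted) target
        # for the move that preceded it
        target = discount_factor * Q[(state, action)]
```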

Abstract:

This paper reports our experiment on applying Q Learning algorithm for learning to play Tic-tac-toe. The original algorithm is modified by updating the Q value only when the game terminates, propagating the update process from the final move backward to the first move, and incorporating a new update rule. We evaluate the agent performance using full-board and partial-board representations. In this evaluation, the agent plays the tic-tac-toe game against human players. The evaluation results show that the performance of modified Q Learning algorithm with partial-board representation is comparable to that of human players.

Learning to Play Tic-Tac-Toe (2009) by Dwi H. Widyantoro & Yus G. Vembrina

(Unfortunately it is behind a paywall. Either you have access to the IEEE archive or you can ask the authors to provide a copy on researchgate: https://www.researchgate.net/publication/251899151_Learning_to_play_Tic-tac-toe)

answered Oct 16 '22 by vonjd