Q Learning Applied To a Two Player Game

I am trying to implement a Q-learning agent that learns an optimal policy for playing Tic-Tac-Toe against a random agent.

I have a plan that I believe will work, but there is one part I cannot get my head around, and it comes from the fact that there are two players in the environment.

A Q-learning agent works with the current state, s, the action chosen in that state under some policy, a, the successor state that results from that action, s', and the reward received on reaching that successor state, r.

Let's put this into a tuple: (s, a, r, s').

Usually, an agent acts in every state it encounters and uses the Q-learning update equation to update the value of the previous state (and action).

However, as Tic-Tac-Toe has two players, we can partition the set of states into two. One set contains the states where it is the learning agent's turn to act. The other set contains the states where it is the opponent's turn to act.

So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?

I feel it should probably be the latter, since partitioning the states might affect how Q-values are updated when the opponent wins the game.

Any help with this would be great, as there does not seem to be anything online that helps with my predicament.


asked Mar 23 '18 13:03

Frederick

1 Answer

In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned

for playing against a random agent

and that means it can actually work, because the opponent isn't learning or changing its behaviour, so you can reliably treat the opponent as "a part of the environment".

Doing exactly that will also likely be the best approach you can take. Treating the opponent (and their actions) as a part of the environment means that you should simply ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should also immediately generate an action for the opponent (unless your agent's move already ended the game), and only then take the resulting state as the next state.
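As a rough illustration of that idea, here is a minimal sketch of an environment step that plays the agent's move and the random opponent's reply in one go. The board encoding (a tuple of nine cells), the function names, and the reward values of +1 / 0 / -1 are all assumptions made for this example and do not come from the question.

```python
import random

# Hypothetical board encoding: a tuple of 9 cells, each "X", "O" or " ".
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that mark has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    """Indices of the empty cells."""
    return [i for i, cell in enumerate(board) if cell == " "]

def place(board, index, mark):
    """Return a new board with `mark` written at `index`."""
    return board[:index] + (mark,) + board[index + 1:]

def step(board, action, rng=random):
    """Apply the agent's move ("X"), then the random opponent's reply ("O").

    Returns (next_board, reward, done). The intermediate position in which
    the opponent is to move is never handed back to the learner.
    """
    board = place(board, action, "X")
    if winner(board) == "X":
        return board, 1.0, True       # assumed reward for a win
    if not legal_moves(board):
        return board, 0.0, True       # draw: board is full
    board = place(board, rng.choice(legal_moves(board)), "O")
    if winner(board) == "O":
        return board, -1.0, True      # assumed reward for a loss
    return board, 0.0, not legal_moves(board)
```

With this shape, a loss caused by the opponent's reply shows up as the reward for the agent's own last move, so no separate update for opponent-to-move states is needed.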

So, in the tuple (s, a, r, s'), we have:

s = state in which your agent is to move
a = action executed by your agent
r = one-step reward
s' = next state in which your agent is to move again
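To make the tuple concrete, the following is a small sketch of a tabular Q-learning update applied to one such transition. The names `q_update`, `ALPHA` and `GAMMA`, the chosen constant values, and the dictionary-based table are assumptions for illustration only.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

def q_update(Q, s, a, r, s_next, next_actions, done):
    """One tabular Q-learning update for a transition (s, a, r, s').

    `s` and `s_next` are both states in which the learning agent is to
    move; the opponent's reply has already happened between them.
    """
    if done:
        # Terminal transition: there is no future value to bootstrap from.
        target = r
    else:
        target = r + GAMMA * max(
            (Q[(s_next, a2)] for a2 in next_actions), default=0.0
        )
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

Q = defaultdict(float)  # maps (state, action) -> estimated value
```

If the opponent wins right after the agent's move, the transition is passed in with `done=True` and a negative `r`, which lowers the value of the agent's last state-action pair.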

The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
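The nondeterminism described above can be seen in a small experiment: repeating the same action from the same state and counting the distinct next states. This is only a sketch; the string board encoding and the helper name `agent_then_random_opponent` are made up for the example, and win checks are left out for brevity.

```python
import random
from collections import Counter

def agent_then_random_opponent(board, action, rng):
    """Place the agent's "X", then a uniformly random "O" reply.

    `board` is a 9-character string of "X", "O" and " " (assumed encoding).
    """
    cells = list(board)
    cells[action] = "X"
    empty = [i for i, c in enumerate(cells) if c == " "]
    cells[rng.choice(empty)] = "O"
    return "".join(cells)

rng = random.Random(0)
start = " " * 9

# Same state s (empty board) and same action a (centre cell) every time.
outcomes = Counter(
    agent_then_random_opponent(start, 4, rng) for _ in range(1000)
)
print(len(outcomes), "distinct next states observed for one (s, a) pair")
```

From the learner's side, each of those next states is just a possible outcome of a stochastic environment, which is a situation tabular Q-learning is designed to handle.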

Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.


answered Oct 24 '22 05:10

Dennis Soemers

Tags:

python

reinforcement-learning

q-learning

tic-tac-toe
