CA425 Practical
Express the 3 x 3
noughts-and-crosses
(or "Tic-tac-toe"
or "X's and O's") game
as a Reinforcement Learning problem.
The program will play against another copy of itself.
That is, at first you will be playing against
a random player.
As you learn, your opponent gets better.
- Define the space of states x.
e.g. Let x = (x1,..,x9) represent the 9 squares.
Each xi takes value:
0 (blank), 1 (me) or 2 (opponent).
Start at x = (0,0,0,0,0,0,0,0,0).
I make first move (one empty square becomes "1").
Then the opponent makes a move into an empty square (becomes "2").
Then I observe the new state.
And can take a new action, and so on.
Note from my viewpoint the world is probabilistic
(same action a in same x can lead to different y
after opponent moves).
Observe x
Take a
(Opponent makes move)
Observe y
- Define the space of actions a.
e.g. Let a = (a1).
a1 takes value 0 to 8 (meaning put "1" into one of the 9 squares).
- How to deal with illegal actions (put "1" into non-empty square).
Various possibilities:
-
When an action is illegal,
you could define it as just leaving you in same state x,
ready to take new action.
Problem: Might prefer to stay in same state x forever if alternative is
opponent wins!
- You could punish the action instead (e.g. lose game)
to make it definitely learn not to take it.
In different states x, different actions a are illegal.
Perhaps too complex to define all these in advance.
We could just learn them all by trial and error.
- Or just run action suggestion through a hand-written checker:
repeat
{
a = getSuggestedAction(x)
}
until ( validAction(x,a) )
- Define an array to hold Q(x,a) values.
See
Coding the state-space as a lookup-table.
See
Sample code for lookup-table Q-learning.
39 states = 19,683 states
We need 19,683 x 9 boxes for Q(x,a) table = 177,147 boxes.
Run
testEnumeration.
- Define the reward function.
r for getting to one of 3 classes of lines.
3 horizontal lines or 3 vertical lines or 2 diagonals.
Rather than define in advance all possible states y where can get r,
might be easier to just write a function "isGoal(state)"
to check if any given state y is goal.
Punishment if opponent wins.
Zero if draw.
- Set it up so you can do a run (until end game, maximum of 9 steps)
and can plug in any learning algorithm to modify the Q-values.
- Start by modifying the Q-values as follows.
When you get a reward or punishment:
Q(x,a) := r
- Learn playing against itself.
- Demo it playing a random opponent.
Show how well it plays.
The above is for 90 percent. For the last 10 percent do:
- First, modify it to deal with the fact that sometimes you get a reward / punishment,
sometimes you don't (because opponent is unpredictable).
Build up a running average of all feedback:
Q(x,a) := (1-α) Q(x,a)
+ α r
where α
goes 1, 1/2, 1/3, ...
i.e.
α
is 1/n,
where n is the number of times you have updated this particular pair Q(x,a).
i.e. You need to keep another array n(x,a) to remember all these.
- Then modify it to look into the future, so we can move towards a goal state
from far away:
Q(x,a) := (1-α) Q(x,a) +
α
( r + γ c )
where c = best Q-value in next state
- You could save Q-values to file, and read from file on startup,
so don't start blank each time program is run.
To hand up:
- Write it in any language.
-
Hand up a commented printout of your program.
Colour, syntax-highlighted
printout in landscape mode
(the best way to print code.)
- Your program on CD.
- Hand up a report showing what your program does, with output and screen shots.
- Give it direct to me or leave it for me in the school office.