How do we actually code the state-space? In particular how do we encode the Q(x,a) data structure? We want a data structure that can take as input (i.e. be indexed by) two vectors x and a, both members of finite sets, and return a real number output Q(x,a).
The basic idea is that because x and a are members of finite sets, we can give each of them a unique ID number, and so we can give the (x,a) combination a unique ID too. Then we just have an array of some size N of real numbers, indexed by this ID.
x = (x1, x2, .. xn)
where each xi takes one of some finite set of values v0, v1, ... vm (in fact, the set of values, and the number of them, could be different for each dimension i).
e.g. x = (x1, .. x10) where for example x7 takes values -100, 0, or 100.
To help us enumerate the states so we can construct an ID number, we simply constrain the set of values to run from 0 to m (we can translate to/from the real values when we use them). So above we say x7 takes values 0,1,2.
Even with large finite, we may need generalisation.
To reduce the statespace size, we may need to be careful with our definition of state. Note that we can easily define too many possibilities, so that a huge chunk of the logical state space may not actually exist, e.g. all blank in chess board, or all kings. Could even have 90 percent of the lookup table unused.Often we might make the world discrete and finite by dividing up the continuous input into a small number of coarse-grained divisions. But remember this in itself is a generalisation.
Consider how we could enumerate the finite set of states defined as follows:
x = (x1, x2, x3)
where x1 takes values 0,1,
x2 takes values 0,1,2,
and x3 takes values 0,1.
We can sum this up by just having a set of variables saying how many values each dimension takes:
c1=2, c2=3, c3=2.
We can enumerate as follows:
x = (0,0,0) id = 0 0 0 1 1 0 1 0 2 0 1 1 3 0 2 0 4 0 2 1 5 1 0 0 6 1 0 1 7 1 1 0 8 1 1 1 9 1 2 0 10 1 2 1 11Note that it is not binary.
Consider the last state (1,2,1).
The "1" means we have done the "0"'s in the first column, i.e. we have done 1.c2.c3,
plus the "2" means that within the "1"s we have already done an additional 2.c3,
plus the last "1" means we have done an additional 1.
So id(1,2,1) = 1.c2.c3 + 2.c3 + 1
In general, we can enumerate a state as follows:
int id ( state x )
return x1c2c3 + x2c3 + x3
x = (x1, x2, .. xn),
id(x) = x1c2..cn + x2c3..cn + ... + x(n-1)cn + xn
The number of possible states is:
nostates = c1c2...cn
The state IDs run from: 0 .. (nostates-1)
Similar scheme. Action IDs run from: 0 .. (noactions-1)
(x,a) pairs can be enumerated:
(0,0) id = 0 + 0 (0,1) 0 + 1 ... ... (0,noactions-1) 0 + noactions-1 (1,0) 1.noactions + 0 ... ... (1,noactions-1) 1.noactions + noactions-1 (2,0) 2.noactions + 0 ... ... (nostates-1,noactions-1)The ID of an (x,a) pair is:
The number of (x,a) combinations is:
IDs run from:
0 .. (nostates*noactions)-1
So our function to access the State Space is:
real StateSpace :: at ( vector x, vector a )
int id = (id(x) * noactions) + id(a)
return ActualArray [ id ]
where ActualArray is simply an ordinary 1-dimensional array of real numbers:
real ActualArray [ nostates * noactions ]Then we have one of these data structures to hold the Q-values:
and we can just use it like:
Q.at(x,a) = ((1-ALPHA) * Q.at(x,a)) + (ALPHA * new-estimate)
In the sample code you will see that I enumerated Actions x States rather than States x Actions as above.