We normally store the Q-values in a lookup table.
Question - Why does this not work if x,a contain real-number components?
Answer - After some experience our lookup table would look like this:
x           Q(x,a1)   Q(x,a2)   Q(x,a3)
313.4009    0         [0.244]   0
313.7562    [1.355]   0         0
313.7564    0         [0.155]   0

current input x:
313.3170    0         0         0

With real-numbered states, every state is new. The current x has not been visited, so we don't know what to do. Every state x we have seen will have precisely one Q-value in precisely one column, since we only visited it once, and tried one action.
NOTE - In fact, technically, a digital computer system will never deal with actual real numbers. There is always a limit to the number of decimal places (because there is a limit to the number of bytes the floating-point number is stored in). So technically, a lookup table is possible for any computer/robot system - but it could be massive, perhaps bigger than memory.
In practice, with a large number of states or a large number of decimal places, we get the same type of problems as if the inputs were actually real-valued: we never see the same x twice during even a very long experiment, the lookup table is huge (if not actually infinite), etc.
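As a concrete sketch of the problem (using the hypothetical table entries from above), a lookup table keyed by real-valued states almost never sees the same key twice:

```python
# Q-value lookup table keyed by (state, action).
# States here are real numbers, as in the table above.
Q = {}
Q[(313.4009, 2)] = 0.244   # visited once, tried action a2
Q[(313.7562, 1)] = 1.355   # visited once, tried action a1
Q[(313.7564, 2)] = 0.155   # visited once, tried action a2

# A new real-valued state is (almost surely) an exact-match miss,
# even though it is very close to states we have already seen:
x = 313.3170
print((x, 2) in Q)  # False - the table tells us nothing about x
```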
So in this situation
instead of using a lookup table to represent our state-space in RL,
we will use a neural network.
Input = (x,a).
Output = Q(x,a).
Need to transfer info from Q(x,a) to "nearby" (x,a) combinations
("nearby" x's, "nearby" a's).
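A minimal sketch of such a net (the sizes, names, and the encoding of (x,a) into a vector below are illustrative assumptions, not a fixed design). Because nearby inputs share weights, learning for one (x,a) transfers to nearby (x,a) combinations:

```python
import math
import random

# Sketch: a one-hidden-layer net mapping an (x,a) input vector
# to a single Q-value.

def init_net(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
          for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return w1, w2

def q_value(net, xa):
    w1, w2 = net
    # sigmoid hidden layer, linear output (a Q-value can be any real number)
    hidden = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, xa))))
              for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

net = init_net(n_in=5, n_hidden=4)
print(q_value(net, [0.3, 0.7, 0.1, 0.9, 0.2]))  # a single real Q estimate
```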
Revision - Designing the Network
In particular, our choice of how to separate the input nodes will decide how effective our network can be in learning the Q-values.
Might seem strange that we have to make these design decisions to get an effective network, but remember that our coarse-grained discrete lookup table is a generalisation itself. It is already a hand-designed initial approximation, deciding to separate directions into 8 compass points.
Consider the case where x = (f,n,p,i), where f takes values 0..9, n and p also 0..9, i takes values 0 or 1, and a = 0..8.
No. of possible (x,a) combinations = 10 x 10 x 10 x 2 x 9 = 18,000.
The net might be able to separate the inputs if the ones we want to group were numbered together. e.g. If the (x,a) combinations are numbered 1 to 18,000, the net could separate inputs greater than 15,000 (i.e. 15,000 - 18,000) by having a hidden node with weight 1/15000 and threshold 1.
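That hidden node can be checked in a couple of lines (a sketch, assuming a simple threshold unit that fires when the weighted input exceeds the threshold):

```python
# A threshold unit: fires when (weight * input) exceeds the threshold.
# With weight 1/15000 and threshold 1, it fires exactly for inputs > 15000.
def fires(inp, w=1/15000, t=1):
    return w * inp > t

print(fires(14000))  # False: 14000/15000 < 1
print(fires(16000))  # True:  16000/15000 > 1
```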
The problem is, the inputs you want to group together
are likely to be scattered all over the 1 - 18000 range.
e.g.
Separate (*,*,9,*,*) from the rest
Separate (*,9,0-7,*,*) from the rest
Remember that when you learn in a neural net, you backprop the error and change all weights, thus interfering with the output for all other inputs. We adjust the weights to make the net better for x. That may make it worse for x'. If inputs lead to different outputs, we need to learn to separate them.
Only when weights or inputs are zero along some connection is there absolutely no interference. Zero weights/inputs are our main trick in helping the net separate the inputs.
Say hidden node has threshold t_{j} = 50,
and only connected to 1 input,
with w_{ij} = 1
(i.e. weight zero on other links).
Then it only fires for the region where input > 50.
If w_{ij} = - 1, threshold t_{j} = -50,
it only fires for input < 50.
In this way different parts of the net fire for different regions
of n-dimensional input space.
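The two example nodes above can be written out directly (a sketch, assuming simple threshold units with zero weights on all other links):

```python
# A hidden node connected to a single input: fires when w * input > t.
def node_fires(inp, w, t):
    return w * inp > t

# w = 1, t = 50: fires only for input > 50
print(node_fires(60, w=1, t=50))    # True
print(node_fires(40, w=1, t=50))    # False

# w = -1, t = -50: fires only for input < 50
print(node_fires(40, w=-1, t=-50))  # True
print(node_fires(60, w=-1, t=-50))  # False
```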
But we must give it enough links and enough hidden nodes so it can specialise.
Noting that each component of the input takes only
a small number of discrete values, we have 10 + 10 + 10 + 2 + 9
= 41 input nodes, each taking values 0 or 1.
(Every input will have precisely 5 1's and 36 0's)
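This encoding is easy to write down (component names as in the example above; the node ordering is an illustrative choice):

```python
# One-hot encoding of (f, n, p, i, a) into 10 + 10 + 10 + 2 + 9 = 41
# binary input nodes. Exactly one node in each group is set to 1.
def encode(f, n, p, i, a):
    v = [0] * 41
    v[f] = 1          # f in 0..9 -> nodes 0..9
    v[10 + n] = 1     # n in 0..9 -> nodes 10..19
    v[20 + p] = 1     # p in 0..9 -> nodes 20..29
    v[30 + i] = 1     # i in 0..1 -> nodes 30..31
    v[32 + a] = 1     # a in 0..8 -> nodes 32..40
    return v

v = encode(f=3, n=9, p=0, i=1, a=4)
print(len(v), sum(v))  # 41 5 - every input has precisely five 1's
```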
We can rank all our schemes by how much interference they suffer, from most interference down to least (the lookup table, which has none).
In fact, to separate the inputs further,
because we have only a small number of discrete actions,
we could have one net per action.
We have 9 separate nets.
Each takes a vector input x and produces a floating point output:
Q_{a}(x) = Q(x,a)
Now each net has 10 + 10 + 10 + 2 = 32 input nodes, each 0 or 1.
Q-value = x_{k} - t_{k},
where x_{k} is the weighted sum arriving at the output node k and t_{k} is its threshold.
(No need to normalise it with a sigmoid. It goes from - infinity to infinity.)
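Putting this together, a sketch of the one-net-per-action scheme (the sizes and random initialisation below are illustrative assumptions):

```python
import math
import random

# One net per action: 9 nets, each mapping the 32-node encoding of x
# to Q_a(x). The output node is linear (no sigmoid), so Q-values
# are unbounded, as required.

def make_net(n_in=32, n_hidden=6, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
          for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return w1, w2

def q(net, x):
    w1, w2 = net
    hidden = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x))))
              for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))  # linear, unbounded output

nets = [make_net(seed=a) for a in range(9)]  # Q(x,a) = q(nets[a], x)

x = [0] * 32
x[3] = x[19] = x[20] = x[31] = 1  # an encoded state (f=3, n=9, p=0, i=1)
print(max(range(9), key=lambda a: q(nets[a], x)))  # greedy action choice
```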