|
do forever observe x suggest a according to some decision-algorithm execute a observe y receive reward r(y) or r(x,y) modify decision-algorithm repeat
a can be "no-op".
r can be 0 every step when nothing happens.
Start in state x:
In state x,
taking action b gives a guaranteed reward of 1.5.
Taking action a normally leads to a reward of only 1,
but sometimes leads to a reward of 5.
Pxa(y) = 0.8
Pxa(z) = 0.2
Pxa(w) = 0 for all w other than y,z
(probabilistic transition)
Pxb(s) = 1
Pxb(w) = 0 for all w other than s
(deterministic transition)
x = ( direction )
Consider:
P ( xt+1=bump |
xt=st, at=bt,
xt-1=st-1, at-1=bt-1,
... )
P ( bump | wall is N, move N, wall is N, move N, wall is N, move N )
is higher than:
P ( bump | wall is N, move N, wall not visible, move N, wall not visible, move N )
i.e. There isn't simply a constant:
P ( bump | wall is N, move N )
Q. What simple change would turn the above into a MDP (in fact, a deterministic MDP)?
Consider:
Observe x Take a (Opponent makes non-deterministic move) Observe y
In the above,
How do we decide which action to take?
Look at expected reward:
Ea(r) = 0.2 (5) + 0.8 (1)
= 1.8
Eb(r) = 1 (1.5)
= 1.5
Action a is better.
What do the probabilities mean?
They mean that if you take a 100 times,
you expect 20 5's and 80 1's.
Total reward 180.
If you take b 100 times
you expect 100 1.5's.
Total reward 150.
Maximise E(r).
You are in state x = (1,5).
Imagine that if you take some action a,
you go to state y with
probabilities:
You can calculate E(r).y Pxa(y) reward (1,8) 0.4 10 (0,8) 0.3 5 (1,5) 0.2 1 (0,5) 0.1 0
But "E(y)" is meaningless.
Consider the sum of the y's times their probabilities
= (1,8) 0.4 + (0,8) 0.3 + (1,5) 0.2 + (0,5) 0.1
= (0.6, 7.1)
You are never in this state when you take a.
In fact, this state doesn't exist.
"0.6" doesn't have a meaning.
The state space is not a continuous space. It is a discrete space of points.
Typically, we don't reward for the correct action, r = r(x,a):
in state x if took action b then reward 1 else reward 0because we don't know the correct action!
in state x if got to state z then reward 1 else reward 0
e.g. In the following:
Taking action a in state x,
P(r=5) = 0.2
P(r=0) = 0.4
P(r=3) = 0.3 + 0.1
= 0.4
Question - What is E(r)?
Reward on transition to good state only: r = r(x,y).
e.g. x = not at waterhole, y = at waterhole.
Problem - may leave state to come back in!
See Discussion.
Reward for arriving and staying in good state: r = r(y).
a a1 a2 a3 Q(x,a) 0 5.1 5Deterministic policy - When in x, always take action a2.
Stochastic policy - When in x, 80 % of time take a2, 20 % of time take a3.
Typically - Start stochastic. Become more deterministic as we learn.
Not just maximise immediate E(r) (i.e. deal with probability).
Maximise long-term sum of expected rewards. This may involve taking a short-term loss for long-term gain (equivalent to allowing a move downhill in hill-climbing).
0 ≤ γ < 1
(e.g. γ = 0.9)
If rewards can be positive or negative, R may be a sum of positive and negative terms.
Special case:
γ = 0
Doesn't work:
γ = 1
(see later)
The interesting situation is in the middle (0 < γ < 1)
r in the future can even override immediate punishment
If we decide what to do based on which path has highest R, then [punishment now, r in the future] can beat [reward now, but not in future] if:
If maximum reward is rmax,
what is maximum possible value of R?
The most it could be would be to get the maximum reward every step:

What is this quantity on the right? Let:
Multiply by γ:
Subtract the 2 equations:
And we get a value for X:
That is:
See Appendix A and Appendix B.
This is just the Negative Binomial Series in the Binomial Theorem.
In the real world, we may not know how to get back to state x
to try another action:
x = "I am at party"
a = "I say chat-up line c1"
result - punishment
I might want to say "Go back to x, try c2" but I may not know how to get back to x. All I can do is try out actions in the states in which I happen to find myself. I have to wait until state x happens to present itself again.
Q-learning can work even when states and actions are experienced haphazardly.
Q-learning is for when you don't know Pxa(y) and Pxa(r).
But one approach is just:
Interact with world large number of times
to build up a table of estimates for
Pxa(y)
and
Pxa(r).
Then just do offline DP.
We keep Q-values Q(x,a) for each (x,a) pair.
We tend to take the action with the highest Q-value:
The Q-value will express the discounted reward above:
Can be built up recursively:
Using this update rule:
the Q-value will bounce back and forth:
Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...
More sensible to build up an average:
Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...
See Appendix A for how to build up a running average like this without needing infinite memory.
In general, we use the Notation of updating the running average D "in the direction of" the latest sample d, rather than assigning directly to it:
If d < D then the running average D goes down. If d > D then the running average D goes up.
So the Q-value is updated in the direction of the current sample:
Theorem B.1 - If d bounded, then D bounded.
α must start at 1 to get rid of Dinit
Thereafter α can be anything from 0 to 1.
D := 0 + 1.d (since α starts at 1)
D := -10 d + 0
D := -10 (-10 d) + 0
D := - 103 d
D := 104 d
D := - 105 d
....
D not bounded
i.e. If take this action, you will get immediate reward rmax
and you can get next reward rmax
(it is the best you can get in next state, other actions may be worse)
and you can get next reward rmax ..
....
i.e. If take this action,
you will get immediate reward rmin
and you have to get next reward rmin
(all actions in next state lead to
the worst reward rmin,
otherwise the Q-value would be higher)
and you have to get next reward rmin ..
....
i.e. You are trapped in an attractor
from which you cannot escape.
You cannot ever again get out and go to a state
with rewards greater than rmin (and such states must exist).
Not necessarily a single absorbing state,
but a family of states outside of which you can never leave.
Note if expected reward E(r) = rmin, then you will definitely get rmin on this action (r deterministic, though y can still be probabilistic).
If start at:
α = 1/t
then initial Q-values bias our Q-values for some time.
And since we only run for finite time in any finite experiment,
the bias may still be there after learning.
Consider being "born" with Q-values already filled in (i.e. in DNA) and then start learning:
a1 a2
Q(x,a) 0 0
good Q-values to be born with:
a1 a2
Q(x,a) -1000 0
- even if experiment in childhood,
with a moderate temperature
Boltzmann control policy,
still unlikely to try a1.
On Internet since 1987.