|
Pxa(r) not very useful.
Say in state x and:
Pxa(r=1) = 0.5,
Pxa(0) = 0.3,
Pxa(-1) = 0.2,
and:
Pxb(3) = 0.5,
Pxb(-3) = 0.5.
Take action a or b?
How do we decide?
Too much info!
We want to know E(r) for action decision.
And this is what Q-value is
(or, even more useful, E(R)).
Exercise - What is E(r) for each action?
Maybe
Pxa(r) is useful
where there is a large standard deviation:
Pxb(100) = 0.5,
Pxb(-100) = 0.5.
Q. What is E(r)?
E(r) doesn't tell full story here. May want to avoid the -100 at all costs.
[Although solution here, if you really must avoid a certain
punishment, is simply make that punishment -1000
(while keeping positive rewards fixed to less than 100).
The machine will soon learn to avoid it.]
In other words, who is to say that -100 is a big punishment?
It looks big, but if most rewards are +1000, then it is actually
a small punishment.
What matters is not the absolute sizes of r's,
but their relative sizes.
A machine with reward function:
"if good thing r = 1 million, if bad thing r = -1 million,
else r = 0"
will perform no different to a machine with reward function:
"if good thing r = 1, if bad thing r = -1,
else r = 0"
Proof: See following.
Q(x,a) =![]()
c + γ c + γ2 c + ..to all Q-values.
See also Appendix C and Appendix D.
if (good event) 10 else 5is the same as (add -5):
if (good event) 5 else 0same as (multiply by 2/5):
if (good event) 2 else 0same as (add 3):
if (good event) 5 else 3
if (good event) r1 else r2is the same as:
if (good event) (r1-r2) else 0same as (multiply by something positive, r3 > r4):
if (good event) (r3-r4) else 0same as:
if (good event) r3 else r4
So any 2-reward fn can be transformed into any other, provided we don't change the order of the rewards.
if (good event) r1 else if (bad event) r2 else r3is the same as:
if (good) (r1-r3) else if (bad) (r2-r3) else 0i.e. (where (r3-r2) positive):
if (good) (r1-r3) else if (bad) - (r3-r2) else 0same as:
if (good) (r4-r6) else if (bad) - (r3-r2)(r4-r6)/(r1-r3) else 0same as:
if (good) r4 else if (bad) r6 - ((r3-r2)(r4-r6)/(r1-r3)) else r6
Can get 1st and last rewards to anything, but then restricted in what middle reward can be.
Can't transform it into arbitrary r5.
(r3-r2)(r4-r6)/(r1-r3) > 0So:
r6 - ((r3-r2)(r4-r6)/(r1-r3)) < r6
And:
r6 < r4
3-rewards - 2 good things - 2 goals - many ways to balance how you pursue both - a few hits of middle r may compensate for losing one high r - may pursue middle r and ignore higher one
To build a model of the "laws of physics"
we need a huge data structure
x,a,y -> [0,1]
in which 99 % of the entries will be zero.
e.g. Let x = Waterhole is N, a = move N,
y = Waterhole is here.
Pxa(x) = 0.5
Pxa(y) = 0.5
Pxa(w) = 0
for all the millions of w other than x,y
Millions of them because the waterhole is actually only
1 dimension of the entire n-dimensional input state:
Pxa(w,*,*,*,...,*) = 0
e.g. Chess.
x = opening configuration of board
a = move white pawn
Opponent then moves.
Observe new state y
Pxa(y) = 0 for billions of states y
(e.g. all check states, all states without full no. of pieces)
In most worlds, for any given x,a pair, the probability of going to millions of y is zero.
As Q-values improve in accuracy, we need to re-visit updates we already did. We previously updated:
The r was accurate.
Q(y,b) was totally inaccurate, however,
so once we learn an accurate value for Q(y,b),
we should really do this update again.
If we do store a world model, then we can re-do updates in our head, factoring in more accurate Q(y,b)'s.
e.g. In waterhole above,
we have explicitly stored the fact that y,b -> z
So as soon as
maxd Q(z,d) increases
(i.e. as soon as we discover action c),
we can reflect this change in Q(y,b),
without having to revisit y.
And we know that x,a -> y
so we can further update Q(x,a)
without having to revisit x.
Need to be careful about computational cost of this backtracking:
There is no limit to this replay: As Q(x,a) gets more accurate, the states that lead to x can be updated with more accurate next-Q-value estimates, and so on. z itself may eventually lead to x in a big loop, so that gets updated, and then x leads eventually to z, so that gets updated again, and so on. We can replay any finite set of experiences forever, updating perhaps all Q-values in the system many times before whole system finally converges.
"Dreaming"? - "Reasoning"? - Explicit memories.
Things you can do offline: Are these perhaps best done when the brain is not otherwise occupied? i.e. Perhaps best done when asleep?
If we don't store a model, we have to wait until y leads to z again before we update Q-values for y (and wait until x leads to y again, etc.) - Implicit memories (Q-values reflect our past experience implicitly).
Model good for changing worlds/problems. Also good if experiences are costly / rare.
Instead of building model (huge data structure, mostly redundant, i.e. mostly 0's) we could just store whatever experiences we had and then replay.
Replay more complex with probabilistic world
- must be fair.
e.g.
x,a -> y
with probability 0.1,
x,a -> z
with prob. 0.9.
Say you just experience
x,a -> y
once, and then replay it 20 times.
- A model may be more trouble than it's worth.
Model is useful for changing world - easier to spot that dynamics of the world have changed if you have a prediction of what they should have been.
Still tricky to detect because MDP allows probabilistic world.
Model useful when give the agent new goals in the same world: "Go to state z". See Feudal Q-learning.
On Internet since 1987.