Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org


Building a world model


2.1.7 Notes on building a world model

x,a,r -> [0,1]
could be hard map to build.
What if r comes from the world (e.g. stock prices), rather than a finite pre-defined list of r's.
r-space could be huge / hard to define / hard to define limits
Even within limits, what if r is real number.

Pxa(r) is too much info

Pxa(r) not very useful. Say in state x and:
Pxa(r=1) = 0.5, Pxa(0) = 0.3, Pxa(-1) = 0.2,
and:
Pxb(3) = 0.5, Pxb(-3) = 0.5.
Take action a or b? How do we decide? Too much info! We want to know E(r) for action decision.
And this is what Q-value is (or, even more useful, E(R)).
Exercise - What is E(r) for each action?



E(r) loses information

Maybe Pxa(r) is useful where there is a large standard deviation:
Pxb(100) = 0.5, Pxb(-100) = 0.5.

Q. What is E(r)?

E(r) doesn't tell full story here. May want to avoid the -100 at all costs.

[Although solution here, if you really must avoid a certain punishment, is simply make that punishment -1000 (while keeping positive rewards fixed to less than 100). The machine will soon learn to avoid it.]
In other words, who is to say that -100 is a big punishment? It looks big, but if most rewards are +1000, then it is actually a small punishment. What matters is not the absolute sizes of r's, but their relative sizes.



Is -100 a big punishment?

A machine with reward function: "if good thing r = 1 million, if bad thing r = -1 million, else r = 0" will perform no different to a machine with reward function: "if good thing r = 1, if bad thing r = -1, else r = 0"
Proof: See following.



Multiply r's by constant, or Add constant to all r's

Q(x,a) =

  1. If multiply all rewards by some constant c > 0, then multiply all Q-values by c.
    Same policy (if c > 0).
    If c < 0, it will change the order of Q-values. Also it will change Q-values themselves, since future policy will be changed, since good r's are now bad r's (remember Q(x,a) = reward for a, plus the best you can do thereafter).
    If c=0, random policy.

  2. If add some constant c to all rewards, then add:
    c + γ c + γ2 c + ..
    to all Q-values.
    Same policy, whether c positive or negative.

See also Appendix C and Appendix D.



2-reward reward functions

 if (good event) 10 else 5
is the same as (add -5):
 if (good event) 5 else 0
same as (multiply by 2/5):
 if (good event) 2 else 0
same as (add 3):
 if (good event) 5 else 3


In general:

 if (good event) r1 else r2
is the same as:
 if (good event) (r1-r2) else 0
same as (multiply by something positive, r3 > r4):
 if (good event) (r3-r4) else 0
same as:
 if (good event) r3 else r4

So any 2-reward fn can be transformed into any other, provided we don't change the order of the rewards.



3-reward (or more) reward functions

This does not hold with 3-reward (or more) fns.
Any 3-reward fn can not be transformed into any other.
 if (good event) r1 else if (bad event) r2 else r3
is the same as:
 if (good) (r1-r3) else if (bad) (r2-r3) else 0
i.e. (where (r3-r2) positive):
 if (good) (r1-r3) else if (bad) - (r3-r2) else 0
same as:
 if (good) (r4-r6) else if (bad) - (r3-r2)(r4-r6)/(r1-r3) else 0
same as:
 if (good) r4 else if (bad) r6 - ((r3-r2)(r4-r6)/(r1-r3))  else r6

Can get 1st and last rewards to anything, but then restricted in what middle reward can be.
Can't transform it into arbitrary r5.


Note that the "bad" reward is indeed less than both r4 and r6:

Note the 3 components here are positive, so:
  (r3-r2)(r4-r6)/(r1-r3) > 0
So:
  r6 - ((r3-r2)(r4-r6)/(r1-r3)) < r6

And:

  r6 < r4

Summary

2-rewards - only 1 good thing - 1 goal - will always pursue it

3-rewards - 2 good things - 2 goals - many ways to balance how you pursue both - a few hits of middle r may compensate for losing one high r - may pursue middle r and ignore higher one



99 percent of model entries will be 0

To build a model of the "laws of physics" we need a huge data structure   x,a,y -> [0,1] in which 99 % of the entries will be zero.
e.g. Let x = Waterhole is N, a = move N, y = Waterhole is here.
Pxa(x) = 0.5
Pxa(y) = 0.5
Pxa(w) = 0 for all the millions of w other than x,y
Millions of them because the waterhole is actually only 1 dimension of the entire n-dimensional input state:
Pxa(w,*,*,*,...,*) = 0

e.g. Chess.
x = opening configuration of board
a = move white pawn
Opponent then moves.
Observe new state y
Pxa(y) = 0 for billions of states y (e.g. all check states, all states without full no. of pieces)

In most worlds, for any given x,a pair, the probability of going to millions of y is zero.




Redoing updates with more accurate Q(y,b)

As Q-values improve in accuracy, we need to re-visit updates we already did. We previously updated:

The r was accurate.
Q(y,b) was totally inaccurate, however, so once we learn an accurate value for Q(y,b), we should really do this update again.



Replay

If we do store a world model, then we can re-do updates in our head, factoring in more accurate Q(y,b)'s.

e.g. In waterhole above, we have explicitly stored the fact that y,b -> z
So as soon as maxd Q(z,d) increases (i.e. as soon as we discover action c), we can reflect this change in Q(y,b), without having to revisit y. And we know that x,a -> y so we can further update Q(x,a) without having to revisit x.

Need to be careful about computational cost of this backtracking:

  1. How find previous state? Many states lead to z. Search all states? Have a reverse index?
  2. And which states lead to those? Many states lead to y which leads to z. How many levels back do we go? This will take time.
  3. Probabilistic world: State leads to z only some times. If you store   x,a,y -> [0,1] find the states that lead to z (Ever? More than p=0.5?).

There is no limit to this replay: As Q(x,a) gets more accurate, the states that lead to x can be updated with more accurate next-Q-value estimates, and so on. z itself may eventually lead to x in a big loop, so that gets updated, and then x leads eventually to z, so that gets updated again, and so on. We can replay any finite set of experiences forever, updating perhaps all Q-values in the system many times before whole system finally converges.



Discussion

"Dreaming"? - "Reasoning"? - Explicit memories.

Things you can do offline: Are these perhaps best done when the brain is not otherwise occupied? i.e. Perhaps best done when asleep?


If we don't store a model, we have to wait until y leads to z again before we update Q-values for y (and wait until x leads to y again, etc.) - Implicit memories (Q-values reflect our past experience implicitly).


Model good for changing worlds/problems. Also good if experiences are costly / rare.

Instead of building model (huge data structure, mostly redundant, i.e. mostly 0's) we could just store whatever experiences we had and then replay.

Replay more complex with probabilistic world - must be fair.
e.g.   x,a -> y with probability 0.1,   x,a -> z with prob. 0.9.
Say you just experience   x,a -> y once, and then replay it 20 times.
- A model may be more trouble than it's worth.




Has the world changed?

Model is useful for changing world - easier to spot that dynamics of the world have changed if you have a prediction of what they should have been.

Still tricky to detect because MDP allows probabilistic world.


How do you detect world has changed in a MDP?

  1. Observe x,a -> y twice. Then x,a -> z
    - Has the world changed? Or are we just learning more about the same MDP?

  2. Observe x,a -> y 1000 times in a row. Then x,a -> z 1000 times in a row.
    - Has the world changed? (Consider, if it is all the same MDP, what Pxa(y) and Pxa(z) might be. Then how likely get 1000 in a row?)


Non-Markov world allows probabilities to change - and later change back!

Very hard to detect:
  1. Your MDP has changed to another MDP.
  2. Your MDP has changed to a non-Markov world.
  3. Your world was never a MDP (as you thought) but actually a non-Markov world.
  4. Your non-Markov world has changed to another non-Markov world.
  5. Your non-Markov world has changed to a MDP.
  6. Your world was never a non-Markov world (as you thought) but actually a MDP.
Further reading:



"Go to state z"

Model useful when give the agent new goals in the same world: "Go to state z". See Feudal Q-learning.



Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.