Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA249      CA318      CA425      CA651

w2mind.computing.dcu.ie      w2mind.org


Reinforcement Learning Notes



The basic algorithm of the agent:

do forever
  observe x
  suggest a according to some decision-algorithm
  execute a
  observe y
  receive reward r(y) or r(x,y)
  modify decision-algorithm
  repeat

a can be "no-op".
r can be 0 every step when nothing happens.

Sample definitions of x:



Probabilistic Transitions

Start in state x:

In state x,
taking action b gives a guaranteed reward of 1.5.
Taking action a normally leads to a reward of only 1, but sometimes leads to a reward of 5.

Pxa(y) = 0.8
Pxa(z) = 0.2
Pxa(w) = 0 for all w other than y,z
(probabilistic transition)

Pxb(s) = 1
Pxb(w) = 0 for all w other than s
(deterministic transition)




Non-Markovian worlds

x = ( direction )
Consider:
P ( xt+1=bump   |   xt=st,   at=bt,   xt-1=st-1,   at-1=bt-1,   ... )


P ( bump | wall is N, move N, wall is N, move N, wall is N, move N )
is higher than:
P ( bump | wall is N, move N, wall not visible, move N, wall not visible, move N )

i.e. There isn't simply a constant:
P ( bump | wall is N, move N )


Q. What simple change would turn the above into a MDP (in fact, a deterministic MDP)?

Consider:
  1. Deterministic world that looks deterministic.
  2. Deterministic world that looks probabilistic.

  3. Probabilistic world that looks probabilistic.
  4. Probabilistic world that looks deterministic.




2.1.3 - Expected Reward

In the above, How do we decide which action to take?
Look at expected reward:
Ea(r) = 0.2 (5) + 0.8 (1)
          = 1.8
Eb(r) = 1 (1.5)
          = 1.5
Action a is better.

What do the probabilities mean?
They mean that if you take a 100 times, you expect 20 5's and 80 1's. Total reward 180.
If you take b 100 times you expect 100 1.5's. Total reward 150.

Maximise E(r).


rmin ≤ E(r) ≤ rmax

If all rewards between rmin and rmax
E(r) = p1 r1 + ... + pn rn
≥ p1 rmin + ... + pn rmin
= 1 . rmin


E(r) exists, E(y) doesn't exist

Imagine a state definition for a robot carrying blocks to put in a bin:
x = (x1,x2)
x1 = what you are carrying: 0=not carrying block, 1=carrying block
x2 = direction of bin, values 0-9

You are in state x = (1,5).
Imagine that if you take some action a, you go to state y with probabilities:


y               Pxa(y)          reward

(1,8)           0.4             10
(0,8)           0.3             5
(1,5)           0.2             1
(0,5)           0.1             0

You can calculate E(r).
Q. What is it?

But "E(y)" is meaningless.
Consider the sum of the y's times their probabilities
= (1,8) 0.4 + (0,8) 0.3 + (1,5) 0.2 + (0,5) 0.1
= (0.6, 7.1)

You are never in this state when you take a.
In fact, this state doesn't exist.
"0.6" doesn't have a meaning.
The state space is not a continuous space. It is a discrete space of points.



r(x,y)

Typically, we don't reward for the correct action, r = r(x,a):

in state x
if took action b
  then reward 1
else
  reward 0
because we don't know the correct action!
Instead we reward on getting to some state (somehow, we don't know how), r = r(x,y):
in state x
if got to state z
  then reward 1
else
  reward 0




e.g. In the following:

Taking action a in state x,
P(r=5) = 0.2
P(r=0) = 0.4
P(r=3) = 0.3 + 0.1
          = 0.4
Question - What is E(r)?



Reward on Continuous or Transition?

Reward on transition to good state only: r = r(x,y).
e.g. x = not at waterhole, y = at waterhole. Problem - may leave state to come back in! See Discussion.

Reward for arriving and staying in good state: r = r(y).




2.1.4 Deterministic v. Stochastic Policy

a       a1   a2    a3
Q(x,a)  0    5.1   5
Deterministic policy - When in x, always take action a2.

Stochastic policy - When in x, 80 % of time take a2, 20 % of time take a3.

Typically - Start stochastic. Become more deterministic as we learn.




Short-term loss for Long-term gain

Not just maximise immediate E(r) (i.e. deal with probability).

Maximise long-term sum of expected rewards. This may involve taking a short-term loss for long-term gain (equivalent to allowing a move downhill in hill-climbing).




Discounted Reward

0 ≤ γ < 1

(e.g. γ = 0.9)


If rewards can be positive or negative, R may be a sum of positive and negative terms.

Special case: γ = 0
Doesn't work: γ = 1 (see later)

The interesting situation is in the middle (0 < γ < 1)


Short-term loss for Long-term gain

r in the future can even override immediate punishment

If we decide what to do based on which path has highest R, then [punishment now, r in the future] can beat [reward now, but not in future] if:

  1. r is big enough
  2. r is near enough - not discounted by γn
  3. both of the above




Bounds

If maximum reward is rmax, what is maximum possible value of R?
The most it could be would be to get the maximum reward every step:


What is this quantity on the right? Let:

Multiply by γ:

Subtract the 2 equations:

And we get a value for X:

That is:

See Appendix A and Appendix B.

This is just the Negative Binomial Series in the Binomial Theorem.
(x+1)(-1) = 1 - x + x2 - x3 + x4 - ...
Let x = - γ



Exercise




2.1.5 DP

Note in DP if γ = 0 (only look at immediate reward) you only need one iteration of updates and you are done.
You have worked out immediate E(r) completely.



2.1.5 How do I return to state x?

In the real world, we may not know how to get back to state x to try another action:
x = "I am at party"
a = "I say chat-up line c1"
result - punishment

I might want to say "Go back to x, try c2" but I may not know how to get back to x. All I can do is try out actions in the states in which I happen to find myself. I have to wait until state x happens to present itself again.

Q-learning can work even when states and actions are experienced haphazardly.




2.1.5 - DP and Q-learning

Q-learning is for when you don't know Pxa(y) and Pxa(r).

But one approach is just:
Interact with world large number of times to build up a table of estimates for Pxa(y) and Pxa(r).
Then just do offline DP.




2.1.5 Q-learning

We keep Q-values Q(x,a) for each (x,a) pair.
We tend to take the action with the highest Q-value:

maxb Q(x,b)

The Q-value will express the discounted reward above:

Can be built up recursively:




Sampling

Using this update rule:

the Q-value will bounce back and forth:

Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...

More sensible to build up an average:

Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...

See Appendix A for how to build up a running average like this without needing infinite memory.

In general, we use the Notation of updating the running average D "in the direction of" the latest sample d, rather than assigning directly to it:

If   d < D   then the running average D goes down. If   d > D   then the running average D goes up.

So the Q-value is updated in the direction of the current sample:



D is bounded

Theorem B.1 - If d bounded, then D bounded.

α must start at 1 to get rid of Dinit

Thereafter α can be anything from 0 to 1.



Q is bounded

Theorem B.2

i.e. If take this action, you will get immediate reward rmax
and you can get next reward rmax (it is the best you can get in next state, other actions may be worse)
and you can get next reward rmax ..
....

i.e. If take this action, you will get immediate reward rmin
and you have to get next reward rmin (all actions in next state lead to the worst reward rmin, otherwise the Q-value would be higher)
and you have to get next reward rmin ..
....

i.e. You are trapped in an attractor from which you cannot escape.
You cannot ever again get out and go to a state with rewards greater than rmin (and such states must exist).
Not necessarily a single absorbing state, but a family of states outside of which you can never leave.




Note if expected reward E(r) = rmin, then you will definitely get rmin on this action (r deterministic, though y can still be probabilistic).
If any reward r   >   rmin has non-zero probability, then E(r)   >   rmin

E(r) = p rmin + (1-p) r
> p rmin + (1-p) rmin
= rmin



How Q-learning works




Building a world model




Convergence




The control policy




Appendix A


Theorem A.2

If start at:   α = 1/t   then initial Q-values bias our Q-values for some time.
And since we only run for finite time in any finite experiment, the bias may still be there after learning.

Consider being "born" with Q-values already filled in (i.e. in DNA) and then start learning:




Feeds      HumphrysFamilyTree.com

Bookmark and Share           On Internet since 1987.