Dr. Mark Humphrys

School of Computing. Dublin City University.



Q-learning

Recall we are trying to maximise long-term discounted reward.
Taking action a in state x yields long-term discounted reward:

E(r) + γ E(r1) + γ² E(r2) + ...

where r is the immediate reward, rt is the reward received t steps later, and γ is the discount factor, 0 ≤ γ < 1.
Q(x,a) values

The Q-learning agent builds up Quality-values (Q-values) Q(x,a) for each pair (x,a).

We tend to take the action with the highest Q-value:

maxb Q(x,b)
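Greedy action selection can be sketched as follows (the state, actions and Q-values below are made up for illustration):

```python
# Q-values for one state, two actions (illustrative numbers only).
Q = {('x1', 'left'): 0.4, ('x1', 'right'): 0.9}

def greedy_action(x, actions, Q):
    # argmax over b of Q(x,b): pick the action with the highest Q-value
    return max(actions, key=lambda b: Q[(x, b)])

print(greedy_action('x1', ['left', 'right'], Q))   # 'right'
```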

The Q-value will express the discounted reward if you take that action in that state:

Q(x,a) = E(r) + γ E(r1) + γ² E(r2) + ...

This can be built up recursively:

Q(x,a) = E( r + γ maxb Q(y,b) )

where y is the (possibly random) next state.

Naive attempt at a learning algorithm

You are in state x, take action a, get reward r, and arrive in new state y. Then update:

Q(x,a) := r + γ maxb Q(y,b)

Recall Assignment Notation


Problem: We allow a probabilistic world (an MDP), where taking the same action a in the same state x can have different results:
r may vary, and y may vary.
We want the Q-value to average over these results, instead of just expressing the most recent one.

With the above rule, the Q-value will bounce back and forth:

Q(x,a) := 5
Q(x,a) := 10
Q(x,a) := 6
...

More sensible to build up an average:

Q(x,a) := 5
Q(x,a) := 1/2 (5 + 10)
Q(x,a) := 1/3 (5 + 10 + 6)
...

How do we do this? Do we need an ever-growing memory of all past outcomes?




Building up a running average

If An is the average of the first n samples d1, ..., dn, it can be updated incrementally:

An+1 = An + 1/(n+1) ( dn+1 - An )

So we need only store the current average and a count of the samples seen so far, not the samples themselves.
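A minimal sketch of this incremental averaging, using the sample values from the example above:

```python
def running_average(samples):
    # Maintain the average without storing past samples:
    # A_{n+1} = A_n + 1/(n+1) * (d_{n+1} - A_n)
    avg = 0.0
    for n, d in enumerate(samples, start=1):
        avg += (d - avg) / n
    return avg

print(running_average([5, 10, 6]))   # 7.0, same as (5 + 10 + 6) / 3
```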

Core Q-learning algorithm

In 1-step Q-learning, after each experience, we observe the new state y, receive reward r, and update:

Q(x,a) := (1-α) Q(x,a) + α ( r + γ maxb Q(y,b) )

(remember notation). Here α is the learning rate, 0 < α ≤ 1.

The Q-value is updated in the direction of the current sample r + γ maxb Q(y,b).

Note that we are updating from the current estimate of Q(y,b) - this too is constantly changing.

If we store each Q(x,a) explicitly in lookup tables, we can update with learning rate:

α = 1/(1 + n(x,a))

where n(x,a) is the number of times action a has been taken in state x, so that the Q-value is the average of all samples observed so far.
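The tabular update can be sketched as below. The decaying learning rate makes the Q-value the exact average of the samples seen so far; the toy state names and rewards are invented for illustration.

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)       # Q(x,a), initialised to 0
visits = defaultdict(int)    # n(x,a): number of times (x,a) has been tried

def q_update(x, a, r, y, actions):
    """1-step Q-learning: move Q(x,a) towards the sample r + gamma * max_b Q(y,b)."""
    visits[(x, a)] += 1
    alpha = 1.0 / visits[(x, a)]                        # decaying learning rate
    sample = r + GAMMA * max(Q[(y, b)] for b in actions)
    Q[(x, a)] += alpha * (sample - Q[(x, a)])

# Same action in the same state yields different rewards (a probabilistic world):
for r in [5, 10, 6]:
    q_update('x', 'a', r, 'terminal', ['a'])

print(Q[('x', 'a')])   # 7.0: the average of the three samples
```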
Q-values are bounded

Since the rewards are bounded, it follows that the Q-values are bounded.

Theorem: The Q-values are bounded:

Qmin ≤ Q(x,a) ≤ Qmax

where Qmax = rmax / (1-γ) and Qmin = rmin / (1-γ).

Proof: In the discrete case, Q is updated by:

Q(x,a) := (1-α) Q(x,a) + α ( r + γ maxb Q(y,b) )

so by Theorem B.1, if all Q-values start in [Qmin, Qmax], the sample r + γ maxb Q(y,b) lies in:

[ rmin + γ Qmin , rmax + γ Qmax ] = [ Qmin , Qmax ]

and each update, being a weighted average of the old Q-value and the sample, stays in [Qmin, Qmax].

This can also be viewed in terms of temporal discounting:

Qmax = rmax + γ rmax + γ² rmax + ... = rmax / (1-γ)

Similarly:

Qmin = rmin + γ rmin + γ² rmin + ... = rmin / (1-γ)

For example, if γ = 0, then Qmax = rmax. And (assuming rmax > 0) as γ → 1, Qmax → ∞.

Note that since rmin ≤ rmax, it follows that Qmin ≤ Qmax.
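A quick numeric sanity check of these bounds on a made-up random MDP (the states, rewards and transitions are arbitrary):

```python
import random

random.seed(0)
GAMMA = 0.8
RMIN, RMAX = -1.0, 1.0
QMIN, QMAX = RMIN / (1 - GAMMA), RMAX / (1 - GAMMA)   # -5.0 and 5.0

states, actions = [0, 1, 2], [0, 1]
Q = {(x, a): 0.0 for x in states for a in actions}
n = dict.fromkeys(Q, 0)

for _ in range(10000):
    x, a = random.choice(states), random.choice(actions)  # random experience
    r = random.uniform(RMIN, RMAX)                        # bounded reward
    y = random.choice(states)                             # random next state
    n[(x, a)] += 1
    alpha = 1.0 / n[(x, a)]
    Q[(x, a)] += alpha * (r + GAMMA * max(Q[(y, b)] for b in actions) - Q[(x, a)])

# No update can push a Q-value outside [Qmin, Qmax]:
assert all(QMIN <= q <= QMAX for q in Q.values())
```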



Qmax and Qmin

What does it mean for a Q-value to actually reach these bounds?

Q(x,a) = Qmax means:
If you take this action, you will get immediate reward rmax,
and you can get next reward rmax (it is the best you can get in the next state; other actions may be worse),
and you can get next reward rmax after that,
and so on.

Q(x,a) = Qmin means:
If you take this action, you will get immediate reward rmin,
and you have to get next reward rmin (all actions in the next state lead to the worst reward rmin, otherwise the Q-value would be higher),
and you have to get next reward rmin after that,
and so on.

That is, you are trapped in an attractor from which you cannot escape.
You can never again get out and go to a state with rewards greater than rmin (and such states must exist).
This is not necessarily a single absorbing state, but a family of states that you can never leave.




Note that if the expected reward E(r) = rmin, then you will definitely get rmin on this action (r is deterministic, though y can still be probabilistic).
For if any reward r > rmin had non-zero probability (1-p) > 0, then we would have E(r) > rmin:

E(r) = p rmin + (1-p) r
     > p rmin + (1-p) rmin
     = rmin
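A tiny numeric check of this inequality (the probability p and reward values are arbitrary):

```python
rmin, r, p = 0.0, 2.0, 0.7           # r > rmin occurs with probability 1-p = 0.3
expected = p * rmin + (1 - p) * r    # E(r) = 0.7*0 + 0.3*2 = 0.6
assert expected > rmin               # strictly above rmin, as the proof claims
```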


