Dr. Mark Humphrys

School of Computing. Dublin City University.



Learning rate that does not start at 1

Recall convergence conditions.

The typical α goes from 1 down to 0, but note that if the conditions hold, then for any t, $\sum_{i=t}^{\infty} \alpha_i = \infty$ and $\sum_{i=t}^{\infty} \alpha_i^2 < \infty$, so α may start anywhere along the sequence. That is, α may take successive values $\alpha_t, \alpha_{t+1}, \alpha_{t+2}, \ldots$
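
A short check of why the tail of the sequence inherits the conditions, assuming the usual convergence conditions on the full sequence (the sum of the learning rates diverges, the sum of their squares is finite). Dropping the first t-1 terms removes only a finite amount from each sum:

```latex
\begin{align*}
\sum_{i=t}^{\infty} \alpha_i
  &= \sum_{i=1}^{\infty} \alpha_i - \sum_{i=1}^{t-1} \alpha_i = \infty \\
\sum_{i=t}^{\infty} \alpha_i^2
  &\le \sum_{i=1}^{\infty} \alpha_i^2 < \infty
\end{align*}
```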




Q-learning will forget bad samples at the start

To "forget" old stuff, you could reset α = 1.
But in fact you don't have to:
α may start anywhere along the sequence and the conditions for convergence are still satisfied. So you can just keep learning, and the old stuff is eventually wiped.

e.g. Say the world changes from MDP1 to MDP2 after time t. Just keep going with Q-learning and it will (eventually) learn the optimal policy for MDP2 and (eventually) forget what it learnt for MDP1. No need to change anything.

Q-learning automatically adapts if the world/problem changes.
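
The following is a minimal sketch of this, not a general implementation. It assumes a toy world with one state and two actions, in which MDP1 rewards action 0 and MDP2 rewards action 1. The function name, the rewards, γ = 0.5, and the round-robin action choice (used in place of random exploration so the run is repeatable) are all illustrative assumptions. The learning rate is α = 1/n per action and is never reset at the switch.

```python
def q_learning_with_switch(steps_mdp1=1000, steps_mdp2=10000, gamma=0.5):
    """Tabular Q-learning on a one-state, two-action world that changes once."""
    rewards_mdp1 = [1.0, 0.0]    # hypothetical MDP1: action 0 is the good one
    rewards_mdp2 = [0.0, 1.0]    # hypothetical MDP2: action 1 is the good one
    q = [0.0, 0.0]               # Q-values for the single state
    visits = [0, 0]              # n for each action, never reset
    greedy_at_switch = None

    for step in range(steps_mdp1 + steps_mdp2):
        if step == steps_mdp1:
            # Record the greedy action learnt under MDP1 just before the change.
            greedy_at_switch = q.index(max(q))
        rewards = rewards_mdp1 if step < steps_mdp1 else rewards_mdp2

        action = step % 2        # round-robin exploration (deterministic)
        visits[action] += 1
        alpha = 1.0 / visits[action]

        # The next state is the same single state, so bootstrap on max(q).
        target = rewards[action] + gamma * max(q)
        q[action] = (1.0 - alpha) * q[action] + alpha * target

    greedy_at_end = q.index(max(q))
    return greedy_at_switch, greedy_at_end, q


before, after, q_values = q_learning_with_switch()
print("greedy action under MDP1:", before)
print("greedy action after MDP2:", after)
```

Because α keeps shrinking rather than being reset, the forgetting is slow: in this sketch the samples from MDP2 have to outnumber the samples from MDP1 before the greedy action flips.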



Starting α at 1/t

Recall our running average.

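Let $d_t, d_{t+1}, d_{t+2}, \ldots$ be samples of a stationary random variable d with expected value E(d). Repeat, for n = t, t+1, t+2, ...:

\[ D := \left(1 - \frac{1}{n}\right) D + \frac{1}{n} d_n \]


Theorem: $D \to E(d)$ as $n \to \infty$, independent of the initial value of D.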

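Proof: Write $D_{\text{init}}$ for the initial value of D. D's updates go:

\[
\begin{aligned}
D_t &= \left(1 - \frac{1}{t}\right) D_{\text{init}} + \frac{1}{t} d_t = \frac{(t-1) D_{\text{init}} + d_t}{t} \\
D_{t+1} &= \frac{t}{t+1} D_t + \frac{1}{t+1} d_{t+1} = \frac{(t-1) D_{\text{init}} + d_t + d_{t+1}}{t+1} \\
&\;\;\vdots \\
D_n &= \frac{(t-1) D_{\text{init}} + d_t + \cdots + d_n}{n}
\end{aligned}
\]

As $n \to \infty$:

\[ D_n = \frac{(t-1) D_{\text{init}}}{n} + \frac{n-t+1}{n} \cdot \frac{1}{n-t+1} (d_t + \cdots + d_n) \to 0 + 1 \cdot E(d) \]

that is, $D \to E(d)$. $\Box$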
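
A minimal numerical sketch of the theorem, under stated assumptions: the samples are Gaussian with mean 5, and the seed, the values of t and the initial D, and the name `running_average_from_t` are illustrative choices, not part of these notes. It starts α at 1/t with a deliberately bad initial value, and compares the result with the closed form that the updates reduce to, D_n = ((t-1) D_init + d_t + ... + d_n) / n.

```python
import random


def running_average_from_t(samples, t, d_init):
    """Apply D := (1 - 1/n) D + (1/n) d_n for n = t, t+1, ..., starting from d_init."""
    d = d_init
    n = t
    for sample in samples:
        alpha = 1.0 / n          # learning rate starts at 1/t, not at 1
        d = (1.0 - alpha) * d + alpha * sample
        n += 1
    return d


rng = random.Random(0)           # fixed seed so the run is repeatable
t = 100
d_init = 50.0                    # arbitrary (bad) initial value of D
samples = [rng.gauss(5.0, 1.0) for _ in range(200_000)]   # d_t .. d_n

estimate = running_average_from_t(samples, t, d_init)

# Closed form: D_n = ((t-1) * D_init + d_t + ... + d_n) / n
n_final = t + len(samples) - 1
closed_form = ((t - 1) * d_init + sum(samples)) / n_final

print("running estimate:", estimate)
print("closed form:     ", closed_form)
```

The estimate ends up close to E(d) = 5 even though D started at 50, because the initial value is only ever weighted by (t-1)/n.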



One way of looking at this is to consider the initial value $D_{\text{init}}$ as the average of all samples before time t, samples which are now irrelevant for some reason. We can consider them as samples from a different distribution f:

\[ D_{\text{init}} = \frac{1}{t-1} (f_1 + \cdots + f_{t-1}) \]

Hence:

\[ D_n = \frac{1}{n} (f_1 + \cdots + f_{t-1} + d_t + \cdots + d_n) \to E(d) \]

as $n \to \infty$.


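Because the f terms are a fixed finite sum, which goes to 0 when divided by n, while:

\[ \frac{1}{n} (d_t + \cdots + d_n) = \frac{n-t+1}{n} \cdot \frac{1}{n-t+1} (d_t + \cdots + d_n) \to 1 \cdot E(d) \]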
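
A small sketch of this view, using made-up numbers (the sample values and the helper name `running_average` are assumptions for illustration). Averaging the old samples first and then continuing from n = t gives the same result as one running average over everything, which is just the plain mean of all samples, old and new.

```python
def running_average(samples, start_n=1, d=0.0):
    """Apply D := (1 - 1/n) D + (1/n) sample for n = start_n, start_n + 1, ..."""
    n = start_n
    for sample in samples:
        d = (1.0 - 1.0 / n) * d + (1.0 / n) * sample
        n += 1
    return d


old_samples = [10.0, 12.0, 8.0, 11.0]            # f_1 .. f_{t-1}, now irrelevant
new_samples = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]     # d_t .. d_n
t = len(old_samples) + 1

# The initial value of D is the average of the old samples.
d_init = running_average(old_samples)

# Continue from n = t with that initial value ...
continued = running_average(new_samples, start_n=t, d=d_init)

# ... which matches a single pass over all samples, i.e. their plain mean.
all_samples = old_samples + new_samples
one_pass = running_average(all_samples)
plain_mean = sum(all_samples) / len(all_samples)

print(continued, one_pass, plain_mean)
```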



Initial bias

If we start at α = 1/t, then the initial Q-values bias our Q-values for some time.
And since any experiment runs for only a finite time, the bias may still be there after learning.
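
To put a rough number on the bias, here is a sketch for the running-average case above (not for full Q-learning, where the targets themselves also depend on other Q-values). The function name is hypothetical. It multiplies out the factors (1 - 1/k) applied to the initial value over updates k = t, ..., n, which telescopes to (t-1)/n.

```python
def initial_value_weight(t, n):
    """Weight still carried by the initial value after updates k = t .. n with alpha = 1/k."""
    weight = 1.0
    for k in range(t, n + 1):
        weight *= 1.0 - 1.0 / k      # each update shrinks the old value by (1 - alpha)
    return weight


for n in (100, 1000, 10000):
    print(n, initial_value_weight(100, n), (100 - 1) / n)
```

With t = 100, the initial value still carries about 10 percent of the weight at n = 1000 and about 1 percent at n = 10000, so the later α starts along the sequence, the longer the initial bias lingers. Starting at t = 1 gives weight 0: the first update with α = 1 wipes the initial value completely.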

Consider being "born" with Q-values already filled in (i.e. in DNA) and then starting to learn:




Not-quite Lamarckism in nature




On Internet since 1987.