Dr. Mark Humphrys

School of Computing. Dublin City University.


Building up a running average

How to build up a running average of samples without needing an ever-growing memory of all samples as you go along.

Let d_1, d_2, d_3, ... be samples of a stationary random variable d with expected value E(d). Then the following update algorithm provides an elegant way of averaging them. Start with any D, and for the n-th sample repeat:

    D := (1 - 1/n) D + (1/n) d_n

Proof: D's updates go:

    D := d_1
    D := (1/2) d_1 + (1/2) d_2
    D := (2/3) ((1/2) d_1 + (1/2) d_2) + (1/3) d_3  =  (1/3) (d_1 + d_2 + d_3)
    ...
    D := (1/n) (d_1 + ... + d_n)

i.e. D is simply the average of all n samples so far.

As n → ∞, D → E(d).

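As a quick check, here is a minimal Python sketch of the scheme (my own illustration, not code from these notes): D is maintained with the 1/n learning rate and, at every step, equals the plain average of the samples seen so far.

    import random

    def running_average_demo(num_samples=1000, seed=0):
        """Maintain D with the update D := (1 - 1/n) D + (1/n) d_n."""
        rng = random.Random(seed)
        D = 0.0                                  # initial value is irrelevant: 1/1 = 1 wipes it out
        total = 0.0
        for n in range(1, num_samples + 1):
            d = rng.uniform(0.0, 10.0)           # sample of a stationary random variable
            total += d
            alpha = 1.0 / n                      # learning rate for the n-th sample
            D = (1 - alpha) * D + alpha * d      # adjust D once in the direction of d
            assert abs(D - total / n) < 1e-9     # D is the average of all samples so far
        return D                                 # tends to E(d) = 5.0 as n grows

    print(running_average_demo())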

Notation: "Adjust in direction of"

I use the notation:

    D ↝ d

to mean we adjust the estimate D once in the direction of the sample d.

  1. If we store D explicitly as a variable, we may update:

        D := (1 - α) D + α d

    where α is the learning rate.

    If   d < D   then the running average D goes down.
    If   d > D   then the running average D goes up.

  2. If D is not stored as a variable, but is the output of a neural network, we may backpropagate the error D - d (see the sketch below).

The notation:

    D ↝ E(d)

means that over many such updates the estimate D converges to the expected value of the samples d.

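These two cases are closely related. Here is a minimal Python sketch (my own illustration, not code from these notes): if D is itself the adjustable quantity, one gradient-descent step on the squared error ½ (D - d)², whose gradient with respect to D is the error (D - d), is exactly the explicit update of case 1, with the step size playing the role of α.

    def adjust_towards(D, d, alpha):
        """One 'adjust D in the direction of d' update."""
        return (1 - alpha) * D + alpha * d

    def gradient_step(D, d, eta):
        """One gradient-descent step on the squared error 0.5 * (D - d)**2.
        The gradient with respect to D is (D - d), the error we would backpropagate."""
        return D - eta * (D - d)

    D, d, rate = 3.0, 7.0, 0.25
    print(adjust_towards(D, d, rate))   # 4.0
    print(gradient_step(D, d, rate))    # 4.0 -- the same update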

If d is bounded, then D is bounded

Let D be updated by:

    D := (1 - α) D + α d

where d is bounded by |d| ≤ M, 0 ≤ α ≤ 1, and the initial value of α is 1. Then:

    |D| ≤ M

Proof: The highest D can be is if it is always updated with M:

    D := 0·D_init + 1·M  =  M      (first update, since α starts at 1)
    D := (1 - α) M + α M  =  M     (every later update, since 0 ≤ α ≤ 1)

so D ≤ M. Similarly D ≥ -M. □

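A quick numerical check of this bound, as a Python sketch (illustrative only, not from these notes): feed D random samples bounded by M and random learning rates in [0, 1], starting with α = 1, and confirm D never leaves [-M, M].

    import random

    rng = random.Random(1)
    M = 5.0                                       # bound on the samples: |d| <= M
    D = 1000.0                                    # deliberately huge initial value
    for n in range(10000):
        d = rng.uniform(-M, M)                    # bounded sample
        alpha = 1.0 if n == 0 else rng.random()   # α starts at 1, thereafter anything in [0, 1)
        D = (1 - alpha) * D + alpha * d
        assert -M <= D <= M                       # D stays bounded once D_init is wiped out
    print("D stayed within [-M, M]; final D =", D)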

α must start at 1 to get rid of D_init

The first update is then D := 0·D_init + 1·d_1 = d_1, so the arbitrary initial value D_init drops out. Thereafter α can be anything from 0 to 1.

Learning rate must be between 0 and 1

Learning rate α is like a probability.
It only makes sense if it is between 0 and 1.

e.g. Imagine if α starts at 1, and thereafter we let α = 11.
Say we get a non-zero sample d once, then 0 forever.
Samples: d, 0, 0, ....
Then D ends up with no apparent relationship to the samples d and 0.
And D is not even bounded. It goes:

D := 0 + 1·d  =  d       (since α starts at 1)
D := -10 d + 0
D := -10 (-10 d) + 0  =  10² d
D := -10³ d
D := 10⁴ d
D := -10⁵ d
D is not bounded.

α not between 0 and 1 makes no sense.
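
A minimal Python sketch of this blow-up (my own illustration): with α = 1 for the first sample and α = 11 afterwards, and samples d, 0, 0, ..., each update multiplies D by -10, so |D| grows without bound.

    d = 1.0                      # the single non-zero sample
    samples = [d] + [0.0] * 9
    D = 0.0
    alpha = 1.0                  # α starts at 1
    for x in samples:
        D = (1 - alpha) * D + alpha * x
        print(D)                 # 1, -10, 100, -1000, ... : |D| grows without bound
        alpha = 11.0             # thereafter α = 11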

Constant learning rate (time-weighted)

An alternative approach is to build up not the average of all samples, but a time-weighted sum of them, giving more weight to the most recent ones. This is accomplished by setting α to be constant, in which case D's updates go:

    D := d_1
    D := (1 - α) d_1 + α d_2
    D := (1 - α)² d_1 + α (1 - α) d_2 + α d_3
    ...
    D := (1 - α)ⁿ⁻¹ d_1 + α (1 - α)ⁿ⁻² d_2 + ... + α (1 - α) d_{n-1} + α d_n

Since (1 - α)ⁿ → 0 as n → ∞, this weights more recent samples higher.

A typical learning rate might be a small constant between 0 and 1.

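As a Python sketch (the value α = 0.1 below is just an illustrative choice, not one from these notes): a constant learning rate turns the running estimate into an exponentially weighted average, and the weights on past samples can be read off directly.

    def time_weighted_average(samples, alpha=0.1):
        """Running estimate with a constant learning rate (exponentially weighted)."""
        D = None
        for d in samples:
            if D is None:
                D = d                            # first update uses α = 1, wiping out D_init
            else:
                D = (1 - alpha) * D + alpha * d
        return D

    samples = [2.0, 4.0, 6.0, 8.0]
    alpha = 0.1
    print(time_weighted_average(samples, alpha))             # ≈ 3.122
    # The same estimate written as an explicit weighted sum of the samples:
    weights = [(1 - alpha) ** 3] + [alpha * (1 - alpha) ** (3 - i) for i in range(1, 4)]
    print(sum(w * d for w, d in zip(weights, samples)))      # ≈ 3.122 (up to float rounding)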
