Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact

Search:

CA216      CA249      CA318

CA400      CA651      CA668


Building up a running average

How to build up a running average of samples without needing an ever-growing memory of all samples as you go along.

Let tex2html_wrap_inline9198 be samples of a stationary random variable d with expected value E(d). Then the following update algorithm provides an elegant way of sampling them. Repeat:

displaymath9186



theorem2921

Proof: D's updates go:

displaymath9187

i.e. D is simply the average of all tex2html_wrap_inline9210 samples so far.

As tex2html_wrap_inline9212, tex2html_wrap_inline9214.



Notation: "Adjust in direction of"

I use the notation:

displaymath6319

to mean we adjust the estimate D once in the direction of the sample d.


  1. If we store D explicitly as a variable, we may update:

    displaymath6320

    where tex2html_wrap_inline6315 is the learning rate.

    If   d < D   then the running average D goes down.
    If   d > D   then the running average D goes up.

  2. If D is not stored as a variable, but is the output of a neural network, we may backpropagate the error D - d .


The notation:

displaymath6321

means that over many such updates the estimate D converges to the expected value of the samples d.




If d is bounded, then D is bounded

Let D be updated by:

displaymath6289

where d is bounded by tex2html_wrap_inline9310 , tex2html_wrap_inline9312 , and the initial value of tex2html_wrap_inline6704 . Then:

theorem3220

Proof: The highest D can be is if it is always updated with tex2html_wrap_inline9310 :

displaymath9303

so tex2html_wrap_inline9322 . Similarly tex2html_wrap_inline9324. tex2html_wrap_inline7352

α must start at 1 to get rid of Dinit

Thereafter α can be anything from 0 to 1.



Learning rate must be between 0 and 1

Learning rate α is like a probability.
It only makes sense if it is between 0 and 1.

e.g. Imagine if α starts at 1, and thereafter we let α = 11.
Say we get a non-zero sample d once, then 0 forever.
Samples: d, 0, 0, ....
Then D ends up with no apparent relationship to the samples d and 0.
And D is not even bounded. It goes:

D := 0 + 1.d     (since α starts at 1)
D := -10 d + 0
D := -10 (-10 d) + 0
D := - 103 d
D := 104 d
D := - 105 d
....
D not bounded

α not between 0 and 1 makes no sense.



Constant learning rate (time-weighted)

An alternative approach is to build up not the average of all samples, but a time-weighted sum of them, giving more weight to the most recent ones. This is accomplished by setting tex2html_wrap_inline6284 to be constant, in which case D's updates go:

displaymath9240

Since tex2html_wrap_inline9244 as tex2html_wrap_inline9226 , this weights more recent samples higher.

Typical learning rate might be tex2html_wrap_inline9248 .




Feeds      w2mind.org

On Internet since 1987.