School of Computing. Dublin City University.
Let be samples of a stationary random variable d with expected value E(d). Then the following update algorithm provides an elegant way of sampling them. Repeat:
Proof: D's updates go:
i.e. D is simply the average of all samples so far.
As , .
to mean we adjust the estimate D once in the direction of the sample d.
where is the learning rate.
d < D
then the running average D goes down.
If d > D then the running average D goes up.
means that over many such updates the estimate D converges to the expected value of the samples d.
Let D be updated by:
where d is bounded by , , and the initial value of . Then:
Proof: The highest D can be is if it is always updated with :
so . Similarly .
α must start at 1 to get rid of Dinit
Thereafter α can be anything from 0 to 1.
e.g. Imagine if
starts at 1, and
thereafter we let
α = 11.
Say we get a non-zero sample d once, then 0 forever.
Samples: d, 0, 0, ....
Then D ends up with no apparent relationship to the samples d and 0.
And D is not even bounded. It goes:
D := 0 + 1.d (since α starts at 1)
D := -10 d + 0
D := -10 (-10 d) + 0
D := - 103 d
D := 104 d
D := - 105 d
D not bounded
α not between 0 and 1 makes no sense.
An alternative approach is to build up not the average of all samples, but a time-weighted sum of them, giving more weight to the most recent ones. This is accomplished by setting to be constant, in which case D's updates go:
Since as , this weights more recent samples higher.
Typical learning rate might be .