Let be samples of a stationary random variable

**Proof:**
D's updates go:

i.e. D is simply the average of all samples so far.

As , .

to mean we adjust the estimate D *once* in the direction of the sample *d*.

- If we store D explicitly as a variable, we may update:
where is the

*learning rate*.If

`d < D`then the running average D goes down.

If`d > D`then the running average D goes up. - If D is not stored as a variable, but is the
*output*of a neural network, we may backpropagate the error*D*-*d*.

The notation:

means that over many such updates
the estimate D converges to the expected value of the samples *d*.

Let D be updated by:

where *d* is bounded by , ,
and the initial value of
.
Then:

**Proof:**
The highest D can be is if it is always updated with :

so . Similarly .

α must start at 1
to get rid of D_{init}

Thereafter α can be anything from 0 to 1.

It only makes sense if it is between 0 and 1.

e.g. Imagine if
α
starts at 1, and
thereafter we let
α = 11.

Say we get a non-zero sample d once, then 0 forever.

Samples:
d, 0, 0, ....

Then D ends up with no apparent relationship to the samples d and 0.

And D is not even bounded.
It goes:

D := 0 + 1.d (since α starts at 1)

D := -10 d + 0

D := -10 (-10 d) + 0

D := - 10^{3}d

D := 10^{4}d

D := - 10^{5}d

....

D not bounded

α not between 0 and 1 makes no sense.

An alternative approach is to build up not the average
of all samples, but a *time-weighted* sum of them,
giving more weight to the most recent ones.
This is accomplished by setting to be constant,
in which case D's updates go:

Since as , this weights more recent samples higher.

Typical learning rate might be .