Q-learning with a Neural Network:

Input x,a.

Output y_{k} = Q(x,a).

We are not doing ordinary supervised learning: nobody gives us the correct output O for each input. Instead we are learning from rewards, one transition (x,a) -> (y,r) at a time, where y is the next state and r is the reward.

So for example in the discrete (lookup-table) case we do:

Q(x,a) <- Q(x,a) + alpha [ r + gamma max_b Q(y,b) - Q(x,a) ],

that is: we nudge Q(x,a) toward the target r + gamma max_b Q(y,b).

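The discrete lookup-table update can be sketched as follows; the learning rate, discount factor, states, actions, and rewards are all hypothetical toy values chosen for illustration:

```python
alpha = 0.5   # learning rate (assumed value)
gamma = 0.9   # discount factor (assumed value)

# Q-table: Q[(state, action)] -> value, defaulting to 0.
Q = {}

def q(x, a):
    return Q.get((x, a), 0.0)

def update(x, a, y, r, actions):
    # Q(x,a) <- Q(x,a) + alpha [ r + gamma max_b Q(y,b) - Q(x,a) ]
    target = r + gamma * max(q(y, b) for b in actions)
    Q[(x, a)] = q(x, a) + alpha * (target - q(x, a))

# One step of experience: state 0, action 'right', next state 1, reward 1.
update(0, 'right', 1, 1.0, actions=['left', 'right'])
print(Q[(0, 'right')])  # 0.5 after one update starting from zero
```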
In the neural-network version of Q-learning, we backpropagate the error

E = r + gamma max_b Q(y,b) - Q(x,a).

But of course the term

r + gamma max_b Q(y,b)

is just an estimate, and Q itself is changing as we go along.
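A minimal semi-gradient sketch of that update, using a linear function approximator with one-hot features in place of a full neural network (the state/action counts, step size, and toy transition are assumptions for illustration). With one-hot features the gradient of Q(x,a) with respect to the weights is just the feature vector, so the weight update reduces to the line shown:

```python
n_states, n_actions = 3, 2
alpha, gamma = 0.1, 0.9   # assumed step size and discount

# weights w[s][a]; with one-hot features phi(x,a) this is a table,
# but the update below is the generic gradient form.
w = [[0.0] * n_actions for _ in range(n_states)]

def q(x, a):
    return w[x][a]   # w . phi(x,a) with one-hot phi

def td_step(x, a, y, r):
    # error E = r + gamma max_b Q(y,b) - Q(x,a); the target term is
    # held fixed (semi-gradient), since it is itself just an estimate
    E = r + gamma * max(q(y, b) for b in range(n_actions)) - q(x, a)
    # grad of Q(x,a) w.r.t. w is phi(x,a) (one-hot), so:
    w[x][a] += alpha * E
    return E

e = td_step(0, 1, 2, 1.0)   # transition: state 0, action 1 -> state 2, reward 1
```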

The "timeless" information is that x,a led to y,r. We can save these 4 values and "replay" the experience many times, with improved values of Q.
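A sketch of saving and replaying those four values; the buffer size, batch size, and the generic update_fn(x, a, y, r) hook are illustrative choices, not from the text:

```python
import random
from collections import deque

buffer = deque(maxlen=10000)   # assumed capacity

def remember(x, a, y, r):
    # save the four "timeless" values from one step of experience
    buffer.append((x, a, y, r))

def replay(batch_size, update_fn):
    # re-apply the Q-learning update to saved experience, now using
    # the current (improved) values of Q inside update_fn
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for (x, a, y, r) in batch:
        update_fn(x, a, y, r)

# toy usage: record two transitions, then replay them
remember(0, 'right', 1, 1.0)
remember(1, 'left', 0, 0.0)
calls = []
replay(8, lambda x, a, y, r: calls.append((x, a, y, r)))
print(len(calls))  # 2
```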

Read discussion of replay.

There are lots of interesting issues. For example, replay one transition (x,a) -> (y,r) a million times and the net learns only that transition, at the expense of everything else it has seen.

Also *random* action selection, which worked with lookup tables,
won't work with neural nets,
because the exemplars interfere with each other:
the net will just learn that all actions lead to nothing.
We will need a more intelligent control policy,
something like a Boltzmann distribution over the Q-values.
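One way to sketch such a policy: a Boltzmann (softmax) distribution gives higher-Q actions higher probability while every action keeps some chance of being tried. The temperature T here is an assumed tuning parameter (high T is nearly random, low T is nearly greedy):

```python
import math
import random

def boltzmann_probs(q_values, T=1.0):
    m = max(q_values)                       # subtract max for numerical stability
    exps = [math.exp((q - m) / T) for q in q_values]
    Z = sum(exps)
    return [e / Z for e in exps]

def choose_action(q_values, T=1.0):
    probs = boltzmann_probs(q_values, T)
    return random.choices(range(len(q_values)), weights=probs)[0]

probs = boltzmann_probs([1.0, 2.0, 0.0], T=1.0)
```

As T shrinks, the policy concentrates on the best action; as T grows, it approaches uniform random choice.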