School of Computing. Dublin City University.
We have defined the Back-propagation algorithm. But there is still a lot of work for the human to do in making this work.
The whole point of learning is not to have to design the network.
However, this only applies to the weights and thresholds.
There are still many design decisions. For example:
e.g. For the function:
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
you can't design a network with 1 input, 1 hidden unit, 1 output, and expect backprop to finish the job. There is only so much such a network can represent.
Design is part of the approximation process. Backprop finishes the details.
And design is not easy. It's not simply a matter of having thousands of hidden units. That would tend towards a lookup table with limited generalisation properties. It is an empirical loop:
repeat
    design network architecture
    use backprop to fill in weights
    if too much interference (can't form representation)
        increase number of hidden units
    if too little interference (can't predict new input)
        reduce number of hidden units
The network needs to be able to separate certain areas of the input space from other areas. A lot of work may have to be put into clever coding of the inputs to help the network do this.
3 inputs each take integer values 0 .. 9
1 input takes integer values 0 .. 8
1 input takes value 0 or 1
Possible input schemes:
One scheme that could work in the "sweet spot" is "1-of-C" encoding.
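As a minimal sketch of 1-of-C encoding, assume the five inputs listed above: each input gets one unit per possible value, and exactly one of those units is switched on. The function names and the choice of 0.0 / 1.0 as unit values are illustrative assumptions, not taken from the course code.

```python
def one_of_c(value, low, high):
    """Return a 1-of-C vector: one unit per possible integer value in low..high."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low}..{high}")
    units = [0.0] * (high - low + 1)
    units[value - low] = 1.0
    return units

# (low, high) range of each of the five inputs described above
RANGES = [(0, 9), (0, 9), (0, 9), (0, 8), (0, 1)]

def encode(inputs):
    """Concatenate the 1-of-C codes of all inputs into one network input vector."""
    if len(inputs) != len(RANGES):
        raise ValueError("expected one value per input")
    vector = []
    for value, (low, high) in zip(inputs, RANGES):
        vector.extend(one_of_c(value, low, high))
    return vector

x = encode([3, 0, 9, 8, 1])
print(len(x), sum(x))
```

Under this scheme the network would have 10 + 10 + 10 + 9 + 2 = 41 input units, of which exactly 5 are on for any input.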
Yes, we need a prediction machine that can generate a guess for unseen inputs. But how about inputs we saw before? Why "forget" anything that we once knew? Surely forgetting things is simply a disadvantage.
We could have an extra lookup table on the side, to store the results of every input about which we knew the exact output, and only consult the network for new, unseen input.
The question is: Would this table ever be used? That is, would we ever see the same input twice?
If the exact same input is never seen twice, our lookup table grows forever (it is not a finite-size data structure) and is never used. Even if it is (rarely) used, consider the computational cost of searching it.
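The question can be explored with a small sketch. Here the "network" is a dummy stand-in (a function that always guesses 0.0), the class name is hypothetical, and the inputs are assumed to be continuous values drawn at random. The sketch only counts how often the table is actually hit.

```python
import random

class TableThenNetwork:
    """Exact lookup table consulted first; the network is used only for unseen input."""

    def __init__(self, predict):
        self.predict = predict   # hypothetical stand-in for the trained network
        self.table = {}          # exact input -> known output
        self.hits = 0
        self.misses = 0

    def store(self, x, y):
        self.table[x] = y

    def query(self, x):
        if x in self.table:
            self.hits += 1
            return self.table[x]
        self.misses += 1
        return self.predict(x)

rng = random.Random(0)
model = TableThenNetwork(predict=lambda x: 0.0)  # dummy network: always guesses 0.0

# Store 10,000 exemplars with continuous inputs.
for _ in range(10_000):
    model.store(rng.uniform(0.0, 360.0), 1.0)

# Query 10,000 fresh continuous inputs.
for _ in range(10_000):
    model.query(rng.uniform(0.0, 360.0))

print("table size:", len(model.table), "hits:", model.hits, "misses:", model.misses)
```

A Python dict is a hash table, so the lookup itself is cheap in this sketch. The memory still grows with every stored exemplar, and with continuous inputs the table is effectively never hit.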
In the above, if the feedback the neural net gets is the same in the area 240 - 260 degrees, then it will develop weights and thresholds so that any continuous value in this zone generates roughly the same output.
On the other hand, if it receives different feedback in the zone around 245 - 255 degrees than outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being crossed at 245, and one type of output generated, and another threshold being crossed at 255, and another type of output generated.
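To make the "two thresholds" picture concrete, here is a sketch of what such a representation could look like with two sigmoid units. The weights are set by hand for illustration (they are not learned here), and the steepness value is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def band_output(angle, low=245.0, high=255.0, steepness=2.0):
    """Difference of two sigmoid units: one switches on at `low`, one at `high`."""
    rise = sigmoid(steepness * (angle - low))    # threshold crossed at 245
    fall = sigmoid(steepness * (angle - high))   # threshold crossed at 255
    return rise - fall                           # high only inside the zone

for angle in (240, 244, 246, 250, 254, 256, 260):
    print(angle, round(band_output(angle), 3))
```

A larger steepness gives a sharper edge to the zone; a smaller one gives a gradual transition, so nearby angles produce similar outputs.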
The network can learn to classify any area of the multi-dimensional input space in this way. This is especially useful for:
Network can start with random values and learn to get rid of these.
But of course that means it can learn to get rid of good values over time as well. It can't tell the difference.
If it doesn't see an exemplar for a while, it will forget it. For all it knows, it has just started learning, and the weights it has now are just a random initialisation! It keeps learning, wiping out anything too far in the past.
Learning = Forgetting!
e.g. Extreme Case - We show it one exemplar repeatedly. e.g. Show it "Input x leads to Output 1", 1 million times in a row. The "laziest" way for the network to represent this is to just send the weights to infinity (or minus infinity for Input negative), so Output = 1 no matter what the Input. i.e. Instead of "x -> 1" it learns "* -> 1"
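The extreme case can be sketched with a single sigmoid unit trained on one exemplar only. Assumptions made here for illustration: inputs lie in 0 .. 1, the exemplar is x = 0.5, the unit has one weight and one bias starting at zero, the learning rate is C = 1.0, and the exemplar is shown 100,000 times rather than 1 million to keep the run short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0        # assumed starting values (fixed, for reproducibility)
C = 1.0                # learning rate (assumed value)
x, target = 0.5, 1.0   # the single exemplar "x -> 1"

for _ in range(100_000):
    out = sigmoid(w * x + b)
    # Gradient of the squared error passed back through the sigmoid.
    delta = (target - out) * out * (1.0 - out)
    w += C * delta * x
    b += C * delta

def predict(inp):
    return sigmoid(w * inp + b)

# The unit was only ever shown x = 0.5, yet every input in 0 .. 1 now maps close to 1.
for inp in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(inp, round(predict(inp), 3))
```

The weight and bias keep growing for as long as the exemplar is repeated, which is the "* -> 1" behaviour described above.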
If we show it "x -> 1" a million times, then all weights may be recruited to help "x -> 1". Normally, if we show it "x -> 1" then it does have an effect on all weights, but this effect is countered by the effects of other exemplars. The way the net resolves this tension is by specialisation, where some weights are more-or-less irrelevant in some areas of the input space. Since they have little (though, if outputs are continuous, it will always be at least non-zero, no matter how tiny) effect on the error, the backprop algorithm ensures they are hardly modified. Then when we show it "x -> 1" once, it does have an effect on the weight, but the effect is negligible.
How does the process of specialising work? - As the net learns, it finds that for each weight, the weight has more effect on E for some exemplars than others. It is modified more as a result of the backprop from those exemplars, making it even more influential on them in the future, and making the backprop from other exemplars progressively less important.
First, exemplars should have a broad spread. Show it "x -> y" alright, but if you want it to learn that some things do not lead to y you must show it explicitly that "(NOT x) -> (NOT y)". e.g. In learning behaviour, some actions lead to good things happening, some to bad things, but most actions lead to nothing happening. If we only show it exemplars where something happened, it will predict that everything leads to something happening, good or bad. We must show it the "noise" as well.
How do we make sure it learns and doesn't forget? - If exemplars come from training set, we just make sure we keep re-showing it old exemplars. But exemplars may come from the world. So it forgets old and rare experiences.
One solution is to have learning rate C decline over time. But then it doesn't learn new experiences.
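The trade-off can be seen in a deliberately tiny stand-in for a network: a single weight nudged towards each target in turn. The schedule C = 1/t is one assumed way of making C decline; other schedules are possible.

```python
def track(targets, rate_at):
    """Delta-rule update of a single weight w towards each target in turn."""
    w = 0.0
    for t, y in enumerate(targets, start=1):
        w += rate_at(t) * (y - w)
    return w

# Old experience: the target is 1. New experience: the target becomes 0.
targets = [1.0] * 1000 + [0.0] * 1000

constant = track(targets, lambda t: 0.1)        # C stays at 0.1
declining = track(targets, lambda t: 1.0 / t)   # C = 1/t (assumed schedule)

print("constant C :", round(constant, 3))
print("declining C:", round(declining, 3))
```

With constant C the weight ends up close to 0: the old experience is forgotten. With C = 1/t the weight is the running mean of all targets, so it ends near 0.5: the old experience is retained, but the new experience is only partly learned.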
Possible solution if exemplars come from world is internal memory and replay of old experiences. Not remembering exemplars as lookup table, but remembering them to repeatedly push through network. But same question as before - Does the list grow forever?
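One possible answer to the "does the list grow forever?" question is to cap the memory at a fixed size. The sketch below uses reservoir sampling so that every exemplar seen so far is equally likely to be in memory. This is an assumption chosen for illustration, not a method given in these notes; the class and function names are hypothetical, and `train_step` stands for one backprop update on one exemplar.

```python
import random

class ReplayMemory:
    """Fixed-size memory of past exemplars, filled by reservoir sampling."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = rng or random.Random()

    def add(self, exemplar):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(exemplar)
        else:
            # Keep the new exemplar with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = exemplar

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def learn_with_replay(train_step, stream, memory, replays):
    """Train on each new exemplar, then push a few remembered ones through again."""
    for exemplar in stream:
        train_step(exemplar)
        memory.add(exemplar)
        for old in memory.sample(replays):
            train_step(old)

# Usage with a dummy train_step that just records what it was shown.
shown = []
memory = ReplayMemory(capacity=100, rng=random.Random(1))
learn_with_replay(shown.append, range(1000), memory, replays=4)
print("exemplars from world:", memory.seen, "held in memory:", len(memory.items))
```

The memory stays at 100 exemplars however long the stream runs. The price is that most old experiences are eventually dropped from memory, so rare ones can still be forgotten.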
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)
The following uses the C++ code for neural network as function approximator.
1 real input, n hidden, 1 real output.
Never sees the same exemplar twice!
The network after having seen 1 million exemplars (top) and 5 million (bottom):
The learner of this function was an initially-random neural network with 1 input, 12 hidden units, and 1 output.
It has difficulty representing f because there are too few hidden units, so it can only form a crude representation.
Remember the network has not increased in size to store this more accurate representation. It still has just 12 hidden units. It has merely adjusted its weights.
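The experiment can be imitated on a small scale with the sketch below. It is not the C++ code referred to above. Assumptions made for illustration: x is drawn from 0 .. 2*pi and scaled to -1 .. 1 before entering the network, the hidden units are sigmoids, the output unit is linear, the learning rate is C = 0.02, and only 50,000 exemplars are shown.

```python
import math
import random

def f(x):
    return math.sin(x) + math.sin(2 * x) + math.sin(5 * x) + math.cos(x)

def sigmoid(z):
    # Numerically stable form that avoids overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class Net:
    """1 real input, n sigmoid hidden units, 1 linear real output."""

    def __init__(self, n_hidden, rng):
        self.w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # input -> hidden
        self.b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden thresholds
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # hidden -> output
        self.b2 = 0.0

    def hidden(self, u):
        return [sigmoid(w * u + b) for w, b in zip(self.w1, self.b1)]

    def output(self, u):
        return sum(w * h for w, h in zip(self.w2, self.hidden(u))) + self.b2

    def learn(self, u, target, C):
        """One backprop step on one exemplar; the network never changes size."""
        h = self.hidden(u)
        out = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        err = target - out
        for j, hj in enumerate(h):
            back = err * self.w2[j] * hj * (1.0 - hj)  # error passed back to hidden unit j
            self.w2[j] += C * err * hj
            self.w1[j] += C * back * u
            self.b1[j] += C * back
        self.b2 += C * err

def scale(x):
    return x / math.pi - 1.0   # map 0 .. 2*pi onto -1 .. 1

def mean_squared_error(net, points=200):
    xs = [2 * math.pi * i / (points - 1) for i in range(points)]
    return sum((f(x) - net.output(scale(x))) ** 2 for x in xs) / points

rng = random.Random(0)
net = Net(n_hidden=12, rng=rng)
before = mean_squared_error(net)
for _ in range(50_000):
    x = rng.uniform(0.0, 2 * math.pi)   # a fresh random exemplar every time
    net.learn(scale(x), f(x), C=0.02)
after = mean_squared_error(net)
print("hidden units:", len(net.w1), "error before:", round(before, 3), "after:", round(after, 3))
```

Changing `n_hidden` and re-running is a small version of the empirical design loop: the weights are filled in by backprop, while the number of hidden units remains a design decision.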