We are actually going to get rid of the thresholds for the moment (explained later), so that we have:
We sent in an input, and it generated, in the output nodes,
a vector of outputs yk.
The correct answer is the vector of numbers Ok.
The error term is:
We take the squares of errors, otherwise positive and negative errors may cancel each other out.
There are other possible measures of error (recall Distance in n-dimensions) but we can agree that if this measure -> 0 then all other measures of error -> 0
Q. Prove that if this measure of E = 0 then yk = Ok for all k.
To look at how to reduce the error, we look at how the error changes as we change the weights. We start at the layer immediately before the output. Working out the effects of earlier layers will be more complex.
First we can write total error as a sum of the errors at each node k:
E = Σ k Ekwhere Ek = 1/2 (yk - Ok)2
Now note that
yk, xk and wjk
each only affect the error at one particular output node k
(they only affect Ek).
So from the point of view of these 3 variables, total error:
E = (a constant) + (error at node k)hence:
(derivative of total error E with respect to any of these 3 variables) = 0 + (derivative of error at node k)e.g.
∂E/∂yk = 0 + ∂Ek/∂ykWe can see how the error changes as yk changes, or as xk changes. But note we can't change yk or xk - at least not directly. They follow in a predetermined way from the previous inputs and weights.
But we can change wjk
As we work backwards, the situation changes. yj feeds forward into all of the output nodes. Since:
E = (sum of errors at k)we get:
(derivative of E) = (sum of derivatives of error at k)xj and wij then only affect yj (though yj affects many things).
We can't (directly) change yj or xj
But we can change wij
To spell it out:
∂E/∂yj = Σ k ∂Ek/∂yj
= Σ k ∂Ek/∂xk ∂xk/∂yj
= Σ k ∂E/∂xk ∂xk/∂yj
Now we have an equation for each - how error changes as you change the weight.
Note some things:
Now, to reduce error:
On LHS, slope is negative:
Move right to reduce error:
W := W - C
= W - C (negative quantity)
= W + (positive quantity)
Hence the same update rule works for both positive and negative slopes:
W := W - C
The constant C is the LEARNING RATE 0 < C < 1.
Compare Perceptron Learning Rule.
Note that is large when:
What is also often done is that the learning rate C starts high and declines as learning goes on.
Let us consider why a "noisy" gradient descent (large jumps, large weight changes, including possible jumps uphill) - where the noise dies down as time goes on (approach area of global minimum, smaller jumps, smaller weight changes, settle in minimum) - is often a better idea than a strict downhill descent. Consider a landscape like the following:
If the ball strictly moves downhill, and it starts at a random point, it is more or less equally likely to end up in the local minimum as in the global one.
If we shake the whole system, however, we are more likely to shake it from local to global than vice versa. If we shake violently, getting it out of minima becomes very probable, but at the cost of making it more likely to actually shake it out of the global minimum into a local one. If we shake gently, getting it out of minima is harder but when it does it is more likely to go from local to global.
So what we do is start by shaking violently, and gradually shake less as time goes on.
It seems unsatisfactory that we have a method that cannot guarantee to find the global minimum.
The problem comes from the blindness of our exploration. Below is the actual landscape. But we are never actually shown the overall shape of the landscape. All we do is suggest points on the X-axis (weights), and are told the point on the Y-axis. It's like we have a torch that shows us just the point on which we are standing, and everything else is in darkness.
The problem is we do not actually roll down hill (continuously). Rather we make a finite number of jumps, and we only ever explore a finite number of points.
We can repeat this for lots of points, but we need an infinite number of points to be sure that we have explored the whole curve.
If the curve is sufficiently fine-detailed (recall discussion of chaotic function), then we are not guaranteed to find that global minimum without an infinite amount of exploration:
X-axis - weight parameter.
Y-axis - error. Change weight parameter to find minimum error.
Hard to find minimum error unless program can see the whole function (as human can here).
If explore by finite sampling of a few points, may never find the pit.
Note the pit cannot have vertical sides. To be a function (one y for each x) its sides must have slope, no matter how steep.
Image thanks to Victor Terron, replacing my poorly hand-drawn version.
Note if by chance we land in the pit, a point on one of its steep walls, we will find very steep slope, hence big change in w. Hence we jump out of the pit!
In any case, we can make the mouth of the pit arbitrarily small, reducing to near zero the prob. of ever finding it no matter what you use. The point remains that no algorithm of finite sampling can ever guarantee to find the global optimum.
We will return to this issue with evolutionary methods, which are even more blind. At least here, we have a measure of E and a measure of
In a genetic algorithm we do not even have this information. But of course that is the whole point of these methods - to be able to learn despite the fact that we have less and less information / feedback.
With single-layer nets, for a function that is linearly separable, where we know that some configuration of the network can separate them, there can be a Convergence Proof, in the sense that we can prove that the network will learn to separate the exemplars.
For a multi-layer net, what does a Convergence Proof mean? That we can approximate the function f? To what level of accuracy? Anyway, which function? A more complex function requires a larger network. Maybe with the current network architecture we can only represent f crudely. Convergence Proof can prove that it converges to something, i.e. stops changing. But perhaps to local optimum rather than global optimum.
Rumelhart et al, 1986, do not have a proof like Rosenblatt, but can only argue for back-prop as a heuristic, which normally (but not always) works well.
The situation is summed up in Back Propagation is Sensitive to Initial Conditions, John F. Kolen, Jordan B. Pollack, 1990:
Rumelhart et al. made a confident statement: for many tasks, "the network rarely gets stuck in poor local minima that are significantly worse than the global minima."(p. 536)
The convergence claim was based solely upon their empirical experience with the back propagation technique. Since then, Minsky and Papert (1988) have argued that there exists no proof of convergence for the technique, and several researchers (Judd 1988; Blum and Rivest 1988; Kolen 1988) have found that the convergence time must be related to the difficulty of the problem, otherwise an unsolved computer science question (P=NP) would finally be answered. We do not wish to make claims about convergence of the technique in the limit (with vanishing step-size), or the relationship between task and performance, but wish to talk about a pervasive behavior of the technique which has gone unnoticed for several years: the sensitivity of back propagation to initial conditions.
Kolen and Pollack basically point out that, as a heuristic, it will find some local optimum. Given the same training set, which local optimum it finds is decided entirely by the initial weights. In general it is not feasible to perform a global search to obtain the optimal set of weights.
More on that point. If we did not have a value for we could in theory search for the correct network by just doing an exhaustive search on all possible sets of weights. Search space too big.
constrains our search, but does so by limiting from the start the solutions we can find.
Also worth reading is No Harm Intended: A Review of the Perceptrons expanded edition, Pollack, 1989, where he points out that this is just normal in AI (to have a heuristic that cannot search the whole space):
Minsky and Papert point out, rightly, that Rumelhart et al have no result, like the Perceptron Convergence Theorem (Rosenblatt, 1962), that the generalized delta rule is guaranteed to find solutions when they exist, and that "little has been proved about the range of problems upon which [the generalized delta rule] works both efficiently and dependably." (p. 260.)
On the other hand, we also know that for certain classes of very hard problems, there are no algorithms which scale significantly better than exhaustive search, unless they are approximations. But there are many such impractical and approximate methods which are in practical use worldwide. Despite the fact that the Traveling Salesman Problem is NP-complete, salespersons do travel, and often optimally.
Note that yk and Ok are both 0 to 1.
So is from -1 to 1.
is from -1 to 1.
So our actual update of the final layer weights wjk is from:
wjk := wjk - C
wjk := wjk + C
definition of threshold:
sigmoid(x-t) instead of just sigmoid(x)
Biasing: Just make the thresholds into weights, on a link with constant input -1.
So instead of:
we change it to:
i.e. Every single hidden unit (and every single output unit) has an extra input line coming apparently from nowhere with constant input -1.
Remember we need thresholds, otherwise the sigmoid function is centred on zero. e.g. If no thresholds, then, no matter what the exemplars are, no matter what the I/O relationship is, and no matter what the weights are, if all inputs are 0, then output of every single hidden node is ..
Q. Is what?
Similarly, for a node that specialises on n of the inputs (weight = 0 for others), then if that subset of inputs are all 0, that node's output must be ..
We said we don't have to design the network, and this is true, as regards weights and thresholds.
However, we still had to make sure that there were enough input nodes to actually receive the input vector, and same with output. And there are further design decisions we made before learning even started:
A neural net is an approximation of a non-linear continuous function.
How good that approximation is depends crucially on how you designed it in the first place.
Backprop can't do everything.
e.g. For the function:
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)you can't design a network with 1 input, 1 hidden unit, 1 output, and expect backprop to finish the job. There is only so much such a network can represent.
With any neural net, you should design it with the approximation in mind. Design it with some a priori estimate of what the approximation will look like.
Design is part of the approximation process.
Backprop finishes the details.
And it's not simply a matter of having thousands of hidden units. That would tend towards a lookup table with limited generalisation properties. It is an empirical loop:
repeat design network architecture use backprop to fill in weights if too much interference (can't form representation) increase number of hidden units if too little interference (can't predict new input) reduce number of hidden unitsFurther reading - Designing the Inputs.
See Using net as function approximator.
1 real input, n hidden, 1 real output.
Never sees the same exemplar twice!
Consider the function:
f(x) = sin(x) + sin(2x) + sin(5x) + cos(x)that we talked about earlier. We can now reveal that the learner of this function was an initially-random neural network with 1 input, 12 hidden units, and 1 output. We saw its performance after 1 million exemplars, and how it had improved by the time it had seen 5 million exemplars.
It has difficulty representing f because there are basically too few hidden units, so it can only form a crude representation.
Remember the network has not increased in size to store this more accurate representation. It still has just 12 hidden units. It has merely adjusted its weights.